Every morning at 7:00 AM, a script wakes up, scrapes the top stories from Hacker News, pulls the week's hottest GitHub repositories, runs a trend analysis, generates a formatted Markdown digest, and publishes it to a GitHub Pages website — all without me touching a keyboard.
I've been running this for a few weeks. Here's exactly how it works, why I built it this way, and what I'd do differently. All code is open source.
The Architecture
Four scripts. Each one does exactly one thing. They chain together, each reading the previous step's output from disk.
No framework. No database. No dependencies outside the standard library. Everything is just files on disk, connected by Python.
Step 1: Fetching Data
Hacker News
HN has a free public Firebase API that's surprisingly well-structured. I fetch the top story IDs, then pull details for each one:
```python
import json
import urllib.request

def fetch_hn_stories(limit=15):
    url = "https://hacker-news.firebaseio.com/v0/topstories.json"
    with urllib.request.urlopen(url, timeout=10) as r:
        story_ids = json.loads(r.read())[:limit * 2]  # fetch extra, filter later
    stories = []
    for sid in story_ids:
        if len(stories) >= limit:
            break
        detail_url = f"https://hacker-news.firebaseio.com/v0/item/{sid}.json"
        with urllib.request.urlopen(detail_url, timeout=5) as r:
            item = json.loads(r.read())
        if item.get("type") == "story" and item.get("url"):
            stories.append({
                "id": sid,
                "title": item.get("title", ""),
                "url": item.get("url", ""),
                "score": item.get("score", 0),
                "descendants": item.get("descendants", 0),
                "by": item.get("by", ""),
            })
    return sorted(stories, key=lambda x: x["score"], reverse=True)
```
No auth required. I fetch 2× the target count because some items are "Ask HN" or "Show HN" posts without a URL, which I skip.
GitHub Trending
GitHub doesn't have an official "trending" API, but the search API with
created:>YYYY-MM-DD + sort=stars gives a close approximation:
```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timedelta

def fetch_github_trending(days=7, limit=10):
    since = (datetime.now() - timedelta(days=days)).strftime("%Y-%m-%d")
    params = urllib.parse.urlencode({
        "q": f"created:>{since}",
        "sort": "stars",
        "order": "desc",
        "per_page": limit,
    })
    url = f"https://api.github.com/search/repositories?{params}"
    req = urllib.request.Request(url, headers={
        "User-Agent": "content-producer/1.0",
        "Accept": "application/vnd.github.v3+json",
    })
    with urllib.request.urlopen(req, timeout=15) as r:
        data = json.loads(r.read())
    # Keep only the fields the digest needs
    return [{
        "name": repo["full_name"],
        "stars": repo["stargazers_count"],
        "language": repo.get("language"),
        "url": repo["html_url"],
    } for repo in data.get("items", [])]
```
Rate limits: GitHub's unauthenticated search API allows 10 requests/minute.
Running once daily is nowhere near that. If you scale up, add a token via the
Authorization header for 30 req/min.
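If you do add a token, it's a one-header change. A minimal sketch (the `GITHUB_TOKEN` environment variable name is my convention, not something from the pipeline):

```python
import os
import urllib.request

def github_request(url):
    """Build a GitHub API request, attaching a token only if one is configured."""
    headers = {
        "User-Agent": "content-producer/1.0",
        "Accept": "application/vnd.github.v3+json",
    }
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        # Authenticated search requests get 30 req/min instead of 10
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)
```

Because the token is optional, the same code runs unauthenticated on a fresh clone and picks up the higher limit as soon as the variable is set.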
Step 2: Trend Analysis
analyzer.py does three things: keyword extraction, category classification, and
cross-source insight generation.
The category classifier is a hand-rolled lookup against a keyword dict — nothing fancy, but it works for the use case:
```python
from collections import Counter

TECH_CATEGORIES = {
    "AI/LLM": ["ai", "llm", "gpt", "claude", "agent", "rag", "transformer", ...],
    "Infrastructure": ["kubernetes", "docker", "terraform", "serverless", ...],
    "Security": ["security", "vulnerability", "cve", "exploit", ...],
    # ...
}

def categorize_keywords(keywords):
    category_scores = Counter()
    keyword_list = [kw.lower() for kw, _ in keywords]
    for category, terms in TECH_CATEGORIES.items():
        for term in terms:
            for kw in keyword_list:
                if term in kw or kw in term:
                    category_scores[category] += 1
    return category_scores.most_common()
```
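The keyword extraction step feeding into this isn't shown above; a minimal version (my sketch of the idea, not the pipeline's exact code) just counts non-stopword tokens across story titles:

```python
import re
from collections import Counter

# A tiny stopword list; the real one would be longer
STOPWORDS = {"the", "a", "an", "and", "of", "for", "to", "in", "on",
             "with", "is", "how", "why", "what", "your", "new", "from"}

def extract_keywords(titles, top_n=10):
    """Tokenize story titles and return the most common non-stopword terms."""
    counts = Counter()
    for title in titles:
        # Keep alphanumerics plus chars common in tech terms (c++, c#, .net)
        for word in re.findall(r"[a-z0-9+#.]+", title.lower()):
            if len(word) > 2 and word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(top_n)
```

The resulting (keyword, count) pairs are exactly the shape categorize_keywords() expects.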
The output is a JSON file per day. Example for March 23, 2026:
```json
{
  "date": "2026-03-23",
  "hn": {
    "total": 15,
    "avg_score": 130.1,
    "hot_categories": [["Security", 8], ["AI/LLM", 6], ["Infrastructure", 3]],
    "top_keywords": [["cloudflare", 4], ["security", 3], ["windows", 3], ...]
  },
  "github": {
    "top_languages": [["Python", 4], ["TypeScript", 3], ...],
    "top_repos": [{"name": "HKUDS/ClawTeam", "stars": 2793, ...}]
  },
  "insights": [
    "HN community discussion focused on Security / AI/LLM",
    "Most-starred new project: HKUDS/ClawTeam (2,793 ⭐)"
  ]
}
```
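The insights array is nothing more than string templates over the analysis dict. Roughly (a sketch of the idea, not the exact code):

```python
def build_insights(analysis):
    """Turn the day's analysis dict into short human-readable bullets."""
    insights = []
    cats = analysis["hn"]["hot_categories"]
    if cats:
        top_two = " / ".join(name for name, _ in cats[:2])
        insights.append(f"HN community discussion focused on {top_two}")
    repos = analysis["github"]["top_repos"]
    if repos:
        r = repos[0]
        insights.append(f"Most-starred new project: {r['name']} ({r['stars']:,} ⭐)")
    return insights
```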
Step 3: Publishing to GitHub Pages
The final step copies the generated data files to a separate github-pages repo,
commits, and pushes. The GitHub Pages site is a static HTML file that loads the JSON
client-side with vanilla JS — no build step, no bundler.
```python
def publish_to_github_pages():
    # Copy data files
    for json_file in (DATA_DIR / "docs" / "data").glob("*.json"):
        shutil.copy(json_file, GITHUB_PAGES_DATA_DIR / json_file.name)
    # Git operations in the github-pages directory
    run_command("git add data/", cwd=GITHUB_PAGES_DIR)
    run_command(f'git commit -m "auto: update data {today}"', cwd=GITHUB_PAGES_DIR)
    run_command("git push", cwd=GITHUB_PAGES_DIR)
```
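The run_command helper isn't shown; a minimal version (my sketch) is a thin wrapper over subprocess.run that raises on failure:

```python
import subprocess

def run_command(cmd, cwd=None):
    """Run a shell command in the given directory, raising if it fails."""
    result = subprocess.run(cmd, shell=True, cwd=cwd,
                            capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"{cmd!r} failed: {result.stderr.strip()}")
    return result.stdout
```

One wrinkle to be aware of: git commit exits non-zero when there's nothing to commit, so a version like this would need to tolerate that case on days when the data hasn't changed.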
SSH tip: If your git remote is set up with SSH
(git@github.com:...), make sure your deploy machine has the SSH key loaded.
On macOS, add it to the keychain: ssh-add --apple-use-keychain ~/.ssh/id_ed25519
The Frontend: Vanilla JS on GitHub Pages
The daily.html page fetches a JSON index of all available reports, then loads
the selected one on demand. No React, no dependencies — just fetch():
```javascript
// Load report index
const reports = await fetch('/data/index.json').then(r => r.json());

// Render the latest one
const latest = reports[0];
const data = await fetch(`/data/${latest.date}.json`).then(r => r.json());
renderDigest(data);
```
The whole site is ~400 lines of HTML+CSS+JS with zero build pipeline.
Deploys in under 5 seconds via git push.
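The index the frontend fetches has to come from somewhere; generating it is a few lines of Python in the publish step. A sketch, assuming each report file is named YYYY-MM-DD.json:

```python
import json
from pathlib import Path

def build_index(data_dir):
    """List report files newest-first so reports[0] is always the latest."""
    dates = sorted(
        (p.stem for p in Path(data_dir).glob("*.json") if p.stem != "index"),
        reverse=True,
    )
    index = [{"date": d} for d in dates]
    (Path(data_dir) / "index.json").write_text(json.dumps(index, indent=2))
    return index
```

Because ISO dates sort lexicographically, a plain reverse string sort is enough to put the newest report first.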
Scheduling: WorkBuddy Automations
I run this on a 2014 MacBook Pro using WorkBuddy's built-in automation scheduler. The RRULE is simple:
```toml
[automation]
name = "daily-content-gen"
rrule = "FREQ=WEEKLY;BYDAY=MO,TU,WE,TH,FR,SA,SU;BYHOUR=7;BYMINUTE=0"
status = "ACTIVE"

[automation.prompt]
content = """
Run content production pipeline:
1. cd /path/to/content-producer
2. python3 generator.py
3. python3 analyzer.py
4. python3 publish_to_github_pages.py
5. git add + commit + push
"""
```
If you want a pure cron approach, the equivalent is:
```bash
0 7 * * * cd /path/to/content-producer && python3 generator.py && python3 analyzer.py && python3 publish_to_github_pages.py && git add . && git commit -m "auto: daily report $(date +%Y-%m-%d)" && git push
```
What I'd Do Differently
- Add a GitHub token — unauthenticated API calls work fine for once-daily runs, but GitHub will occasionally rate-limit you if the machine's IP is shared.
- Cache data locally — if the publish step fails, I lose the day's analysis. Writing to a local SQLite DB first would make retries trivial.
- Add Reddit/lobste.rs — HN skews toward certain topics. A second source would surface more diverse stories.
- LLM summarization — right now the "insights" are keyword-frequency heuristics. Passing the top stories to an LLM API would produce much better natural-language summaries.
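The caching idea from the second bullet is small enough to sketch: write each day's analysis into a local SQLite database before attempting the publish, so a failed push can be retried from disk (table name and schema here are my invention):

```python
import json
import sqlite3

def cache_analysis(db_path, date, analysis):
    """Persist the day's analysis locally; publish can be retried from here."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reports (date TEXT PRIMARY KEY, body TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO reports VALUES (?, ?)",
        (date, json.dumps(analysis)),
    )
    conn.commit()
    conn.close()

def load_analysis(db_path, date):
    """Fetch a cached analysis by date, or None if that day was never cached."""
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "SELECT body FROM reports WHERE date = ?", (date,)
    ).fetchone()
    conn.close()
    return json.loads(row[0]) if row else None
```

INSERT OR REPLACE keyed on the date makes reruns idempotent: re-running the pipeline for the same day just overwrites that day's row.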
Current state: The pipeline runs daily, produces consistent output, and costs $0/month (GitHub Pages is free, APIs are unauthenticated). It's not impressive engineering — it's just a boring, reliable script that does one thing well.
The Output
Every day, the pipeline produces something like this:
- posts/2026-03-23.md — the formatted digest (committed to GitHub)
- data/analysis-2026-03-23.json — structured trend data
- https://citriac.github.io/daily.html — the updated website
You can see the live output at citriac.github.io/daily.html. The full source is at github.com/citriac/content-producer.
Wrapping Up
This took a weekend to build. The hardest part wasn't the code — it was deciding what not to build. No database. No API. No framework. Just files and scripts.
The value isn't the tech — it's the compounding. Every morning there's a new digest. Every week there's more historical data. Every month the trend analysis gets more useful. Boring infrastructure is good infrastructure.