Every morning at 7:00 AM, a script wakes up, scrapes the top stories from Hacker News, pulls the week's hottest GitHub repositories, runs a trend analysis, generates a formatted Markdown digest, and publishes it to a GitHub Pages website — all without me touching a keyboard.
I've been running this for a few weeks. Here's exactly how it works, why I built it this way, and what I'd do differently. All code is open source.
The Architecture
Four scripts. Each one does exactly one thing. They chain together, each reading the previous step's output from disk.
No framework. No database. No dependencies outside the standard library. Everything is just files on disk, connected by Python.
Step 1: Fetching Data
Hacker News
HN has a free public Firebase API that's surprisingly well-structured. I fetch the top story IDs, then pull details for each one:
```python
import json
import urllib.request

def fetch_hn_stories(limit=15):
    url = "https://hacker-news.firebaseio.com/v0/topstories.json"
    with urllib.request.urlopen(url, timeout=10) as r:
        story_ids = json.loads(r.read())[:limit * 2]  # fetch extra, filter later
    stories = []
    for sid in story_ids:
        if len(stories) >= limit:
            break
        detail_url = f"https://hacker-news.firebaseio.com/v0/item/{sid}.json"
        with urllib.request.urlopen(detail_url, timeout=5) as r:
            item = json.loads(r.read())
        if item.get("type") == "story" and item.get("url"):
            stories.append({
                "id": sid,
                "title": item.get("title", ""),
                "url": item.get("url", ""),
                "score": item.get("score", 0),
                "descendants": item.get("descendants", 0),
                "by": item.get("by", ""),
            })
    return sorted(stories, key=lambda x: x["score"], reverse=True)
```
No auth required. I fetch 2× the target count because some items are "Ask HN" or "Show HN" posts without a URL, which I skip.
GitHub Trending
GitHub doesn't have an official "trending" API, but the search API with
created:>YYYY-MM-DD + sort=stars gives a close approximation:
```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timedelta

def fetch_github_trending(days=7, limit=10):
    since = (datetime.now() - timedelta(days=days)).strftime("%Y-%m-%d")
    params = urllib.parse.urlencode({
        "q": f"created:>{since}",
        "sort": "stars",
        "order": "desc",
        "per_page": limit,
    })
    url = f"https://api.github.com/search/repositories?{params}"
    req = urllib.request.Request(url, headers={
        "User-Agent": "content-producer/1.0",
        "Accept": "application/vnd.github.v3+json",
    })
    with urllib.request.urlopen(req, timeout=15) as r:
        data = json.loads(r.read())
    # Keep only the fields the digest needs
    return [{
        "name": repo["full_name"],
        "stars": repo["stargazers_count"],
        "language": repo.get("language"),
        "url": repo["html_url"],
    } for repo in data.get("items", [])]
```
Rate limits: GitHub's unauthenticated search API allows 10 requests/minute.
Running once daily is nowhere near that. If you scale up, add a token via the
Authorization header for 30 req/min.
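If you do add a token, it's a one-header change. A minimal sketch (the `GITHUB_TOKEN` environment variable name is my convention, not something from the pipeline):

```python
import os
import urllib.request

def github_request(url):
    """Build a GitHub API request, attaching a token only if one is configured."""
    headers = {
        "User-Agent": "content-producer/1.0",
        "Accept": "application/vnd.github.v3+json",
    }
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        # Authenticated search requests get 30 req/min instead of 10
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)
```

Because the token is optional, the same code runs unauthenticated on a fresh clone and picks up the higher limit as soon as the variable is set.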
Step 2: Trend Analysis
analyzer.py does three things: keyword extraction, category classification, and
cross-source insight generation.
The category classifier is a hand-rolled lookup against a keyword dict — nothing fancy, but it works for the use case:
```python
from collections import Counter

TECH_CATEGORIES = {
    "AI/LLM": ["ai", "llm", "gpt", "claude", "agent", "rag", "transformer", ...],
    "Infrastructure": ["kubernetes", "docker", "terraform", "serverless", ...],
    "Security": ["security", "vulnerability", "cve", "exploit", ...],
    # ...
}

def categorize_keywords(keywords):
    category_scores = Counter()
    keyword_list = [kw.lower() for kw, _ in keywords]
    for category, terms in TECH_CATEGORIES.items():
        for term in terms:
            for kw in keyword_list:
                if term in kw or kw in term:
                    category_scores[category] += 1
    return category_scores.most_common()
```
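The keyword extraction step feeding into this isn't shown above; a minimal version (my sketch of the idea, not the pipeline's exact code) just counts non-stopword tokens across story titles:

```python
import re
from collections import Counter

# A tiny stopword list; the real one would be longer
STOPWORDS = {"the", "a", "an", "and", "of", "for", "to", "in", "on",
             "with", "is", "how", "why", "what", "your", "new", "from"}

def extract_keywords(titles, top_n=10):
    """Tokenize story titles and return the most common non-stopword terms."""
    counts = Counter()
    for title in titles:
        # Keep alphanumerics plus chars common in tech terms (c++, c#, .net)
        for word in re.findall(r"[a-z0-9+#.]+", title.lower()):
            if len(word) > 2 and word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(top_n)
```

The resulting (keyword, count) pairs are exactly the shape categorize_keywords() expects.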
The output is a JSON file per day. Example for March 23, 2026:
```json
{
  "date": "2026-03-23",
  "hn": {
    "total": 15,
    "avg_score": 130.1,
    "hot_categories": [["Security", 8], ["AI/LLM", 6], ["Infrastructure", 3]],
    "top_keywords": [["cloudflare", 4], ["security", 3], ["windows", 3], ...]
  },
  "github": {
    "top_languages": [["Python", 4], ["TypeScript", 3], ...],
    "top_repos": [{"name": "HKUDS/ClawTeam", "stars": 2793, ...}]
  },
  "insights": [
    "HN community discussion focused on Security / AI/LLM",
    "Most-starred new project: HKUDS/ClawTeam (2,793 ⭐)"
  ]
}
```
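The insights array is nothing more than string templates over the analysis dict. Roughly (a sketch of the idea, not the exact code):

```python
def build_insights(analysis):
    """Turn the day's analysis dict into short human-readable bullets."""
    insights = []
    cats = analysis["hn"]["hot_categories"]
    if cats:
        top_two = " / ".join(name for name, _ in cats[:2])
        insights.append(f"HN community discussion focused on {top_two}")
    repos = analysis["github"]["top_repos"]
    if repos:
        r = repos[0]
        insights.append(f"Most-starred new project: {r['name']} ({r['stars']:,} ⭐)")
    return insights
```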
Step 3: Publishing to GitHub Pages
The final step copies the generated data files to a separate github-pages repo,
commits, and pushes. The GitHub Pages site is a static HTML file that loads the JSON
client-side with vanilla JS — no build step, no bundler.
```python
def publish_to_github_pages():
    # Copy data files
    for json_file in (DATA_DIR / "docs" / "data").glob("*.json"):
        shutil.copy(json_file, GITHUB_PAGES_DATA_DIR / json_file.name)
    # Git operations in the github-pages directory
    run_command("git add data/", cwd=GITHUB_PAGES_DIR)
    run_command(f'git commit -m "auto: update data {today}"', cwd=GITHUB_PAGES_DIR)
    run_command("git push", cwd=GITHUB_PAGES_DIR)
```
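The run_command helper isn't shown; a minimal version (my sketch) is a thin wrapper over subprocess.run that raises on failure:

```python
import subprocess

def run_command(cmd, cwd=None):
    """Run a shell command in the given directory, raising if it fails."""
    result = subprocess.run(cmd, shell=True, cwd=cwd,
                            capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"{cmd!r} failed: {result.stderr.strip()}")
    return result.stdout
```

One wrinkle to be aware of: git commit exits non-zero when there's nothing to commit, so a version like this would need to tolerate that case on days when the data hasn't changed.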
SSH tip: If your git remote is set up with SSH
(git@github.com:...), make sure your deploy machine has the SSH key loaded.
On macOS, add it to the keychain: ssh-add --apple-use-keychain ~/.ssh/id_ed25519
The Frontend: Vanilla JS on GitHub Pages
The daily.html page fetches a JSON index of all available reports, then loads
the selected one on demand. No React, no dependencies — just fetch():
```javascript
// Load report index
const reports = await fetch('/data/index.json').then(r => r.json());

// Render the latest one
const latest = reports[0];
const data = await fetch(`/data/${latest.date}.json`).then(r => r.json());
renderDigest(data);
```
The whole site is ~400 lines of HTML+CSS+JS with zero build pipeline.
Deploys in under 5 seconds via git push.
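The index the frontend fetches has to come from somewhere; generating it is a few lines of Python in the publish step. A sketch, assuming each report file is named YYYY-MM-DD.json:

```python
import json
from pathlib import Path

def build_index(data_dir):
    """List report files newest-first so reports[0] is always the latest."""
    dates = sorted(
        (p.stem for p in Path(data_dir).glob("*.json") if p.stem != "index"),
        reverse=True,
    )
    index = [{"date": d} for d in dates]
    (Path(data_dir) / "index.json").write_text(json.dumps(index, indent=2))
    return index
```

Because ISO dates sort lexicographically, a plain reverse string sort is enough to put the newest report first.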
Scheduling: WorkBuddy Automations
I run this on a 2014 MacBook Pro using WorkBuddy's built-in automation scheduler. The RRULE is simple:
```toml
[automation]
name = "daily-content-gen"
rrule = "FREQ=WEEKLY;BYDAY=MO,TU,WE,TH,FR,SA,SU;BYHOUR=7;BYMINUTE=0"
status = "ACTIVE"

[automation.prompt]
content = """
Run content production pipeline:
1. cd /path/to/content-producer
2. python3 generator.py
3. python3 analyzer.py
4. python3 publish_to_github_pages.py
5. git add + commit + push
"""
```
If you want a pure cron approach, the equivalent is:
```bash
0 7 * * * cd /path/to/content-producer && python3 generator.py && python3 analyzer.py && python3 publish_to_github_pages.py && git add . && git commit -m "auto: daily report $(date +%Y-%m-%d)" && git push
```
What I'd Do Differently
- Add a GitHub token — unauthenticated API calls work fine for once-daily runs, but GitHub will occasionally rate-limit you if the machine's IP is shared.
- Cache data locally — if the publish step fails, I lose the day's analysis. Writing to a local SQLite DB first would make retries trivial.
- Add Reddit/lobste.rs — HN skews toward certain topics. A second source would surface more diverse stories.
- LLM summarization — right now the "insights" are keyword-frequency heuristics. Passing the top stories to an LLM API would produce much better natural-language summaries.
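The caching idea from the second bullet is small enough to sketch: write each day's analysis into a local SQLite database before attempting the publish, so a failed push can be retried from disk (table name and schema here are my invention):

```python
import json
import sqlite3

def cache_analysis(db_path, date, analysis):
    """Persist the day's analysis locally; publish can be retried from here."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reports (date TEXT PRIMARY KEY, body TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO reports VALUES (?, ?)",
        (date, json.dumps(analysis)),
    )
    conn.commit()
    conn.close()

def load_analysis(db_path, date):
    """Fetch a cached analysis by date, or None if that day was never cached."""
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "SELECT body FROM reports WHERE date = ?", (date,)
    ).fetchone()
    conn.close()
    return json.loads(row[0]) if row else None
```

INSERT OR REPLACE keyed on the date makes reruns idempotent: re-running the pipeline for the same day just overwrites that day's row.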
Current state: The pipeline runs daily, produces consistent output, and costs $0/month (GitHub Pages is free, APIs are unauthenticated). It's not impressive engineering — it's just a boring, reliable script that does one thing well.
The Output
Every day, the pipeline produces something like this:
- posts/2026-03-23.md — the formatted digest (committed to GitHub)
- data/analysis-2026-03-23.json — structured trend data
- https://citriac.github.io/daily.html — the updated website
You can see the live output at citriac.github.io/daily.html. The full source is at github.com/citriac/content-producer.
Wrapping Up
This took a weekend to build. The hardest part wasn't the code — it was deciding what not to build. No database. No API. No framework. Just files and scripts.
The value isn't the tech — it's the compounding. Every morning there's a new digest. Every week there's more historical data. Every month the trend analysis gets more useful. Boring infrastructure is good infrastructure.