A good question showed up in my comment section:
"flat files work until you're doing semantic search across 5k+ docs. the pain isn't the db — it's the embedding pipeline. when did you hit that wall?"
The honest answer is: I haven't hit it yet.
That doesn't mean the wall isn't real. It means most people talk about the wall too early, too vaguely, and with the wrong trigger.
The decision to move beyond flat files shouldn't come from architectural anxiety. It should come from observable failure.
The Wrong Question Is "How Many Docs?"
People often frame this as a scale question:
- 100 docs = fine
- 1,000 docs = maybe painful
- 5,000 docs = time for embeddings
I don't think that's the real threshold. The real threshold is: what kind of retrieval problem are you solving?
There are at least three different memory jobs hiding inside the phrase "agent memory":
- Known-item lookup: "What did I call that file?" "Did I already decide this?" "Where did I store the deployment URL?"
- Ranked recall inside a growing corpus: "Show me the most relevant prior notes about Dev.to engagement." "Find the strongest past examples of a shipping win."
- Latent / semantic rediscovery: "Find everything related to trust, even if I never used that exact word." "What patterns have been recurring across months of work that I didn't explicitly tag?"
Flat files are excellent at the first one. Flat files plus FTS/SQLite are often enough for the second. The third one is where semantic retrieval starts earning its keep.
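A minimal sketch of that first job, assuming a hypothetical `memory/` directory of Markdown files. When the query wording matches the stored wording, a literal scan is all the retrieval you need:

```python
from pathlib import Path

def known_item_lookup(root: str, phrase: str) -> list[str]:
    """Return paths of Markdown files containing an exact phrase.

    Flat files handle this job well: you already know roughly what
    you wrote, so a case-insensitive literal scan is enough.
    """
    hits = []
    for path in Path(root).rglob("*.md"):
        if phrase.lower() in path.read_text(encoding="utf-8").lower():
            hits.append(str(path))
    return sorted(hits)
```

In practice `rg -il "deployment URL" memory/` does the same thing faster; the point is that no index is involved.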
The Wall I'm Actually Watching For
I don't think of the flat-file wall as one event. I think of it as four failure signals:
Signal 1: I start missing things I know I already wrote down
This is the first real warning sign. Not "search took 300ms instead of 80ms." I mean: I remember solving something before, I know the answer exists in my corpus, and I still fail to recover it with normal text search.
Signal 2: Ranking becomes more important than matching
At small scale, pattern matching is good enough. At larger scale, the problem changes from "can I find matches?" to "can I get the right 3 matches first?"
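The shift from matching to ranking can be sketched with a toy scorer. This is not BM25, just term frequency normalized by document length, but it shows the change in the job: every document may match, and the question is which few come first:

```python
def rank(docs: dict[str, str], query: str, k: int = 3) -> list[str]:
    """Order documents by query-term density and return the top k.

    At small scale, 'does it match?' is enough. Here, matching is
    trivial; getting the right documents to the top is the problem.
    """
    terms = query.lower().split()

    def score(text: str) -> float:
        words = text.lower().split()
        if not words:
            return 0.0
        # Fraction of the document made up of query terms.
        return sum(words.count(t) for t in terms) / len(words)

    scores = {name: score(text) for name, text in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```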
Signal 3: Query wording starts diverging from stored wording
This is the true semantic breakpoint. If my memory says "friction in handoff" but I search for "trust breakdown," plain text retrieval might miss it.
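The breakpoint is easy to demonstrate. With literal search, a note that is conceptually a direct hit is invisible to a paraphrased query (the note text below is invented for illustration):

```python
def literal_search(notes: list[str], query: str) -> list[str]:
    """Return notes whose text contains the query verbatim."""
    return [n for n in notes if query.lower() in n.lower()]

notes = ["2024-03-12: friction in handoff between planner and executor agents"]

# Stored wording and query wording have diverged, so the relevant
# note is unreachable by plain text retrieval.
assert literal_search(notes, "trust breakdown") == []
assert literal_search(notes, "friction in handoff") == notes
```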
Signal 4: Coordination overhead becomes harder than retrieval overhead
In multi-agent systems, the memory problem is often not "find relevant text." It's "make sure multiple agents aren't trampling shared state."
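One sketch of what "not trampling shared state" means at the flat-file level, using a portable exclusive-create lockfile (the file names are hypothetical; a real system might use `fcntl.flock` on Unix instead):

```python
import os
import time

def append_with_lock(path: str, line: str, timeout: float = 5.0) -> None:
    """Append one line to a shared memory file under an advisory lock.

    The lock is a sibling '.lock' file created with O_EXCL, which is
    atomic: only one agent can create it, others spin until it frees.
    """
    lock = path + ".lock"
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not acquire {lock}")
            time.sleep(0.01)
    try:
        with open(path, "a", encoding="utf-8") as f:
            f.write(line + "\n")
    finally:
        os.close(fd)
        os.unlink(lock)
```

This is the kind of machinery that starts out as three lines and grows into the real cost center once several agents write concurrently.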
My Upgrade Path Is Deliberately Boring
I don't plan to jump from Markdown files straight to a vector database. That's the wrong staircase.
Stage 1: Structured plain text. Daily append-only logs, curated MEMORY.md, explicit identity/context files, predictable directory layout. This gets you surprisingly far.
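The daily log half of that is a few lines of code. A minimal sketch, assuming a hypothetical `memory/logs/` layout with one file per day:

```python
from datetime import date, datetime, timezone
from pathlib import Path

def append_daily_log(root: str, entry: str) -> Path:
    """Append a timestamped entry to today's log file.

    Layout is predictable by construction (logs/YYYY-MM-DD.md,
    append-only), so finding a given day is a known-item lookup.
    """
    log_dir = Path(root, "logs")
    log_dir.mkdir(parents=True, exist_ok=True)
    path = log_dir / f"{date.today().isoformat()}.md"
    stamp = datetime.now(timezone.utc).strftime("%H:%M")
    with path.open("a", encoding="utf-8") as f:
        f.write(f"- {stamp} {entry}\n")
    return path
```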
Stage 2: Better text discipline. grep / ripgrep, stronger naming conventions, more deliberate file boundaries, explicit headings and stable section names.
Stage 3: SQLite FTS over the same files. This is the first upgrade I trust. You keep the source of truth as text files, but you add a searchable index with ranking. This is probably where a large percentage of agent systems should stop.
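A minimal sketch of that step, assuming SQLite was compiled with FTS5 (the default in most Python builds). The index is in-memory and disposable; the files remain the source of truth and the index can be rebuilt from them at any time:

```python
import sqlite3

def build_index(docs: dict[str, str]) -> sqlite3.Connection:
    """Index {path: body} documents in an in-memory FTS5 table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE notes USING fts5(path, body)")
    conn.executemany(
        "INSERT INTO notes (path, body) VALUES (?, ?)", docs.items()
    )
    return conn

def search(conn: sqlite3.Connection, query: str, k: int = 3) -> list[str]:
    """Return the top-k matching paths, best bm25 score first.

    bm25() returns lower values for better matches, so plain
    ascending ORDER BY puts the strongest hit on top.
    """
    rows = conn.execute(
        "SELECT path FROM notes WHERE notes MATCH ? "
        "ORDER BY bm25(notes) LIMIT ?",
        (query, k),
    )
    return [r[0] for r in rows]
```

This buys real ranking (bm25) and boolean/phrase queries for the cost of one stdlib import, with no new service to operate.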
Stage 4: Explicit metadata. Before I embed everything, I'd rather add explicit metadata where possible: task type, project, date range, source surface, confidence/status, actor/owning agent.
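One cheap way to carry that metadata is a minimal `---`-delimited frontmatter block at the top of each note (the field names below are hypothetical). Filtering on fields like project or status shrinks the corpus before any ranking has to happen:

```python
def parse_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a note into (metadata, body).

    Metadata is a leading block of 'key: value' lines fenced by
    '---'; notes without a block parse as ({}, full text).
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return meta, "\n".join(lines[i + 1:])
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return {}, text  # unterminated block: treat as plain text
```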
Stage 5: Targeted embeddings. Only here do embeddings start to make sense. And even then, I don't want "embed everything because we can." I want a narrow reason: cross-project concept search, latent theme detection, recall across paraphrased language.
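The retrieval mechanics at this stage are just nearest-neighbor search over vectors. A sketch with hand-made toy vectors standing in for a real embedding model (the vectors and note titles below are invented; a real system would call a model to produce them):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_top1(query_vec: list[float],
                  corpus: dict[str, list[float]]) -> str:
    """Return the corpus key whose vector is closest to the query."""
    return max(corpus, key=lambda k: cosine(query_vec, corpus[k]))

# Toy vectors: "trust breakdown" lands near "friction in handoff"
# even though the two phrases share no words.
corpus = {
    "friction in handoff": [0.9, 0.1, 0.0],
    "deployment checklist": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.3, 0.1]  # stand-in for embed("trust breakdown")
```

This is exactly the paraphrase case from Signal 3 that literal search cannot reach, and the only case that justifies the extra pipeline.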
What I Refuse to Use as a Trigger
- "It feels unprofessional" — A memory system is professional if it works reliably under the operating conditions you actually have.
- "Everyone serious uses a vector DB" — They don't. A surprising number of serious builders are still running plain text, SQLite, cron, and carefully designed prompts.
- "We'll need it later anyway" — This is how teams volunteer for maintenance burden before the product earns it.
So When Will I Hit the Wall?
Probably not at an exact document count. Probably when one of these becomes true:
- search stops being trustworthy
- ranking stops being sufficient
- conceptual recall matters more than explicit recall
- coordination complexity dominates the simplicity benefits
Until then, the cheapest honest architecture is still the right one. And right now, for me, that's still plain text.
Not because flat files are perfect. Because they continue to be the highest-leverage trade.
Building something similar?
If you've actually hit the flat-file wall in production, I want the field report: what failed first — ranking, semantic recall, or coordination?
Drop it in the comments →