Letta vs Mazemaker

Letta (formerly MemGPT) ships an OS-style memory hierarchy (main context plus archival and recall channels), with public LongMemEval results in their paper. We will run it on the same harness as our master baseline and publish the JSON here. We will not report a number until the run lands.


Their architecture

Letta’s thesis is "memory as an OS": the LLM context window is treated as RAM, and content from the archival and recall channels is swapped into and out of context via tool calls. The agent reasons about when to fetch its own memory rather than retrieving on every turn. Public benchmarks live in their paper and the letta repo.
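To make the "context as RAM" idea concrete, here is a toy sketch of a paged memory. It is not Letta's API: the class `ToyPagedMemory`, its methods, and the substring search are all hypothetical stand-ins, and a real system would use an LLM-issued tool call and an embedding index instead.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyPagedMemory:
    """Toy model of an OS-style memory hierarchy (hypothetical, not Letta's real API)."""
    context_budget: int = 3                         # max facts held "in RAM"
    main_context: deque = field(default_factory=deque)
    archival: list = field(default_factory=list)

    def remember(self, fact: str) -> None:
        # New facts enter main context; the oldest are evicted to archival when full.
        self.main_context.append(fact)
        while len(self.main_context) > self.context_budget:
            self.archival.append(self.main_context.popleft())

    def archival_search(self, query: str, k: int = 1) -> list:
        # Stand-in for a memory tool call: naive case-insensitive substring search.
        hits = [f for f in self.archival if query.lower() in f.lower()]
        return hits[:k]

    def page_in(self, query: str) -> list:
        # The agent decides when to call this; hits are swapped back into context,
        # which may in turn evict older facts to archival.
        hits = self.archival_search(query)
        for fact in hits:
            self.archival.remove(fact)
            self.remember(fact)
        return hits
```

The point of the sketch is the control flow: nothing is retrieved unless the agent explicitly calls `page_in`, which is the behaviour the benchmark is meant to price.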

Methodology — locked

  1. Dataset: LongMemEval-S, all 500 questions, identical haystack split.
  2. Retrieval: top-k=10. Letta gets its native main+archival+recall channels; Mazemaker gets hybrid + ColBERT @ 1.5.
  3. Judge: identical — substring_match on the LongMemEval gold span.
  4. Metrics: R@1, R@5, R@10, MRR, p50/p95 retrieval latency, total tokens spent on tool-call overhead.
  5. Hardware: 16 GB VRAM, identical embedding backend (BGE-M3 1024d) where the system supports an external embedder; otherwise document the engine’s default.
  6. Reproducibility: a single shell script in benchmarks/external/letta_run.sh, committed before the run.
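The metrics in step 4 can be sketched as follows. This is a minimal illustration, not the harness code: the function names are hypothetical, and the case-insensitive, whitespace-stripped comparison inside `substring_match` is an assumption about how the judge normalizes text.

```python
from statistics import quantiles


def substring_match(gold: str, text: str) -> bool:
    # Assumption: case-insensitive containment of the gold span.
    return gold.strip().lower() in text.lower()


def first_hit_rank(gold: str, retrieved: list, k: int = 10):
    # 1-based rank of the first retrieved passage containing the gold span,
    # or None if no passage in the top-k matches.
    for rank, passage in enumerate(retrieved[:k], start=1):
        if substring_match(gold, passage):
            return rank
    return None


def score(ranks: list, ks=(1, 5, 10)) -> dict:
    # ranks holds one entry per question; None counts as a miss everywhere.
    n = len(ranks)
    out = {f"R@{k}": sum(1 for r in ranks if r is not None and r <= k) / n for k in ks}
    out["MRR"] = sum(1.0 / r for r in ranks if r is not None) / n
    return out


def latency_percentiles(latencies_ms: list) -> dict:
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}
```

Because both systems are scored through the same `first_hit_rank` path, any difference in R@k or MRR comes from what was retrieved, not from how it was judged.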

Mazemaker reference (same harness)

Config                     R@1      R@5      R@10     MRR      p50
master baseline (hybrid)   0.8064   0.9596   0.983    0.8733
+ ColBERT @ 1.5            0.8574   0.9787   0.9894   0.9114   56.9 ms

What “queued” means


Why this comparison matters

Letta and Mazemaker disagree about where the memory work should happen. Letta puts a tool-calling LLM in the loop — the agent decides when to consult memory. Mazemaker puts memory in the retrieval path — the agent always sees the top-k and never has to ask. Both can be the right call. The benchmark answers which one retrieves the fact more often, at what cost, on the same questions.

The cost axis matters because Letta’s tool-call loop is not free: every "consult archival" decision is an extra LLM round-trip. We’ll report total tokens spent so the cost-vs-quality tradeoff is legible.
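One way to account for that overhead is sketched below. The `LLMCall` record and both helper functions are hypothetical; the actual harness may log token usage differently, and which round-trips count as "tool-call overhead" is an assumption encoded in the `is_tool_call` flag.

```python
from dataclasses import dataclass


@dataclass
class LLMCall:
    prompt_tokens: int
    completion_tokens: int
    is_tool_call: bool  # True when the round-trip exists only to consult memory


def tool_call_overhead(calls: list) -> int:
    # Total tokens spent on memory-consultation round-trips for one question.
    return sum(c.prompt_tokens + c.completion_tokens for c in calls if c.is_tool_call)


def tokens_per_hit(total_overhead_tokens: int, hits: int) -> float:
    # Overhead tokens paid per correctly retrieved answer; infinite if nothing was hit.
    return total_overhead_tokens / hits if hits else float("inf")
```

A system that retrieves on every turn without an LLM in the loop would report zero under this definition, so the figure isolates the price of letting the agent decide.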