A-MEM vs Mazemaker
A-MEM is the Zettelkasten-style agentic memory framework with LLM-driven note evolution. We will run it on LongMemEval-S 500q with the LLM-evolution stage both on and off, to isolate retrieval signal from generation lift. No number until the run lands.
Their architecture
A-MEM (paper-published) treats memories as Zettelkasten notes: each new memory is linked to existing ones via an LLM-driven evolution stage that decides which prior notes are relevant and how the new note refines them. Retrieval then walks the resulting note graph. The LLM is in the write path, not just the read path. Repo: agiresearch/A-mem.
Methodology — locked
- Dataset: LongMemEval-S, all 500 questions, identical haystack split.
- Retrieval: top-k=10 in both systems. A-MEM walks the evolved note graph; Mazemaker uses hybrid + ColBERT @ 1.5 + the existing graph (PPR-traversed only when explicitly enabled).
- Ablation: A-MEM is run twice — once with the LLM-evolution stage active (their default), once with it bypassed (vector search over raw notes). The delta isolates how much of A-MEM’s lift comes from retrieval vs. from LLM-driven note rewriting.
- Cost surface: total tokens spent on the evolution stage are reported alongside accuracy. Mazemaker’s consolidation runs while the system is idle and is not paid per turn, so the cost shape is meaningfully different.
- Judge: identical in both systems (substring_match).
- Reproducibility: shell script in benchmarks/external/amem_run.sh, committed before the run.
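A substring_match judge can be sketched as below. This is a minimal illustration, not the harness's actual code: the function names, the lowercase and whitespace normalization, and the choice to check each candidate text independently are all assumptions.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace (assumed normalization)."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def substring_match(gold_answer: str, candidate_texts: list[str]) -> bool:
    """Return True if the normalized gold answer occurs in any candidate text.

    Hypothetical helper: the real judge may normalize differently or
    compare against a different field.
    """
    needle = normalize(gold_answer)
    if not needle:
        # An empty gold answer would trivially match everything; reject it.
        return False
    return any(needle in normalize(text) for text in candidate_texts)
```

Because the same deterministic function scores both systems, any accuracy gap is attributable to what each system retrieved rather than to judge variance.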
Mazemaker reference (LongMemEval-S 500q)
| Config | R@1 | R@5 | R@10 | MRR | p50 latency |
|---|---|---|---|---|---|
| master baseline (hybrid) | 0.8064 | 0.9596 | 0.983 | 0.8733 | — |
| + ColBERT @ 1.5 | 0.8574 | 0.9787 | 0.9894 | 0.9114 | 56.9 ms |
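For readers unfamiliar with the column headers, the sketch below shows one common way to compute R@k and MRR from per-question ranked results. It assumes R@k means "at least one relevant item in the top k" (hit rate); the page does not state which definition the harness uses, and the function names are hypothetical.

```python
def recall_at_k(ranked_hits: list[list[bool]], k: int) -> float:
    """Fraction of questions with at least one relevant item in the top k.

    ranked_hits[i][j] is True if the j-th ranked result for question i
    is relevant (assumed input shape).
    """
    return sum(any(hits[:k]) for hits in ranked_hits) / len(ranked_hits)


def mean_reciprocal_rank(ranked_hits: list[list[bool]]) -> float:
    """Mean of 1/rank of the first relevant result; 0 if none is found."""
    total = 0.0
    for hits in ranked_hits:
        for rank, hit in enumerate(hits, start=1):
            if hit:
                total += 1.0 / rank
                break
    return total / len(ranked_hits)


# Toy input, not benchmark data: three questions, two results each.
toy = [[True, False], [False, True], [False, False]]
print(recall_at_k(toy, 1))        # 1 of 3 questions hit at rank 1
print(recall_at_k(toy, 2))        # 2 of 3 questions hit within top 2
print(mean_reciprocal_rank(toy))  # (1 + 1/2 + 0) / 3 = 0.5
```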
Why this comparison matters
A-MEM and Mazemaker both ship knowledge graphs — but the LLM sits in very different places. A-MEM puts an LLM in the write path of every note: each new entry triggers note-evolution that rewrites neighbors. Mazemaker has one optional LLM call — user-triggered, run on demand for Stage C / synthesis crystallization, not per-turn. The rest of the write path (sponge ingest, embedding, dream consolidation, supersession) is fully mechanical. WIP: insourcing that one call to a local sub-1B model is an open thread — we’re searching for the smallest model that still passes our extraction rubric so the “no external token bill” story holds end-to-end.
We expect A-MEM’s per-turn LLM-evolution to produce semantically richer links, at substantial token cost; we also expect Mazemaker’s embedding-similarity links plus dream-cycle consolidation to recover most of that recall lift without the per-turn bill. The ablation tells us whether we’re right.
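The ablation delta and cost surface from the methodology could be summarized per pair of runs roughly as follows. This is a sketch under assumptions: the RunResult fields, the function name, and the "tokens per accuracy point" ratio are illustrative and not part of the committed harness.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    accuracy: float        # fraction of questions judged correct, in [0, 1]
    evolution_tokens: int  # tokens spent on the LLM-evolution stage


def ablation_summary(with_evolution: RunResult,
                     without_evolution: RunResult) -> dict:
    """Compare an evolution-on run with an evolution-off run (hypothetical)."""
    lift = with_evolution.accuracy - without_evolution.accuracy
    extra_tokens = (with_evolution.evolution_tokens
                    - without_evolution.evolution_tokens)
    # Tokens paid per percentage point of accuracy gained; undefined when
    # the evolution stage yields no gain.
    tokens_per_point: Optional[float] = (
        extra_tokens / (lift * 100) if lift > 0 else None
    )
    return {
        "retrieval_only_accuracy": without_evolution.accuracy,
        "evolution_lift": lift,
        "extra_tokens": extra_tokens,
        "tokens_per_point": tokens_per_point,
    }
```

A small evolution_lift would suggest most of the recall comes from retrieval alone; a large one with a high tokens_per_point would quantify what the per-turn LLM bill buys.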
What “queued” means
- Harness exists. Methodology and ablation are locked.
- When the JSON lands, this page updates with the verified tables (with-evolution and without-evolution) — same shape as the Hindsight page.
- If A-MEM’s numbers contradict our expectation, we publish them here, unedited.