Cognee vs Mazemaker

Cognee describes itself as “AI memory at scale”: an LLM-driven knowledge-graph constructor whose graph is combined with vector search at retrieval time. We will run it on all 500 questions of LongMemEval-S. No numbers are reported until the run completes.


Their architecture

Cognee runs an LLM extraction step that builds a knowledge graph (entities + relations) from the corpus. At query time, retrieval combines the graph walk with vector search over chunk embeddings. The graph-construction step is the system’s defining feature — and its main cost surface. Repo: topoteretes/cognee.
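The text does not say how Cognee merges graph-walk results with vector-search results. As a rough illustration of what combining two ranked lists can look like, the sketch below uses reciprocal rank fusion, a generic technique swapped in here for the example and not taken from Cognee. The function name, the constant k=60, and the chunk ids are all hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60, top_k=10):
    """Fuse several ranked lists of chunk ids into one ranking.

    Each list contributes 1 / (k + rank) per item; k=60 is a commonly
    used default and is an assumption here, not a Cognee setting.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    fused = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return fused[:top_k]

# Hypothetical ids returned by a graph walk and by vector search.
graph_hits = ["c3", "c1", "c7"]
vector_hits = ["c1", "c4", "c3"]
fused = reciprocal_rank_fusion([graph_hits, vector_hits])
print(fused)
```

Chunks that appear in both lists accumulate score from each, so they rise above chunks found by only one retriever.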

Methodology — locked

  1. Dataset: LongMemEval-S, all 500 questions, identical haystack split.
  2. Retrieval: top-k=10. Cognee uses its KG + vector hybrid; Mazemaker uses hybrid + ColBERT @ 1.5 + (optional) PPR graph traversal on its own auto-built graph.
  3. Cost surface: tokens spent on Cognee’s LLM-graph-construction stage are reported alongside accuracy. Mazemaker builds its graph mechanically (cosine-similarity edges + dream-cycle consolidation) at zero LLM cost.
  4. Latency surface: Cognee’s ingest takes on the order of minutes per session; Mazemaker’s takes on the order of milliseconds per turn. The benchmark reports wall-clock ingest cost separately from query latency because they are different operational concerns.
  5. Judge: identical — substring_match.
  6. Reproducibility: shell script in benchmarks/external/cognee_run.sh, committed before the run.
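To make items 2 and 3 more concrete, here is a minimal sketch of a mechanically built graph (cosine-similarity edges, no LLM calls) followed by a graph traversal, reading PPR as Personalized PageRank. It is not Mazemaker's code: the similarity threshold, the restart probability alpha, the iteration count, and all function names are assumptions, and dream-cycle consolidation is not modeled.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_edges(embeddings, threshold=0.8):
    """Link every pair of nodes whose cosine similarity meets the threshold.

    The threshold value is an assumption made for this example.
    """
    ids = list(embeddings)
    adj = {i: {} for i in ids}
    for a_idx, a in enumerate(ids):
        for b in ids[a_idx + 1:]:
            sim = cosine(embeddings[a], embeddings[b])
            if sim >= threshold:
                adj[a][b] = sim
                adj[b][a] = sim
    return adj

def personalized_pagerank(adj, seeds, alpha=0.15, iters=50):
    """Power iteration with restart probability alpha to the seed nodes."""
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in adj}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: alpha * restart[n] for n in adj}
        for n, nbrs in adj.items():
            total = sum(nbrs.values())
            if total == 0:
                # Dangling node: send its mass back to the seeds.
                for m in adj:
                    nxt[m] += (1 - alpha) * rank[n] * restart[m]
                continue
            for m, w in nbrs.items():
                nxt[m] += (1 - alpha) * rank[n] * w / total
        rank = nxt
    return rank

# Toy 2-d embeddings; real ones would come from an embedding model.
embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
adj = build_edges(embeddings)
scores = personalized_pagerank(adj, seeds={"a"})
```

In this toy input only "a" and "b" are similar enough to be linked, so a walk seeded at "a" spreads score to "b" and leaves "c" at zero.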

Mazemaker reference (LongMemEval-S 500q)

Config                     R@1      R@5      R@10     MRR      p50
master baseline (hybrid)   0.8064   0.9596   0.983    0.8733
+ ColBERT @ 1.5            0.8574   0.9787   0.9894   0.9114   56.9 ms
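For readers unfamiliar with the R@k and MRR columns, the sketch below shows one way such numbers could be computed from per-question ranks under a substring_match judge. It is an assumption-laden illustration, not the benchmark harness: the case-insensitive matching, the function names, and the toy questions are all hypothetical, and the actual harness may define a hit differently.

```python
def substring_match(gold_answer, text):
    """Case-insensitive check that the gold answer appears in the text.

    Case folding is an assumption; the real judge may be stricter.
    """
    return gold_answer.strip().lower() in text.lower()

def first_hit_rank(gold_answer, retrieved_texts):
    """1-based rank of the first retrieved text containing the answer, else None."""
    for rank, text in enumerate(retrieved_texts, start=1):
        if substring_match(gold_answer, text):
            return rank
    return None

def recall_at_k(ranks, k):
    """Fraction of questions whose first hit is within the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over all questions; a miss contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# Toy questions: (gold answer, retrieved texts in rank order).
questions = [
    ("Paris", ["I moved to Berlin.", "We visited Paris in May."]),
    ("blue", ["My bike is blue."]),
    ("42", ["No numbers here."]),
]
ranks = [first_hit_rank(gold, texts) for gold, texts in questions]
print(ranks)  # [2, 1, None]
print(recall_at_k(ranks, 1), recall_at_k(ranks, 5), mean_reciprocal_rank(ranks))
```

With top-k=10 retrieval, each question contributes at most ten candidate texts, so R@10 is the ceiling for this kind of measurement.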

Why this comparison matters

Cognee’s thesis is that LLM-built KGs produce better recall than mechanical embedding graphs. The Mazemaker thesis is that mechanical embedding edges + idle dream-cycle consolidation produce comparable recall at zero LLM-ingest cost. The benchmark tests whether either thesis survives a 500-question harness with token costs reported alongside accuracy.

What “queued” means