Hindsight’s 10 small LLMs — re-run on Mazemaker

Hindsight evaluated ten small open-source LLMs against a structured JSON-output protocol; all ten scored 0/N. The published result describes JSON-schema conformance, not retrieval quality. We ran the same ten models on the same haystack with plain-text prompting through Mazemaker: 188/200 = 94.0%.


The original evaluation

Hindsight published a benchmark where ten small open-source LLMs — from gemma3:270m through llama3.2:latest — were prompted with structured JSON output schemas. None produced a parseable conforming response. The published note tags each as not viable for the protocol as tested.

The reported failure mode was structured-output conformance: the models would attempt an answer in prose, but the JSON validator rejected the shape. That’s a useful finding about small-model schema-following, but it is a different question from the one a memory benchmark asks: can the system retrieve the fact that answers the question, and can the model use it once retrieved? The two questions deserve separate benches.
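To make the distinction concrete, here is a minimal sketch of the two gates applied to the same prose answer. The example answer, the gold span, the required key, and both function names are illustrative assumptions; neither benchmark's actual validator is reproduced here.

```python
import json


def passes_json_gate(response: str, required_keys: set[str]) -> bool:
    """Strict gate: the response must parse as a JSON object with the required keys."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def passes_substring_gate(response: str, gold_span: str) -> bool:
    """Lenient gate: the gold span appears somewhere in the response.

    Case-insensitive matching is an assumption of this sketch.
    """
    return gold_span.casefold() in response.casefold()


# Hypothetical model output: correct content, but plain prose rather than JSON.
prose_answer = "The deployment window opens on Tuesday at 09:00."

print(passes_json_gate(prose_answer, {"answer"}))
print(passes_substring_gate(prose_answer, "Tuesday at 09:00"))
```

The same response fails the first gate and passes the second, which is the gap between the two protocols in miniature.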

Our run

Same 10 models. Same haystack, synthetic_v1 (n=20, hash b1e2de77fabe79a9). Same retrieval depth (k=10), with the top 5 results passed to the prompt. The only change: we asked in plain English. Mazemaker fetched the relevant memories; the model wrote the answer in a sentence; we scored on substring match against the gold span.
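The shape of that loop can be sketched as follows. This is not Mazemaker's code: `build_prompt`, `count_hits`, the prompt wording, and the case-insensitive comparison are all assumptions made for illustration.

```python
def build_prompt(question: str, ranked_memories: list[str], top_n: int = 5) -> str:
    """Put the top_n retrieved memories into a plain-English prompt."""
    context = "\n".join(f"- {memory}" for memory in ranked_memories[:top_n])
    return (
        f"Relevant notes:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer in one sentence."
    )


def count_hits(answers: list[str], gold_spans: list[str]) -> int:
    """Count answers that contain their gold span (case-insensitive by assumption)."""
    return sum(
        gold.casefold() in answer.casefold()
        for answer, gold in zip(answers, gold_spans)
    )


# Ten retrieved candidates (k=10); only the first five reach the prompt.
retrieved = [f"memory {i}" for i in range(10)]
prompt = build_prompt("Which memory ranks first?", retrieved)
print(prompt)

# Two hypothetical model answers scored against two gold spans.
print(count_hits(["The badge is blue.", "I am not sure."], ["blue", "red"]))
```

Nothing in this path requires the model to emit a particular structure, so a formatting failure cannot mask a correct answer.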

Model               | Hindsight | Mazemaker (ColBERT @ 1.5) | Recall p50 | JSON leaks | Errors
gemma3:1b           | 0/N       | 19/20 = 95%               | 0.45 s     | 0          | 0
gemma3:12b          | 0/N       | 20/20 = 100%              | 0.38 s     | 0          | 0
gemma3:270m         | 0/N       | 18/20 = 90%               | 0.40 s     | 0          | 0
qwen2.5:0.5b        | 0/N       | 18/20 = 90%               | 0.37 s     | 0          | 0
qwen2.5:3b          | 0/N       | 19/20 = 95%               | 0.38 s     | 0          | 0
smollm2:1.7b        | 0/N       | 20/20 = 100%              | 0.38 s     | 0          | 0
deepseek-r1:1.5b    | 0/N       | 15/20 = 75%               | 0.37 s     | 0          | 0
granite3.1-dense:2b | 0/N       | 20/20 = 100%              | 0.36 s     | 0          | 0
llama3.2:latest     | 0/N       | 19/20 = 95%               | 0.41 s     | 0          | 0
ministral-3:3b      | 0/N       | 20/20 = 100%              | 0.36 s     | 0          | 0
TOTAL               |           | 188 / 200 = 94.0%         |            |            |
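The total can be re-derived from the per-model counts in the table. The snippet below only restates those published counts; the variable names are ours.

```python
# Correct answers out of 20 per model, copied from the table above.
correct_by_model = {
    "gemma3:1b": 19,
    "gemma3:12b": 20,
    "gemma3:270m": 18,
    "qwen2.5:0.5b": 18,
    "qwen2.5:3b": 19,
    "smollm2:1.7b": 20,
    "deepseek-r1:1.5b": 15,
    "granite3.1-dense:2b": 20,
    "llama3.2:latest": 19,
    "ministral-3:3b": 20,
}
QUESTIONS_PER_MODEL = 20

total_correct = sum(correct_by_model.values())
total_trials = QUESTIONS_PER_MODEL * len(correct_by_model)
summary = f"{total_correct} / {total_trials} = {total_correct / total_trials:.1%}"
print(summary)  # 188 / 200 = 94.0%
```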

What this measures

This bench measures retrieval — can the system surface the fact that answers the question, then can the model use it. The original Hindsight evaluation measures something useful but different: whether a small open-source model can reliably produce a JSON object that validates against a given schema. Both are valid engineering questions, on the same models, with very different signals.

Reading the two numbers side-by-side: 0/N (JSON-schema conformance, original protocol) vs 188/200 (substring-match on the gold span, this protocol). The 270M-parameter model that scored 0/N for JSON output scored 18/20 = 90% for retrieval-then-prose. That delta is the point of running both.

Reproduce

git clone https://github.com/itsXactlY/mazemaker.git
cd mazemaker
python benchmarks/external/comparison_bench.py \
  --models gemma3:1b,gemma3:12b,gemma3:270m,qwen2.5:0.5b,qwen2.5:3b,\
smollm2:1.7b,deepseek-r1:1.5b,granite3.1-dense:2b,llama3.2:latest,ministral-3:3b \
  --dataset synthetic_v1 \
  --enable-colbert --colbert-weight 1.5 \
  --recall-mode advanced --rerank \
  --output results/my-run.json

~12 minutes on 16 GB VRAM. If your numbers differ, please open an issue with your environment + the result JSON.


Honest caveats

  1. Dataset is synthetic and small (20 questions × 10 models = 200 trials). Variance band is ±2–3 questions per model, and at n=20 one question is worth 5 percentage points, so treat per-model gaps of 10 to 15 points or less as noise.
  2. Hindsight’s own benchmark may not have been intended as a memory-engine benchmark; we’re re-purposing it. The question this page answers is "would those 10 models be useful for memory tasks if you stopped gating on JSON?" — not "is Hindsight a bad benchmark?".
  3. Substring-match judging admits false positives if the gold string is a common phrase. We hand-spot-checked the high-scorers; the JSON has the per-question rows for any reader to spot-check independently.
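To put a rough number on the first caveat, a Wilson score interval shows how wide the uncertainty is at n=20. This is a sketch under a simple binomial assumption that we chose for illustration; it is not part of the benchmark script. For 18/20 the 95% interval spans roughly 70% to 97%, which is why small per-model differences should not be read as a ranking.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# Pooling all 200 trials treats them as independent, which is optimistic:
# every model answers the same 20 questions.
for label, successes, trials in [("gemma3:270m", 18, 20), ("all models", 188, 200)]:
    low, high = wilson_interval(successes, trials)
    print(f"{label}: {successes}/{trials} -> {low:.1%} to {high:.1%}")
```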