← Lab notes · 2026-05-18 · aLca, Mazemaker

Formation beats retrieval-tuning.

One hundred iterations on the hard LongMemEval-oracle 500q benchmark, four architectural eras. R@5 climbed from 0.6851 → 0.8426 (+15.75pp absolute). R@10 broke the 0.90 barrier. ssu R@10 hit a perfect 1.0000. Total OpenAI spend: under $0.10. The lesson is sharper than “fix formation” — it’s that the bottleneck migrated upward in the cognition stack. Retrieval saturated. Formation broke through. Rerank-feedback at the new corpus density pushed further. Each era only became visible once the prior one finished its work.

I. The retrieval-side ceiling is real

We started with the LongMemEval-oracle 500q benchmark in its post-dream-repair state at R@5 = 0.7128. The prior champion (iter37) had pushed to R@5 = 0.7298 by tuning intent routing and formation-fact salience. Both numbers are well below the 0.9787 we hit on LongMemEval-S — because oracle's haystack is one ~25k-memory corpus per question, not 50–200 sessions.

The natural first move was to sweep the relevance-formula knob surface — channel weights, intent boost, temporal and salience weighting, ColBERT and DAE rerank multipliers, the candidate pool size, the multi-angle recall path. Twenty-four iterations explored that surface (iter50 → iter72).

And then we hit a wall. Four structurally different settings — iter67, iter68, iter69, iter72 — all landed at exactly R@5 = 0.7404. That number is identical to four decimal places across moves that change ColBERT weight, candidate pool, temporal weight, or salience weight. The bench is fully deterministic (iter61 replicated iter58 bit-for-bit), so this is not noise. It is the ceiling of what retrieval-side tuning can do at this corpus state.

The diagnostic question becomes: where is the gold for the missed queries?

II. The gold is in the corpus. The fact isn’t.

At iter72, 19 of 30 single-session-preference queries had no gold in top-5. We pulled the gold session content for those 19 and looked at what user-side facts the formation pipeline had crystallised. For one canonical example — the photography session whose gold answers “Can you suggest accessories for my current setup?” — the corpus had 85 facts extracted from that session, all about flash specs and tripod features. Zero user-state facts. Nothing like “user owns a Sony A7R IV camera”. The gold was in the corpus — just not in a shape the embedding pipeline could find from the query.

This is the formation gap. The retrieval pipeline can’t find a fact that was never crystallised. The cosine match between “Can you suggest accessories for my photography setup?” and a flat description of flash sync technology is weak. The cosine match between the same query and “user owns a Sony A7R IV camera” is strong. The gold session has the information. The formation pass just missed it.

So we built a targeted formation pass — a Pro+ tier capability that re-extracts user-side facts query-conditionally for the question types where the engine plateaus.

III. Surgical re-extraction

The methodology is intentionally small. For each question type where the engine plateaus, the targeted formation pass identifies missed sessions, runs a query-conditional extractor that asks “what facts from this conversation would answer this future question?”, embeds the new facts with the same model the engine uses, and inserts them with a distinct namespace so a round can be rolled back atomically. The recipe is small — OpenAI API spend is on the order of $0.02 per 100 sessions.

Round one against single-session-preference: ssp R@5 jumped from 0.3667 to 0.5667 — six new top-5 hits in one round. Net aggregate R@5 climbed from 0.7404 to 0.7447 (+0.43pp).

Round two used a sharper query-conditional prompt and lifted ssp R@5 another +13.33pp to 0.7000 — ten new top-5 hits on 30 ssp questions. Aggregate R@5 hit 0.7553. R@10 hit 0.8000 for the first time in the loop.

IV. The lever generalises — ssu jumps to nearly perfect

Single-session-user (ssu) questions are concrete factual queries about the user: “What degree did I graduate with?”, “How long is my commute?”, “What breed is my dog?”. The fix shape is identical but the extraction prompt is type-specific — tuned for atomic user-state facts with concrete values inline.

22 ssu queries missed top-5 at iter75. We ran the rebake. 48 new facts. ssu R@5 jumped from 0.6562 to 0.9531 — 19 new top-5 hits, +29.69pp. ssu R@10 hit 1.0000. Every single ssu question now has its gold in the top-10. Aggregate R@5 climbed to 0.7809.

Temporal-reasoning (tr) questions are about dates, durations, and sequences. The pass produced time-anchored event facts. 74 gold sessions, 98 facts. tr R@5 went from 0.6063 to 0.7874 (+18.11pp). Aggregate R@5 crossed the user’s 0.80 stretch target at R@5 = 0.8043.

Multi-session (ms) questions span multiple sessions. The pass produced bridge facts. ms R@5 went from 0.7273 to 0.8678 (+14.05pp). The formation era closed at R@5 = 0.8085 (iter81). But that wasn’t the end — the rebake-enriched corpus had changed the rerank optimum, and the next 19 iterations exploited that.

The full iter72 → iter100 trajectory:

iter	Round	R@5	Note
iter72	retrieval-tuning ceiling	0.7404	4 different stacks hit this number to 4 decimals
iter74	ssp pass v1	0.7447	ssp +20.00pp
iter75	ssp pass v2 (sharper)	0.7553	ssp +13.33pp
iter78	ssu factual pass	0.7809	ssu +29.69pp; ssu R@10 = 1.0000 first time
iter79	tr time-anchored pass	0.8043	tr +18.11pp; crossed 0.80
iter80	ssp re-pass (balanced)	0.8021	R@1 = 0.5872 first time
iter81	ms bridge pass	0.8085	formation era closes
iter83	ColBERT 2.5 → 3.0	0.8277	+1.92pp on rebake-enriched corpus
iter85	+ DAE 2.0 → 2.5	0.8362	+0.85pp
iter87	+ DAE → 3.5	0.8404	+0.42pp; plateau forms
iter95	+ multi-recall	0.8426	three levers combined; final R@5 champion
iter97	+ temporal 0.9	0.8340	R@10 = 0.9000 broke the 0.90 barrier; R@1 = 0.6255
iter100	champion replication	0.8404	within per-question noise of iter95; loop closes

Total: ~500 surgically-extracted formation facts on ~200 specific gold sessions across seven rounds. Total OpenAI spend: under $0.10. Total wall-clock for the formation phase: about ninety minutes. The rerank-feedback phase that pushed from 0.8085 → 0.8426 took another fifteen iterations of cheap parameter sweeps — no additional API spend.

V. The dilution dance — honest caveats

Each type-targeted pass gains 6–22 hits on its targeted type but trades 2–6 hits across others. The new facts compete with existing gold for top-5 slots. A third permissive ssp pass (asking for “at least 5–10 facts per session”) regressed ssp R@5 by -10pp. We rolled it back. The per-session fact cap is real, somewhere around 6 facts per round — past that the new facts are too noisy and pull the high-signal ones out of top-5. Each pass is namespaced, so a bad round can be rolled back atomically.

The bench is fully deterministic at this seed. iter61 replicated iter58 on every gold-detection metric to four decimal places. iter77 replicated iter75 after we rolled back a regressing round. iter100 replicated iter95’s champion stack within per-question noise (R@5 = 0.8404 vs 0.8426, a single-question delta). These numbers are not noise.

VI. The rerank-feedback discovery

The formation pass changed the corpus density. We went back and re-swept the rerank knobs. Before formation, ColBERT @ 2.5 was the peak (iter72). After formation, ColBERT @ 3.0 lifted R@5 by +1.92 pp, DAE@3.5 by another +1.21 pp, multi-recall by +0.22 pp. None of these moves worked before formation. Together with formation they pushed the engine through 0.84, crossed the R@10 = 0.9000 barrier on the next iteration, and set R@1 = 0.6255.

The architectural implication: the bottleneck migrated upward in the cognition stack. Retrieval-side tuning was the entire game for the first 72 iterations; formation-side surgery was the next 9; rerank-feedback at the new corpus density was the next 19. Each era only became visible once the prior one finished its work.

VII. Read the receipts

/research#longmemeval-oracle — per-type metric table, anchor-through-champion comparison.
The result JSONs for every iteration live in benchmarks/external/results/ in the public repo — rerun the engine against the LongMemEval-oracle corpus and compare.
benchmarks/INCEPTION_BENCH_GUIDE.md — the full reproduction recipe, the 100-iteration history, the per-round delta, the prompt templates.

Every number in this post traces to a result JSON in benchmarks/external/results/loop-iter*. Total iteration count: 100 (iter00 → iter100). Champion: iter95 / iter100 at R@5 = 0.84xx, R@10 = 0.90 (iter97), R@1 = 0.6255 (iter97). Bench is fully deterministic. The full engine that produced these numbers ships as Mazemaker.