93.8% on LongMemEval-S: How CortexDB's Retrieval Stack Stacks Up

Q: What is LongMemEval and why does it matter?

LongMemEval is a 500-question ICLR 2025 benchmark testing chat assistant memory over ~115K tokens of prior conversation. It exercises six memory capabilities and competitors publish on it, making comparisons apples-to-apples.

Q: How did CortexDB hit 93.8% on LongMemEval-S?

Per-session LLM fact extraction on ingest via gpt-4o-mini, hybrid HNSW + Tantivy retrieval with Reciprocal Rank Fusion, type-routed answer prompts on Claude Opus 4.6, and self-consistency voting on multi-session and temporal-reasoning categories.

In a single run against the production server with no in-memory shortcuts, CortexDB scored 93.8% (469 of 500) on LongMemEval-S, the ICLR 2025 benchmark for long-term assistant memory. The run cost $49.69 in API credits and places CortexDB above Mem0's self-reported 93.4%. CortexDB, the long-term memory layer for AI agents built by Apache Cassandra co-creator Prashant Malik, relies on a lossless event-sourced architecture and 4-channel hybrid retrieval (WAL, RocksDB, HNSW, Tantivy) to hit this mark.

93.8%

469 / 500 correct on LongMemEval-S, single run, no retry targeting.

~2 hours

End-to-end wall clock (server-parity, 4-way parallelism).

$49.69

Total API cost to run the benchmark from scratch.

The score is the server-parity result: every question runs through the production CortexDB API (WAL, RocksDB, HNSW, Tantivy, and Cognitive Recall). It supersedes the earlier 92.8% number, which used an in-memory research pipeline that bypassed the server. This post walks through the stack, the costs, the per-category breakdown, the leaderboard position, and the experiments that did not work.

Why LongMemEval is the benchmark to take seriously

LongMemEval (ICLR 2025) tests six distinct memory capabilities in 500 hand-written, time-stamped question-answer pairs against ~115K tokens of prior conversation per question.

| Capability | Example |
| --- | --- |
| Knowledge update | "Where do I currently keep my old sneakers?" (the user said one place in March, a different place in May) |
| Multi-session aggregation | "How many weddings have I attended this year?" |
| Temporal reasoning | "How many weeks ago did I attend the 'Summer Nights' festival?" |
| Single-session assistant recall | "What was the Jamaican dish you recommended I try?" |
| Single-session user recall | "What colour did I repaint my bedroom walls?" |
| Single-session preference | "I'm planning a trip to Denver, suggestions tailored to my prior preferences?" |

The benchmark is worth taking seriously for four reasons. Competitors publish on it, so comparisons are direct and apples-to-apples. Each question carries ~50 prior sessions of noise that retrieval has to cut through. Six question types force breadth, so a single-trick stack cannot cheat. And the gold labels are mostly clean, with a few documented edge cases.

The pipeline running on the production server

Every question runs through the same end-to-end path. No fine-tuning. No custom models. All primitives that can be purchased today.

                    PER-QUESTION PIPELINE (production server)
┌──────────────────── WRITE (one-time per dataset) ────────────────────┐
│  raw sessions                                                          │
│     ↓                                                                  │
│  per-session LLM fact extraction       ·  gpt-4o-mini × 8 workers      │
│     ↓                                                                  │
│  chunks (raw + FACT/EVENT extracts)                                    │
│     ↓                                                                  │
│  OpenAI text-embedding-3-small (1536 dims, configurable to Cohere)     │
│     ↓                                                                  │
│  CortexDB write path: WAL → RocksDB → HNSW + Tantivy                   │
└────────────────────────────────────────────────────────────────────────┘
┌──────────────────── READ (per question, ~9.7 s p50) ─────────────────┐
│  HyDE passage   +   decomposed sub-queries                             │
│     ↓                                                                  │
│  HNSW vector + Tantivy BM25 in parallel  ·  RRF fusion                 │
│     ↓                                                                  │
│  token-budgeted packing (max_recall_tokens)                            │
│     ↓                                                                  │
│  optional Cohere rerank-v3.5 (off in the canonical 93.8% run)          │
│     ↓                                                                  │
│  type-routed prompt   +   Claude Opus 4.6                              │
│     ↓                                                                  │
│  self-consistency on multi-session + temporal + low-confidence:        │
│     2 extra Claude samples @ temp 0.4, 0.7  →  majority vote           │
└────────────────────────────────────────────────────────────────────────┘
                                  ↓
                    gpt-4.1 LLM-as-judge  (CORRECT / WRONG)
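
The fusion step in the read path is Reciprocal Rank Fusion, which merges ranked lists using only each item's position in each list. The sketch below shows the standard formula, not CortexDB's code: the chunk IDs are made up, and `k=60` is the conventional constant from the RRF literature, assumed here rather than taken from the CortexDB config.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one ranking.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, conventional default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical chunk IDs: one list from the vector index, one from BM25.
vector_hits = ["c7", "c2", "c9", "c4"]
bm25_hits = ["c2", "c5", "c7", "c1"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)  # ['c2', 'c7', 'c5', 'c9', 'c1', 'c4']
```

Chunks that both indexes return (`c2`, `c7`) rise to the top, which is why rank-based fusion works without calibrating BM25 scores against vector similarities.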

Five ideas carry most of the result.

LLM fact extraction on the ingest side. gpt-4o-mini extracts atomic FACT and EVENT triples per session and stores them alongside the raw text. The extraction runs asynchronously, off the user-facing write path.
Hybrid HNSW + BM25 retrieval. Both indexes run in parallel and fuse with Reciprocal Rank Fusion. BM25 is essential for keyword-heavy questions ("painting of a sunset worth"). HNSW carries the semantic load. Together they cover the query distribution that single-channel retrieval misses.
Type-routed answer prompts. Six system prompts, one per question type. Knowledge-update gets a "most-recent-date-wins" protocol. Multi-session gets an explicit SCOPE → ENUMERATE → DEDUPE → COUNT procedure. Temporal gets a timeline preamble. Each prompt encodes the answer protocol for its category.
Self-consistency on the hard categories. For multi-session and temporal-reasoning, the pipeline generates two additional Claude samples at temperature 0.4 and 0.7, then majority-votes on the extracted final answer.
Server-parity, not research-mode. The 93.8% number is what POST /v1/answer returns. The same code path serves production traffic.
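
The routing and voting ideas above can be sketched together in a few lines. Everything named here is illustrative: the prompt strings are one-line stand-ins for the real system prompts, `generate` is a hypothetical caller-supplied function, and the anchor temperature of 0.0 and the fallback to the anchor answer when no majority exists are assumptions the post does not confirm.

```python
from collections import Counter

# Hypothetical one-line stand-ins for the six type-specific system prompts.
PROMPTS = {
    "knowledge-update": "Sort evidence by date; the most recent statement wins.",
    "multi-session": "SCOPE -> ENUMERATE -> DEDUPE -> COUNT.",
    "temporal-reasoning": "Build a timeline first, then answer.",
    "single-session-assistant": "Recall what the assistant said earlier.",
    "single-session-user": "Recall what the user said earlier.",
    "single-session-preference": "Tailor the answer to stated preferences.",
}
VOTED_TYPES = {"multi-session", "temporal-reasoning"}
EXTRA_TEMPERATURES = (0.4, 0.7)  # the two extra samples described above


def normalize(answer):
    """Lower-case and collapse whitespace so equivalent answers compare equal."""
    return " ".join(answer.lower().split())


def answer_question(question, qtype, context, generate):
    """Route to the prompt for qtype, then vote on the hard categories.

    generate(system_prompt, question, context, temperature) -> str is a
    hypothetical callable wrapping whatever model client is in use.
    """
    system = PROMPTS[qtype]
    samples = [generate(system, question, context, 0.0)]  # anchor (assumed temp)
    if qtype in VOTED_TYPES:
        samples += [generate(system, question, context, t) for t in EXTRA_TEMPERATURES]
    winner, count = Counter(normalize(s) for s in samples).most_common(1)[0]
    if count * 2 <= len(samples):
        return samples[0]  # no strict majority: keep the anchor answer
    return next(s for s in samples if normalize(s) == winner)


def fake_generate(system, question, context, temperature):
    # Stand-in for a model call: the anchor disagrees with both extra samples.
    return "3 weddings" if temperature == 0.0 else "4 weddings"


voted = answer_question("How many weddings?", "multi-session", "", fake_generate)
single = answer_question("What colour?", "single-session-user", "", fake_generate)
print(voted, "|", single)  # 4 weddings | 3 weddings
```

Voting on a normalized final answer rather than on the full response text matters: two samples that reach the same count with different wording should still agree.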

Per-category scores

single-session-assistant · 100%

56 / 56. Perfect recall of what the assistant previously said. Long assistant responses are chunked so nothing gets truncated.

knowledge-update · 97.4%

76 / 78. "Sort by date, most recent wins" prompt plus session retrieval handles the category cleanly.

single-session-user · 95.7%

67 / 70. 3 wrong. Claude occasionally hedges when the answer is explicit in the source.

single-session-preference · 93.3%

28 / 30. Tailored recommendations. A judge-truncation bug was fixed where preference answers were scored only on their closing pleasantry.

temporal-reasoning · 91.7%

122 / 133. 11 wrong. Most failures are "what is the order of these N events" questions, where the chronology is present in the retrieved context but Claude misorders the events.

multi-session · 90.2%

120 / 133. 13 wrong. Scope disambiguation remains the hardest category. The 90% line was crossed on this push (up from 85.7% on the in-memory baseline).
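
As a quick consistency check, the six category results above can be summed to recover the headline figure. The counts are copied from this breakdown and nothing else is assumed.

```python
# (correct, total) per category, copied from the per-category breakdown.
results = {
    "single-session-assistant": (56, 56),
    "knowledge-update": (76, 78),
    "single-session-user": (67, 70),
    "single-session-preference": (28, 30),
    "temporal-reasoning": (122, 133),
    "multi-session": (120, 133),
}

correct = sum(c for c, _ in results.values())
total = sum(t for _, t in results.values())

for name, (c, t) in results.items():
    print(f"{name:26s} {c:3d}/{t:3d} = {100 * c / t:5.1f}%")
print(f"overall: {correct}/{total} = {100 * correct / total:.1f}%")
```

The totals come to 469 of 500, or 93.8%. The two largest categories, temporal-reasoning and multi-session, account for 24 of the 31 misses.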

Position on the leaderboard

Rank	System	Score	Notes
1	Supermemory ensemble	98.60%	8 specialist prompts voting
2	MemPalace hybrid*	96.60%	ChromaDB + per-question tuning
3	AgentMemory (J. McCann)	96.20%	6-signal hybrid + Claude Opus
4	OMEGA task-weighted	95.40%	Local-first, GPT-4.1
5	Mastra Observational Memory	94.87%	Observer/Reflector, gpt-5-mini
6	CortexDB (this work)	93.80%	Claude Opus 4.6 + hybrid retrieval, server-parity
7	Mem0 (self-reported)	93.40%
8	OMEGA raw	93.20%
9	Chronos	92.60%	Dual turn + event retrieval
10	Hindsight	91.40%	4-network structured memory
11	Emergence AI	86.00%	Session-level NDCG RAG
n/a	Oracle GPT-4o (upper bound)	82.40%
12	EverMemOS	82.00%

* MemPalace's 96.6% is disputed. The community has documented per-question hand-engineering.

CortexDB sits at 93.80%, 0.4 points above Mem0's self-reported 93.4% (a margin of 2 questions out of 500). The systems ranked above CortexDB are ensemble systems and multi-prompt configurations. The systems ranked below it include essentially every published single-answerer system.

Cost and latency

Per-question p50 measurements from the canonical run (stage_phase1_strict_recheck_20260508_201609):

Phase	p50
`server_recall_ms`	2,868 ms (HNSW + Tantivy + RRF + scoring + packing)
`server_gen_ms`	2,545 ms (Claude Opus 4.6 anchor + verify samples where triggered)
Overhead (verifier swap, retries, extra recalls)	3,632 ms
Total	9,707 ms

Cost averages $0.12 per question, $49.69 for the full 500-question run.

Cost-saver alt-tier. Swap Claude Sonnet 4.6 in as the answerer via CORTEX_ANSWER_MODEL=claude-sonnet-4-6. The trade: -1.0 percentage point accuracy (92.8% / 464 of 500), -74% cost ($15.71 per 500-question run), -25% generation latency. Single environment variable, same code path. Documented in docs/LATENCY_COST_FINDINGS_2026_05_10.md.

What was tried that did not work

The negative results matter as much as the positive ones because they let future latency and accuracy work skip cycles. Highlights:

Lever	Result	Lesson
Disable supplemental + adaptive recall passes	96.7% (-1.6 pp) on balanced60, -19% cost, -56% tail latency	Worth shipping as an opt-in cost-saver, not the default
Parallelise supplemental recall with primary	+19% slower at p50	RocksDB column-family contention; naive `tokio::spawn_blocking` does not help I/O-bound retrieval
Sonnet 4.6 as the answerer	92.8% (-1.0 pp), -74% cost, -25% generation latency	Ship as opt-in alt-tier, not the canonical default
Lower `max_recall_tokens` from 8,000 to 4,000	(untested)	Plausible +50 to +150 ms savings, -1 to -3 pp accuracy
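
For readers unfamiliar with what the `max_recall_tokens` budget in the last row controls, here is a minimal sketch of token-budgeted packing. It assumes a greedy policy that walks chunks in rank order and skips any that do not fit. The post does not describe CortexDB's actual packing policy, and `pack_chunks` and `rough_token_count` are hypothetical names.

```python
def rough_token_count(text):
    """Crude whitespace-based proxy; a real tokenizer would replace this."""
    return len(text.split())


def pack_chunks(ranked_chunks, max_recall_tokens, count_tokens=rough_token_count):
    """Greedily keep chunks in rank order while staying within the budget.

    A chunk that does not fit is skipped rather than ending the loop, so a
    smaller, lower-ranked chunk can still use the remaining budget.
    """
    packed, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_recall_tokens:
            continue
        packed.append(chunk)
        used += cost
    return packed, used


# Hypothetical retrieved chunks, already sorted best-first.
chunks = [
    "user moved sneakers to the hall closet",      # 7 tokens
    "assistant recommended ackee and saltfish with fried dumplings",  # 8 tokens
    "bedroom repainted sage green",                # 4 tokens
]

packed, used = pack_chunks(chunks, max_recall_tokens=12)
print(len(packed), used)  # 2 11
```

Halving the budget shrinks the context the answerer sees, which is the mechanism behind the latency saving and accuracy loss estimated in that row.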

The meta-lesson: most "obvious" quality levers lose when applied broadly. The wins are narrow and defensive: judge fixes, type-routed prompts, strict-where-strict-matters scope control.

Reproducing these results

These results come from our internal pipeline runs and are reported in public docs. A public reproduction repository that executes these commands against raw artifacts is not yet available. Internally, the benchmark runs via a single script:

export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export COHERE_API_KEY=...  # only needed when re-enabling the reranker

./benchmarks/longmemeval/reproduce.sh

Every run emits a RUN_MANIFEST with the git SHA, the exact config, and API-key hashes (not the keys) for auditable history. We are working toward a public repository.
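
To illustrate what hashing the keys buys, here is a minimal sketch of a manifest builder. The field names, the 12-character SHA-256 fingerprint, and the helper names are all assumptions for illustration; the real RUN_MANIFEST format is not shown in this post.

```python
import hashlib
import json
from datetime import datetime, timezone


def key_fingerprint(secret):
    """Short SHA-256 fingerprint: identifies a key without revealing it."""
    return hashlib.sha256(secret.encode("utf-8")).hexdigest()[:12]


def build_manifest(git_sha, config, api_keys):
    """Assemble a run record; the field names here are hypothetical."""
    return {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "git_sha": git_sha,
        "config": config,
        "api_key_hashes": {
            name: key_fingerprint(value) for name, value in api_keys.items()
        },
    }


manifest = build_manifest(
    git_sha="0000000",  # placeholder, e.g. from `git rev-parse --short HEAD`
    config={"CORTEX_ANSWER_MODEL": "claude-sonnet-4-6", "max_recall_tokens": 8000},
    api_keys={"OPENAI_API_KEY": "sk-demo", "ANTHROPIC_API_KEY": "sk-ant-demo"},
)
print(json.dumps(manifest, indent=2))
```

Two runs made with the same key produce the same fingerprint, so an auditor can tell which account paid for a run without the key ever landing in a log.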

What this proves and what it does not

What 93.8% proves. The CortexDB production pipeline, the same code path that serves real /v1/answer traffic, is competitive with every honest published memory system on LongMemEval-S. CortexDB sits above Mem0's self-reported number on the benchmark that matters for assistant memory.

What 93.8% does not prove. Speed and durability beyond LongMemEval are different concerns. LongMemEval is saturating; the next frontier is longitudinal durability, where no published baseline yet exists.

Honest counter-balance. On the LoCoMo benchmark, Mem0's self-reported April 2026 score is 91.6% and CortexDB's current best is 86.9% (cats 1 to 4). CortexDB trails Mem0 on LoCoMo by 4.7 points. The next round of recall-side work is focused there.

Frequently asked questions

What is LongMemEval and why does it matter?

LongMemEval is a 500-question benchmark from ICLR 2025 that tests a chat assistant's ability to answer questions about ~115K tokens of prior conversation. It exercises six memory capabilities: knowledge update, multi-session aggregation, temporal reasoning, single-session assistant recall, single-session user recall, and single-session preference. The benchmark matters because competitors publish on it, making comparisons apples-to-apples.

How did CortexDB hit 93.8% on LongMemEval-S?

The result comes from running the production server pipeline: per-session LLM fact extraction on the ingest side via gpt-4o-mini, hybrid HNSW + Tantivy retrieval with Reciprocal Rank Fusion, token-budgeted packing, type-routed answer prompts on Claude Opus 4.6, and self-consistency voting on the multi-session and temporal-reasoning categories.

Is the 93.8% number reproducible?

CortexDB reports these internal results in public docs. A public reproduction repository for this specific benchmark suite is not yet available. Internally, a single command reproduces the full result with ~$50 of API credits over ~2 hours on 4-way parallelism.

How does CortexDB compare to Mem0 on LongMemEval?

CortexDB reports 93.8% (469 of 500) against Mem0's self-reported 93.4% on LongMemEval-S, leading by 0.4 percentage points (the equivalent of +2 out of 500 questions). On the LoCoMo benchmark Mem0 currently leads at 91.6% against CortexDB's 86.9% across categories 1 to 4.

What does the per-question cost break down to?

$0.12 per question on average, $49.69 for the full 500-question run. The cost-saver alternative tier (Claude Sonnet 4.6 in place of Claude Opus 4.6 as the answerer) drops the run cost to $15.71 with a 1.0-point accuracy trade-off.

What is server-parity in this context?

Server-parity means every benchmark question runs through the same POST /v1/answer endpoint that serves production traffic, exercising the WAL, RocksDB, HNSW, Tantivy, and Cognitive Recall. It contrasts with research-mode pipelines that bypass the server for in-memory evaluation.

Where can the methodology be audited?

We plan to release a public reproduction repository containing the methodology, reproduction guide, and latency/cost findings soon.