How did CortexDB hit 93.8% on LongMemEval-S?
In a single run against the production server with no in-memory shortcuts, CortexDB achieved 93.8% (469 of 500) on LongMemEval-S, costing $49.69 in API credits and placing it above Mem0's self-reported 93.4%. CortexDB—the long-term memory layer for AI agents built by Apache Cassandra co-creator Prashant Malik—relies on a lossless event-sourced architecture and 4-channel hybrid retrieval (WAL, RocksDB, HNSW, Tantivy) to hit this mark on the ICLR 2025 benchmark for long-term assistant memory.
93.8%
469 / 500 correct on LongMemEval-S, single run, no retry targeting.
~2 hours
End-to-end wall clock (server-parity, 4-way parallelism).
$49.69
Total API cost to run the benchmark from scratch.
The score is the server-parity result: every question runs through the production CortexDB API (WAL, RocksDB, HNSW, Tantivy, and Cognitive Recall). It supersedes the earlier 92.8% number, which used an in-memory research pipeline that bypassed the server. This post walks through the stack, the costs, the per-category breakdown, the leaderboard position, and the experiments that did not work.
Why LongMemEval is the benchmark to take seriously
LongMemEval (ICLR 2025) tests six distinct memory capabilities in 500 hand-written, time-stamped question-answer pairs against ~115K tokens of prior conversation per question.
| Capability | Example |
|---|---|
| Knowledge update | "Where do I currently keep my old sneakers?" (the user said one place in March, a different place in May) |
| Multi-session aggregation | "How many weddings have I attended this year?" |
| Temporal reasoning | "How many weeks ago did I attend the 'Summer Nights' festival?" |
| Single-session assistant recall | "What was the Jamaican dish you recommended I try?" |
| Single-session user recall | "What colour did I repaint my bedroom walls?" |
| Single-session preference | "I'm planning a trip to Denver, suggestions tailored to my prior preferences?" |
The benchmark is worth taking seriously because competitors publish on it (direct apples-to-apples), each question carries ~50 prior sessions of noise that retrieval has to cut through, six question types force breadth so a single-trick stack cannot cheat, and the gold labels are mostly clean (with a few documented edge cases).
The pipeline running on the production server
Every question runs through the same end-to-end path. No fine-tuning. No custom models. All primitives that can be purchased today.
PER-QUESTION PIPELINE (production server)
┌──────────────────── WRITE (one-time per dataset) ────────────────────┐
│ raw sessions │
│ ↓ │
│ per-session LLM fact extraction · gpt-4o-mini × 8 workers │
│ ↓ │
│ chunks (raw + FACT/EVENT extracts) │
│ ↓ │
│ OpenAI text-embedding-3-small (1536 dims, configurable to Cohere) │
│ ↓ │
│ CortexDB write path: WAL → RocksDB → HNSW + Tantivy │
└────────────────────────────────────────────────────────────────────────┘
┌──────────────────── READ (per question, ~9.7 s p50) ─────────────────┐
│ HyDE passage + decomposed sub-queries │
│ ↓ │
│ HNSW vector + Tantivy BM25 in parallel · RRF fusion │
│ ↓ │
│ token-budgeted packing (max_recall_tokens) │
│ ↓ │
│ optional Cohere rerank-v3.5 (off in the canonical 93.8% run) │
│ ↓ │
│ type-routed prompt + Claude Opus 4.6 │
│ ↓ │
│ self-consistency on multi-session + temporal + low-confidence: │
│ 2 extra Claude samples @ temp 0.4, 0.7 → majority vote │
└────────────────────────────────────────────────────────────────────────┘
↓
gpt-4.1 LLM-as-judge (CORRECT / WRONG)
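The fusion and packing steps in the read path above can be sketched in a few lines. This is a minimal illustration, not CortexDB's implementation: the function names, the chunk ids, the token counts, and the constant k=60 (the value commonly used for Reciprocal Rank Fusion) are all assumptions, and the greedy packing policy is one plausible reading of "token-budgeted packing".

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of chunk ids with Reciprocal Rank Fusion.

    A chunk scores sum(1 / (k + rank)) over every list it appears in,
    with rank starting at 1. k=60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def pack_to_budget(ordered_ids, token_counts, max_recall_tokens):
    """Greedily keep the best-ranked chunks that still fit in the token budget."""
    packed, used = [], 0
    for cid in ordered_ids:
        cost = token_counts[cid]
        if used + cost > max_recall_tokens:
            continue  # too large for the remaining budget; a later chunk may fit
        packed.append(cid)
        used += cost
    return packed, used


# Hypothetical result orders from the two indexes.
vector_hits = ["c3", "c1", "c7"]  # HNSW (semantic) ranking
bm25_hits = ["c1", "c9", "c3"]    # BM25 (keyword) ranking
fused = rrf_fuse([vector_hits, bm25_hits])

token_counts = {"c1": 300, "c3": 500, "c7": 200, "c9": 400}
packed, used = pack_to_budget(fused, token_counts, max_recall_tokens=1000)
```

Chunks that both indexes return (c1 and c3 here) rise to the top, which is the property that lets the keyword and semantic channels cover for each other.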
Five ideas carry most of the result.
- LLM fact extraction on the ingest side. gpt-4o-mini extracts atomic FACT and EVENT triples per session and stores them alongside the raw text. The extraction runs asynchronously, off the user-facing write path.
- Hybrid HNSW + BM25 retrieval. Both indexes run in parallel and fuse with Reciprocal Rank Fusion. BM25 is essential for keyword-heavy questions ("painting of a sunset worth"). HNSW carries the semantic load. Together they cover the query distribution that single-channel retrieval misses.
- Type-routed answer prompts. Six system prompts, one per question type. Knowledge-update gets a "most-recent-date-wins" protocol. Multi-session gets an explicit SCOPE → ENUMERATE → DEDUPE → COUNT procedure. Temporal gets a timeline preamble. Each prompt encodes the answer protocol for its category.
- Self-consistency on the hard categories. For multi-session and temporal-reasoning, the pipeline generates two additional Claude samples at temperature 0.4 and 0.7, then majority-votes on the extracted final answer.
- Server-parity, not research-mode. The 93.8% number is what POST /v1/answer returns. The same code path serves production traffic.
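The self-consistency step described in the list above can be sketched as follows. This is an illustration under stated assumptions: `sample_fn` is a hypothetical wrapper around the answer model, the anchor temperature of 0.0 and the fall-back-to-anchor tie-break are guesses, and the answer normalization is deliberately simple. Only the two extra temperatures and the routed categories come from the description above.

```python
from collections import Counter

# Categories that get extra samples, per the description above.
VOTED_TYPES = {"multi-session", "temporal-reasoning"}
EXTRA_TEMPERATURES = (0.4, 0.7)


def normalize(answer):
    """Lowercase and collapse whitespace so trivially different answers match."""
    return " ".join(answer.lower().split())


def answer_with_voting(question_type, sample_fn):
    """Return one final answer, majority-voting on the hard categories.

    sample_fn(temperature) -> extracted final answer (hypothetical LLM wrapper).
    """
    anchor = sample_fn(0.0)  # anchor temperature is an assumption
    if question_type not in VOTED_TYPES:
        return anchor
    candidates = [anchor] + [sample_fn(t) for t in EXTRA_TEMPERATURES]
    counts = Counter(normalize(c) for c in candidates)
    winner, votes = counts.most_common(1)[0]
    if votes == 1:
        return anchor  # all three disagree: keep the anchor (assumed tie-break)
    # Return the first candidate whose normalized form won the vote.
    return next(c for c in candidates if normalize(c) == winner)


# Stubbed samples standing in for three model calls.
canned = {0.0: "3 weddings", 0.4: "4 weddings", 0.7: "4 Weddings"}
voted = answer_with_voting("multi-session", canned.get)
unvoted = answer_with_voting("single-session-user", canned.get)
```

With these stubbed samples the two extra samples outvote the anchor on the multi-session question, while the single-session question returns the anchor after a single call.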
Per-category scores
single-session-assistant · 100%
56 / 56. Perfect recall of what the assistant previously said. Long assistant responses are chunked so nothing gets truncated.
knowledge-update · 97.4%
76 / 78. "Sort by date, most recent wins" prompt plus session retrieval handles the category cleanly.
single-session-user · 95.7%
67 / 70. 3 wrong. Claude occasionally hedges when the answer is explicit in the source.
single-session-preference · 93.3%
28 / 30. Tailored recommendations. A judge-truncation bug was fixed where preference answers were scored only on their closing pleasantry.
temporal-reasoning · 91.7%
122 / 133. 11 wrong. Most failures are "what is the order of these N events" where chronology is present in retrieval but Claude misorders.
multi-session · 90.2%
120 / 133. 13 wrong. Scope disambiguation makes this the hardest category. This push crossed the 90% line (up from 85.7% on the in-memory baseline).
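As a quick consistency check, the per-category counts above can be tallied to confirm that they add up to the headline figure. The counts are copied from this section; the variable and function names are only for illustration.

```python
# (correct, total) per category, copied from the scores listed above.
results = {
    "single-session-assistant": (56, 56),
    "knowledge-update": (76, 78),
    "single-session-user": (67, 70),
    "single-session-preference": (28, 30),
    "temporal-reasoning": (122, 133),
    "multi-session": (120, 133),
}


def pct(correct, total):
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)


per_category = {name: pct(c, t) for name, (c, t) in results.items()}
total_correct = sum(c for c, _ in results.values())
total_questions = sum(t for _, t in results.values())
overall = pct(total_correct, total_questions)

for name, score in per_category.items():
    print(f"{name}: {score}%")
print(f"overall: {total_correct} / {total_questions} = {overall}%")
```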
Position on the leaderboard
| Rank | System | Score | Notes |
|---|---|---|---|
| 1 | Supermemory ensemble | 98.60% | 8 specialist prompts voting |
| 2 | MemPalace hybrid* | 96.60% | ChromaDB + per-question tuning |
| 3 | AgentMemory (J. McCann) | 96.20% | 6-signal hybrid + Claude Opus |
| 4 | OMEGA task-weighted | 95.40% | Local-first, GPT-4.1 |
| 5 | Mastra Observational Memory | 94.87% | Observer/Reflector, gpt-5-mini |
| 6 | CortexDB (this work) | 93.80% | Claude Opus 4.6 + hybrid retrieval, server-parity |
| 7 | Mem0 (self-reported) | 93.40% | |
| 8 | OMEGA raw | 93.20% | |
| 9 | Chronos | 92.60% | Dual turn + event retrieval |
| 10 | Hindsight | 91.40% | 4-network structured memory |
| 11 | Emergence AI | 86.00% | Session-level NDCG RAG |
| n/a | Oracle GPT-4o (upper bound) | 82.40% | |
| 12 | EverMemOS | 82.00% | |
* MemPalace's 96.6% is disputed. The community has documented per-question hand-engineering.
We report CortexDB at 93.80%, 0.4 points above Mem0's self-reported 93.4% (the equivalent of 2 out of 500 questions). The systems ranked above CortexDB are ensembles and multi-prompt configurations. Those ranked below are essentially every published single-answerer system.
Cost and latency
Per-question p50 measurements from the canonical run (stage_phase1_strict_recheck_20260508_201609):
| Phase | p50 |
|---|---|
| server_recall_ms | 2,868 ms (HNSW + Tantivy + RRF + scoring + packing) |
| server_gen_ms | 2,545 ms (Claude Opus 4.6 anchor + verify samples where triggered) |
| Overhead (verifier swap, retries, extra recalls) | 3,632 ms |
| Total | 9,707 ms |
Cost averages $0.12 per question, $49.69 for the full 500-question run.
Cost-saver alt-tier. Swap Claude Sonnet 4.6 in as the answerer via CORTEX_ANSWER_MODEL=claude-sonnet-4-6. The trade: -1.0 percentage point accuracy (92.8% / 464 of 500), -74% cost ($15.71 per 500-question run), -25% generation latency. Single environment variable, same code path. Documented in docs/LATENCY_COST_FINDINGS_2026_05_10.md.
What was tried that did not work
The negative results matter as much as the positive ones because they let future latency and accuracy work skip cycles. Highlights:
| Lever | Result | Lesson |
|---|---|---|
| Disable supplemental + adaptive recall passes | 96.7% (-1.6 pp) on balanced60, -19% cost, -56% tail latency | Worth shipping as an opt-in cost-saver, not the default |
| Parallelise supplemental recall with primary | +19% slower at p50 | RocksDB column-family contention; naive tokio::spawn_blocking does not help I/O-bound retrieval |
| Sonnet 4.6 as the answerer | 92.8% (-1.0 pp), -74% cost, -25% generation latency | Ship as opt-in alt-tier, not the canonical default |
| Lower max_recall_tokens from 8,000 to 4,000 | (untested) | Plausible +50 to +150 ms savings, -1 to -3 pp accuracy |
The meta-lesson: most "obvious" quality levers lose when applied broadly. The wins are narrow and defensive: judge fixes, type-routed prompts, strict-where-strict-matters scope control.
Reproducing these results
These results come from our internal pipeline runs and are reported in our public docs. A public reproduction repository that runs these commands against the raw artifacts is not yet available. Internally, the full benchmark runs via a single script:
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export COHERE_API_KEY=... # only needed when re-enabling the reranker
./benchmarks/longmemeval/reproduce.sh
Every run emits a RUN_MANIFEST with the git SHA, the exact config, and API-key hashes (not the keys) for auditable history. We are working toward a public repository.
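A manifest of this kind can be sketched as below. The actual RUN_MANIFEST format is not shown in this post, so the field names, the 16-character SHA-256 fingerprint, and the example values are assumptions made for illustration. The point of the sketch is that hashing the keys lets two runs be compared without the keys themselves ever being written to disk.

```python
import hashlib
import json


def key_fingerprint(api_key):
    """Short SHA-256 fingerprint of an API key (the key itself is never stored)."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()[:16]


def build_manifest(git_sha, config, api_keys):
    """Assemble a manifest dict; the field names here are hypothetical."""
    return {
        "git_sha": git_sha,
        "config": config,
        "api_key_hashes": {
            name: key_fingerprint(value) for name, value in sorted(api_keys.items())
        },
    }


manifest = build_manifest(
    git_sha="0000000",  # placeholder; a real run would read `git rev-parse HEAD`
    config={"CORTEX_ANSWER_MODEL": "claude-sonnet-4-6", "max_recall_tokens": 8000},
    api_keys={"OPENAI_API_KEY": "sk-example", "ANTHROPIC_API_KEY": "sk-ant-example"},
)
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
print(manifest_json)
```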
What this proves and what it does not
What 93.8% proves. The CortexDB production pipeline, the same code path that serves real /v1/answer traffic, is competitive with every honest published memory system on LongMemEval-S. We report that CortexDB scores above Mem0's self-reported number on the benchmark that matters for assistant memory.
What 93.8% does not prove. Speed and durability beyond LongMemEval are different concerns. LongMemEval is saturating; the next frontier is longitudinal durability, where no published baseline yet exists.
Honest counter-balance. On the LoCoMo benchmark, Mem0's self-reported April 2026 score is 91.6% and CortexDB's current best is 86.9% (cats 1 to 4). CortexDB trails Mem0 on LoCoMo by 4.7 points. The next round of recall-side work is focused there.
Frequently asked questions
What is LongMemEval and why does it matter?
LongMemEval is a 500-question benchmark from ICLR 2025 that tests a chat assistant's ability to answer questions about ~115K tokens of prior conversation. It exercises six memory capabilities: knowledge update, multi-session aggregation, temporal reasoning, single-session assistant recall, single-session user recall, and single-session preference. The benchmark matters because competitors publish on it, making comparisons apples-to-apples.
How did CortexDB hit 93.8% on LongMemEval-S?
The result comes from running the production server pipeline: per-session LLM fact extraction on the ingest side via gpt-4o-mini, hybrid HNSW + Tantivy retrieval with Reciprocal Rank Fusion, token-budgeted packing, type-routed answer prompts on Claude Opus 4.6, and self-consistency voting on the multi-session and temporal-reasoning categories.
Is the 93.8% number reproducible?
CortexDB reports these internal results in public docs. A public reproduction repository for this specific benchmark suite is not yet available. Internally, a single command reproduces the full result with ~$50 of API credits over ~2 hours on 4-way parallelism.
How does CortexDB compare to Mem0 on LongMemEval?
CortexDB reports 93.8% (469 of 500) against Mem0's self-reported 93.4% on LongMemEval-S, leading by 0.4 percentage points (the equivalent of +2 out of 500 questions). On the LoCoMo benchmark Mem0 currently leads at 91.6% against CortexDB's 86.9% across categories 1 to 4.
What does the per-question cost break down to?
$0.12 per question on average, $49.69 for the full 500-question run. The cost-saver alternative tier (swap Claude Sonnet 4.6 for Claude Opus 4.6 as the answerer) drops the run cost to $15.71 with a 1.0-point accuracy trade-off.
What is server-parity in this context?
Server-parity means every benchmark question runs through the same POST /v1/answer endpoint that serves production traffic, exercising the WAL, RocksDB, HNSW, Tantivy, and Cognitive Recall. It contrasts with research-mode pipelines that bypass the server for in-memory evaluation.
Where can the methodology be audited?
We plan to release a public reproduction repository containing the methodology, reproduction guide, and latency/cost findings soon.