CortexDB v1 on the Public Memory Benchmarks
Reproducible scores on LongMemEval-S and LoCoMo, the two standardized long-term-memory benchmarks the research community has rallied around, with full methodology, per-category breakdowns, and a one-command repro path.
Abstract
We report CortexDB v1's scores on the two public, standardized long-term-memory benchmarks for conversational AI systems: LongMemEval-S (ICLR 2025; 500 questions across six skill categories) and LoCoMo (NAACL 2024; 1,540 QA pairs across long conversations). On LongMemEval-S, CortexDB v1 reaches 93.8% accuracy (469/500) with the server-parity pipeline, exceeding the published Mem0 result (93.4%) and placing it at the top of the honest leaderboard (defined in §3.1). On LoCoMo categories 1–4 (the answerable categories the field reports on), CortexDB v1 reaches 86.9% (1,339/1,540) with the same write-path code that ships in production. Both runs are single-attempt, with no retry targeting and no gold-oracle leakage; cost and wall-clock are reported per run, and each run is reproducible from one command. We describe the methodology, the per-category breakdowns, the architectural drivers and ablations (including the Cohere rerank-v3.5 swap and the v1 typed FactStore), and the threats to validity.
1. Why benchmark at all
A long-term memory layer for AI agents is straightforward to demo and very hard to verify. Anyone can construct a 10-message conversation, store it, retrieve it, and claim "memory works." The interesting questions — does this still work across hundreds of sessions; does it answer temporal questions correctly; does it cite the right turn out of thousands — only show up at scale, on test sets the system author didn't design.
Two public benchmarks now serve this role:
- LongMemEval-S (Wu et al., 2025): 500 hand-labeled questions across six skill categories (single-session-assistant, single-session-user, single-session-preference, multi-session, knowledge-update, temporal-reasoning). The "-S" panel uses ~115k input tokens per question on average, sized to fit the working set of a real production memory.
- LoCoMo (Maharana et al., 2024): 1,540 QA pairs across very-long-form conversations (on average roughly 300 turns and 9K tokens per conversation, spread over up to 35 sessions). Four answerable categories: single-hop, multi-hop, temporal, open-domain.
We report both. We use the official datasets, the official evaluator scripts (LLM-judge for both), and the same write-path that ships in production CortexDB v1.
2. Headline numbers
| Benchmark | Score | Detail | Cohort | Date |
|---|---|---|---|---|
| LongMemEval-S | 93.8% | 469 / 500 | server parity (production write path) | 2026-05-16 |
| LoCoMo (cats 1–4) | 86.9% | 1,339 / 1,540 | server parity | 2026-05-12 |
Both runs are single-attempt, no retry targeting, no gold-oracle leakage. Each is reproducible from a single command against the cortexv2 repository.
3. Methodology
3.1 What "server parity" means
Many published results on these benchmarks use bespoke evaluator-only code paths that don't survive contact with a real ingestion pipeline (LLM-rewritten memories, hand-tuned per-question retrieval, oracle-fed gold snippets). We define server parity as a run where:
- Every memory is written via POST /v1/experience, the same endpoint a paying customer hits.
- Every retrieval goes through POST /v1/recall and POST /v1/answer, the same endpoints exposed to the public SaaS.
- No question-specific tuning. The same scope, the same view, the same diagnostics setting across every question in the panel.
- No oracle access at retrieval time. The retriever sees only the embedded conversation, never the question's gold snippets.
Server parity is the honest leaderboard. Numbers reported below are all server-parity unless noted.
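To make the constraint concrete, here is a minimal sketch of the calls a server-parity harness would issue for one question. Only the three endpoint paths come from the text above; the host name and the payload field names (`scope`, `text`, `question`) are hypothetical placeholders, and the requests are built but never sent.

```python
import json
from urllib.request import Request

BASE_URL = "https://cortexdb.example.invalid"  # hypothetical host, not from the report


def build_request(path: str, payload: dict) -> Request:
    """Build (but do not send) a JSON POST to one of the public endpoints."""
    body = json.dumps(payload).encode("utf-8")
    return Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def requests_for_question(scope: str, turns: list[str], question: str) -> list[Request]:
    """One write per turn, then one recall and one answer call.

    Payload field names are illustrative assumptions. The gold answer is
    never part of any payload, which is the "no oracle access" rule.
    """
    writes = [build_request("/v1/experience", {"scope": scope, "text": t}) for t in turns]
    recall = build_request("/v1/recall", {"scope": scope, "question": question})
    answer = build_request("/v1/answer", {"scope": scope, "question": question})
    return writes + [recall, answer]


reqs = requests_for_question("panel-q1", ["hi", "my dog is Rex"], "What is my dog's name?")
for r in reqs:
    print(r.get_method(), r.full_url)
```

The point of the sketch is that the harness has no code path other than the three public endpoints, so anything it measures is something a customer could also observe.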
3.2 Models and components
| Component | Used | Rationale |
|---|---|---|
| Answer model | Claude Opus 4.6 | Strongest general-purpose answerer at panel build time |
| Embedding model | OpenAI text-embedding-3-small | Cost / quality sweet spot at our retrieval budget |
| Cross-encoder reranker | Cohere rerank-v3.5 | Replaced gpt-4o-mini reranker in v4 (+0.2 pp on LongMemEval-S) |
| Fact extraction | Claude Opus 4.6 (async, write-path) | Same model the consolidator uses in prod |
| Question judge | GPT-4o (per LongMemEval & LoCoMo conventions) | Held constant; required by the official scripts |
All models were reachable through their public APIs at run time. No proprietary weights, no fine-tuning, no per-benchmark prompts.
3.3 Pipeline configuration
CortexDB v1 was configured with the same defaults the public SaaS uses:
- Capture: raw turns appended to the WAL via POST /v1/experience (one experience per turn).
- Extract: the LLM extractor pulls subject/predicate/object triples into the Facts layer.
- Reconcile: the bi-temporal reconciler resolves contradictions (Bob said yes Tuesday, no Friday → newest wins, with the older version preserved as valid_until=Tuesday).
- Recall: holistic view with diagnostics="none", include=["events","episodes","facts","beliefs"], default token budget.
- Answer: Claude Opus 4.6 with citations enabled.
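The Reconcile step can be sketched as a "newest wins, nothing deleted" update over typed facts. This is an illustration only: the `Fact` class and `reconcile` function are hypothetical names, and which timestamp closes the superseded fact is a policy choice, so the sketch takes it as an explicit `closed_at` argument rather than guessing CortexDB's rule.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    obj: str
    valid_from: date
    valid_until: Optional[date] = None  # None means "currently believed true"


def reconcile(store: list[Fact], new: Fact, closed_at: date) -> list[Fact]:
    """Newest wins: close any open, conflicting fact instead of deleting it."""
    out = []
    for f in store:
        conflicts = (
            (f.subject, f.predicate) == (new.subject, new.predicate)
            and f.valid_until is None
            and f.obj != new.obj
        )
        out.append(replace(f, valid_until=closed_at) if conflicts else f)
    out.append(new)
    return out


# Bob says yes on a Tuesday, then no on the following Friday.
store = [Fact("Bob", "rsvp", "yes", date(2026, 3, 3))]
store = reconcile(store, Fact("Bob", "rsvp", "no", date(2026, 3, 6)), closed_at=date(2026, 3, 6))

current = [f for f in store if f.valid_until is None]
print(current[0].obj, len(store))
```

Both versions remain in the store after reconciliation; only the open one answers "what does Bob say now?", while the closed one stays available for historical questions.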
4. LongMemEval-S results
4.1 Per-category breakdown
| Category | Score | Detail |
|---|---|---|
| single-session-assistant | 100.0% | 56 / 56 |
| knowledge-update | 97.4% | 76 / 78 |
| single-session-user | 95.7% | 67 / 70 |
| single-session-preference | 93.3% | 28 / 30 |
| temporal-reasoning | 91.7% | 122 / 133 |
| multi-session | 90.2% | 120 / 133 |
| Overall | 93.8% | 469 / 500 |
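The overall figure is a micro-average (total correct over total questions), not the mean of the six category percentages. A quick check from the counts in the table, with the unweighted macro-average shown for contrast:

```python
# (correct, total) per category, copied from the table above
counts = {
    "single-session-assistant": (56, 56),
    "knowledge-update": (76, 78),
    "single-session-user": (67, 70),
    "single-session-preference": (28, 30),
    "temporal-reasoning": (122, 133),
    "multi-session": (120, 133),
}

correct = sum(c for c, _ in counts.values())
total = sum(n for _, n in counts.values())

micro = 100 * correct / total  # the headline number
macro = sum(100 * c / n for c, n in counts.values()) / len(counts)  # unweighted mean

print(f"{correct}/{total}  micro={micro:.1f}%  macro={macro:.1f}%")
```

Because the two largest categories (temporal-reasoning and multi-session, 133 questions each) are also the two lowest-scoring, the micro-average sits below the macro-average.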
4.2 Comparison to public results
| System | LongMemEval-S | Notes |
|---|---|---|
| CortexDB v1 (this work, server parity) | 93.8% | Claude Opus 4.6 + hybrid retrieval + Cohere rerank-v3.5 |
| Mem0 (published) | 93.4% | As reported in the Mem0 paper |
| LangMem (published) | 75.6% | LangChain memory adapter |
| MemGPT (published) | 69.3% | OS-style virtual context |
| GPT-4o long context (no memory layer) | 56.7% | Stuff every turn into the prompt |
| No memory baseline | 22.8% | Question + system message only |
CortexDB v1's lead over Mem0 is narrow (+0.4 pp), but the CortexDB v1 number is the first reported result that uses an unmodified production write path; other systems' published results typically rely on bespoke evaluation harnesses.
4.3 Cost and wall-clock
| Resource | Total |
|---|---|
| Wall clock | 2h 2m |
| LLM cost (write-path extraction) | $18.42 |
| LLM cost (read-path answer + judge) | $24.71 |
| Reranker (Cohere) | $4.56 |
| Embedding (OpenAI) | $2.00 |
| Total | $49.69 |
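As a sanity check on the table, the line items can be summed and divided by the 500 questions in the panel. The sketch uses `Decimal` so the cents add up exactly; the per-question figure is derived here and is not a number the run itself reports.

```python
from decimal import Decimal

# Line items copied from the cost table above (USD)
costs = {
    "write-path extraction": Decimal("18.42"),
    "read-path answer + judge": Decimal("24.71"),
    "reranker": Decimal("4.56"),
    "embedding": Decimal("2.00"),
}
QUESTIONS = 500  # size of the LongMemEval-S panel

total = sum(costs.values(), Decimal("0"))
per_question = total / QUESTIONS

print(f"total=${total}  per question=${per_question.quantize(Decimal('0.0001'))}")
```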
Reproducible end-to-end from benchmarks/longmemeval/RESULTS.md in the cortexv2 repo with a single invocation; the artifacts (cluster replay inputs, per-question pred/gold pairs, judge transcripts) are committed alongside.
5. LoCoMo results
5.1 Per-category breakdown
| Category | Score | Detail |
|---|---|---|
| Cat 4 — Single-hop | 91.6% | 770 / 841 |
| Cat 2 — Temporal | 87.9% | 282 / 321 |
| Cat 1 — Multi-hop | 79.8% | 225 / 282 |
| Cat 3 — Open-domain | 64.6% | 62 / 96 |
| Cats 1–4 overall | 86.9% | 1,339 / 1,540 |
We report categories 1–4 because category 5 ("adversarial") consists of questions whose answer is "I don't know"; the LoCoMo paper itself notes that cat 5 score is not comparable across systems because most systems' refusal behavior is upstream-prompt-dependent rather than memory-system-dependent.
5.2 Where the gap is
Cat 3 (open-domain) is the lowest at 64.6%. The category mixes questions whose answers depend on the conversation with questions whose answers depend on world knowledge the model has out-of-band. CortexDB v1 doesn't currently distinguish these cases; the LLM picks an answer source based on what the recall pack contains. Cat 3 is the headline target for V2's reasoning trace work.
6. Architectural drivers
Three architectural choices explain most of the lead over the field:
6.1 Bi-temporal Facts layer (drives temporal-reasoning categories)
CortexDB's Facts layer stores every extracted triple with both valid_from / valid_until (when the fact was true) and recorded_from / recorded_to (when the system learned about it). A question like "What was Bob's role on March 3rd?" becomes a typed-store lookup against valid_from <= 2026-03-03 < valid_until, returning the correct historical role even if Bob's role has changed since. Systems that store memory as freeform text, including most "LLM-rewriting" memory layers, collapse this into a single current-state summary at write time, which permanently loses the answer to as-of questions.
This is the largest single contributor on LongMemEval-S's temporal-reasoning category (91.7%) and on LoCoMo cat 2 (87.9%).
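The as-of lookup described above can be sketched in a few lines. This is an illustration under assumptions: the `RoleFact` class, the `role_as_of` helper, and the sample roles and dates are all hypothetical, and only the valid-time axis is modeled (the recorded-time axis is omitted for brevity).

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class RoleFact:
    subject: str
    role: str
    valid_from: date
    valid_until: Optional[date]  # None means the fact is still valid


def role_as_of(facts: list[RoleFact], subject: str, on: date) -> Optional[str]:
    """Return the role whose validity interval contains `on`.

    Implements the half-open predicate valid_from <= on < valid_until.
    """
    for f in facts:
        starts_before = f.valid_from <= on
        ends_after = f.valid_until is None or on < f.valid_until
        if f.subject == subject and starts_before and ends_after:
            return f.role
    return None


# Illustrative data, not taken from either benchmark
facts = [
    RoleFact("Bob", "engineer", date(2025, 6, 1), date(2026, 4, 1)),
    RoleFact("Bob", "manager", date(2026, 4, 1), None),
]

print(role_as_of(facts, "Bob", date(2026, 3, 3)))  # historical answer
print(role_as_of(facts, "Bob", date(2026, 5, 1)))  # current answer
```

A store that kept only the latest summary ("Bob is a manager") would have no way to produce the first answer, which is the failure mode the paragraph above describes.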
6.2 Cross-encoder reranker (drives single-session categories)
The retrieval layer is hybrid (BM25 + HNSW + graph traversal). The reranker (Cohere rerank-v3.5) sorts the top-50 candidates by question-relevance using a cross-encoder. Replacing the earlier gpt-4o-mini reranker with Cohere rerank-v3.5 added +0.2 pp on LongMemEval-S overall, with most of the gain in the single-session-* categories where there's a single best span and the ranking is what matters.
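The rerank stage has a simple shape: take the top-50 candidates from hybrid retrieval, score each against the question, and keep the best few. In the sketch below the scorer is a toy token-overlap function standing in for a cross-encoder; it is not how Cohere rerank-v3.5 scores, and the `keep` cutoff is an assumed parameter.

```python
from typing import Callable


def rerank(
    question: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    top_k: int = 50,
    keep: int = 5,
) -> list[str]:
    """Re-order the first `top_k` candidates by score, highest first."""
    pool = candidates[:top_k]
    return sorted(pool, key=lambda doc: score(question, doc), reverse=True)[:keep]


def overlap_score(question: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of question tokens found in doc."""
    q = set(question.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / (len(q) or 1)


candidates = [
    "bob likes tea",
    "alice said she will move to lisbon",
    "alice likes coffee",
]
best = rerank("where did alice move", candidates, overlap_score, keep=2)
print(best)
```

Swapping the scorer is the only change needed to compare rerankers, which is the kind of substitution the gpt-4o-mini versus Cohere comparison above describes.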
6.3 Hybrid recall (drives multi-hop)
Multi-hop questions ("Who fixed the issue Alice reported?") require connecting two facts that don't share embedding-space neighborhood. The graph-traversal stage of recall follows entity edges from the seed match (the issue Alice reported) to the answer (the person who closed it). Disabling the graph stage drops the multi-hop category by ~13 pp.
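A bounded breadth-first walk over entity edges shows why this helps. The graph contents, relation names, and the `reachable` helper below are illustrative assumptions, not CortexDB's internal representation.

```python
from collections import deque

# Illustrative entity graph: node -> list of (relation, neighbor)
graph = {
    "Alice": [("reported", "issue-42")],
    "issue-42": [("fixed_by", "Carol")],
}


def reachable(graph: dict, seed: str, max_hops: int) -> dict:
    """Breadth-first walk from `seed`; maps each reached node to its edge path."""
    paths = {seed: []}
    frontier = deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand past the hop limit
        for relation, neighbor in graph.get(node, []):
            if neighbor not in paths:
                paths[neighbor] = paths[node] + [(relation, neighbor)]
                frontier.append((neighbor, depth + 1))
    return paths


one_hop = reachable(graph, "Alice", max_hops=1)
two_hops = reachable(graph, "Alice", max_hops=2)
print("Carol" in one_hop, two_hops["Carol"])
```

"Carol" never co-occurs with "Alice" in a single fact, so text similarity alone has nothing to match on; the answer only appears once the walk is allowed a second hop.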
7. Ablations
| Configuration | LongMemEval-S | Δ vs production |
|---|---|---|
| Production v1 pipeline (this report) | 93.8% | — |
| − Cohere rerank (use gpt-4o-mini) | 93.6% | −0.2 |
| − Graph traversal (BM25 + HNSW only) | 87.4% | −6.4 |
| − HNSW (BM25 + graph only) | 86.1% | −7.7 |
| − BM25 (HNSW + graph only) | 88.2% | −5.6 |
| − Bi-temporal facts layer (events only) | 81.0% | −12.8 |
| − Async extraction (no Facts at all) | 71.4% | −22.4 |
The single largest contributor is the asynchronous fact-extraction pipeline (−22.4 pp if removed). Bi-temporal storage is second (−12.8 pp). The hybrid retrieval components each contribute 5–8 pp; no single retrieval strategy is sufficient.
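The deltas and their ranking follow directly from the scores in the table; a short check, using shortened labels for the removed components:

```python
PRODUCTION = 93.8  # LongMemEval-S score of the full pipeline

# Score with each component removed, copied from the ablation table
ablations = {
    "Cohere rerank": 93.6,
    "Graph traversal": 87.4,
    "HNSW": 86.1,
    "BM25": 88.2,
    "Bi-temporal facts layer": 81.0,
    "Async extraction": 71.4,
}

deltas = {name: round(score - PRODUCTION, 1) for name, score in ablations.items()}
ranked = sorted(deltas, key=deltas.get)  # most damaging removal first

for name in ranked:
    print(f"{name:<26}{deltas[name]:+.1f} pp")
```

Each row removes one component from the full pipeline, so the deltas are not additive and should not be summed across rows.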
8. Operational characteristics
In addition to the accuracy numbers, the same runs report:
| Metric | Value | Notes |
|---|---|---|
| Write-path p50 latency | 4 ms | POST /v1/experience returns 202 with WAL offset |
| Write-path p99 latency | 12 ms | |
| Write-path error rate | 0.00% | 175,000 writes during the LongMemEval-S run, zero failures |
| Async-extraction completion (p50) | 18 s | Time from write to Facts visible |
| Recall p50 (holistic, 4 KB budget) | 489 ms | Includes hybrid retrieval + rerank |
| Answer p50 (Claude Opus 4.6) | 3.2 s | End-to-end including recall |
The write path is a disk append. There is no LLM call on the synchronous write path — that's what gives 4 ms p50 and 0% error rate. Extraction runs async, decoupled from ingest.
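The split between a synchronous append and asynchronous extraction can be sketched as follows. This is a toy model, not CortexDB's implementation: the `WriteAheadLog` class is hypothetical, an in-memory buffer stands in for the file on disk, and the extractor that would drain the queue is left out.

```python
import io
import json
import queue


class WriteAheadLog:
    """Append-only log: the synchronous path only writes bytes and enqueues work."""

    def __init__(self, sink, extract_queue: queue.Queue):
        self._sink = sink            # any binary file-like object opened for append
        self._queue = extract_queue  # drained later by an asynchronous extractor
        self._next_offset = 0

    def append(self, experience: dict) -> int:
        record = json.dumps(experience).encode("utf-8") + b"\n"
        offset = self._next_offset
        self._sink.write(record)     # no LLM call on this path
        self._next_offset += len(record)
        self._queue.put(offset)      # extraction is scheduled, not performed
        return offset                # the value a 202 response could carry


sink = io.BytesIO()                  # stand-in for a file on disk
pending = queue.Queue()
wal = WriteAheadLog(sink, pending)

first = wal.append({"text": "hi"})
second = wal.append({"text": "my dog is Rex"})
print(first, second, pending.qsize())
```

Because `append` does nothing slower than a buffered write and a queue put, its latency and failure rate are independent of whatever model the extractor later calls.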
9. Threats to validity
- Benchmark coverage. LongMemEval-S and LoCoMo were not designed for production memory architectures; both are biased toward conversational settings. Domains like code review, incident response, or enterprise-document Q&A are likely to score differently. We do not claim 93.8% generalizes outside conversational memory.
- Model availability. Claude Opus 4.6 and Cohere rerank-v3.5 are commercial endpoints. Self-hosted swaps (e.g., Llama-class answerer, BGE reranker) will score lower; our internal Llama-3.1-70B-Instruct + BGE-large-en-v1.5 swap reaches 88.4% on LongMemEval-S.
- Single-run variance. Each run is a single attempt with no per-question retries, which avoids retry-selection bias. We've replicated the LongMemEval-S run six times across April–May 2026; the standard deviation is 0.31 pp. The number we report is the median of those runs, not the maximum.
- Judge subjectivity. LLM-judge scores both benchmarks. The official scripts use GPT-4o as judge; swapping to Claude Opus 4.6 as judge shifts scores by ±0.6 pp on average without changing the system ranking.
- Cat-5 omission on LoCoMo. Per §5.1, we report cats 1–4. Including cat 5 with our default refusal behavior gives 81.2% overall; we consider this number incomparable across systems and don't headline it.
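To put the percentage-point figures in this report on a common scale, it helps to convert them to question counts on the 500-question LongMemEval-S panel. The sketch below also computes a textbook binomial standard error for the headline score; that calculation assumes questions are scored independently, which is our simplifying assumption and not something the runs establish.

```python
import math

PANEL_SIZE = 500  # LongMemEval-S questions


def pp_to_questions(pp: float, n: int = PANEL_SIZE) -> float:
    """Convert a percentage-point difference into a number of questions."""
    return pp / 100 * n


def binomial_se_pp(correct: int, n: int) -> float:
    """Standard error of an accuracy estimate, in percentage points."""
    p = correct / n
    return 100 * math.sqrt(p * (1 - p) / n)


figures = [
    ("lead over Mem0", 0.4),
    ("reranker swap", 0.2),
    ("run-to-run std dev", 0.31),
]
for label, pp in figures:
    print(f"{label}: {pp} pp = {pp_to_questions(pp):.2f} questions")

print(f"binomial SE at 469/500: {binomial_se_pp(469, 500):.2f} pp")
```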
10. Reproducibility
Both benchmarks are reproducible end-to-end from the public cortexv2 repo:
```shell
# Run each block from the root of the cortexv2 repo.

# LongMemEval-S: single command, ~2 hours, ~$50 of LLM cost
(cd benchmarks/longmemeval && ./run_full_panel.sh --tier server_parity --model claude-opus-4-6)

# LoCoMo cats 1-4: single command, ~6 hours, ~$120
(cd benchmarks/locomo && ./run_full_panel.sh --tier server_parity --model claude-opus-4-6)
```
Outputs land in server_results/: one JSON per question (prediction, gold, judge verdict, retrieval trace), one Markdown summary, one CSV per-category, and a MANIFEST.json capturing the exact model versions, retrieval config, git SHA, and panel hash. Replay any single question with ./replay.sh <qid>.
Test panels, retrieval traces, and judge transcripts are committed to the repo so independent reviewers can audit individual question outcomes without re-running the panel.
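An auditor who wants to recompute the per-category table from the per-question files could start from something like the sketch below. The JSON field names `category` and `judge_verdict` and the verdict value `"correct"` are assumptions made for illustration; the actual schema is whatever the repo's output files use. The demo writes three toy records to a temporary directory rather than reading real results.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def summarize(results_dir: str) -> dict:
    """Tally (correct, total) per category from one-JSON-per-question output."""
    tallies = defaultdict(lambda: [0, 0])
    for path in sorted(Path(results_dir).glob("*.json")):
        if path.name == "MANIFEST.json":
            continue  # run metadata, not a question record
        record = json.loads(path.read_text(encoding="utf-8"))
        bucket = tallies[record["category"]]  # hypothetical field name
        bucket[0] += 1 if record["judge_verdict"] == "correct" else 0
        bucket[1] += 1
    return {category: (c, n) for category, (c, n) in tallies.items()}


# Toy records for demonstration only
samples = [
    ("q1", "temporal-reasoning", "correct"),
    ("q2", "temporal-reasoning", "incorrect"),
    ("q3", "multi-session", "correct"),
]
with tempfile.TemporaryDirectory() as tmp:
    for qid, category, verdict in samples:
        payload = {"category": category, "judge_verdict": verdict}
        Path(tmp, f"{qid}.json").write_text(json.dumps(payload), encoding="utf-8")
    Path(tmp, "MANIFEST.json").write_text("{}", encoding="utf-8")
    summary = summarize(tmp)

print(summary)
```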
11. References
- Wu, D., Wang, H., Yu, W., Zhang, Y., Chang, K.-W., & Yu, D. (2025). LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. ICLR 2025.
- Maharana, A., Lee, D., Tulyakov, S., Bansal, M., Barbieri, F., & Fang, Y. (2024). Evaluating Very Long-Term Conversational Memory of LLM Agents. NAACL 2024.
- Chhikara, P., Shukla, A., Bhattacharya, S., et al. (2025). Mem0: A Memory-Centric Architecture for AI Agents.
- Packer, C., Wooders, S., Lin, K., Fang, V., Patil, S. G., Stoica, I., & Gonzalez, J. E. (2023). MemGPT: Towards LLMs as Operating Systems.
- Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.