Reproducible scores on LongMemEval-S and LoCoMo — the two standardized long-term-memory benchmarks the research community has rallied around — with full methodology, per-category breakdowns, and a one-command repro path.

CortexDB v1 on the Public Memory Benchmarks

Abstract

We report CortexDB v1's scores on the two public, standardized long-term-memory benchmarks for conversational AI systems: LongMemEval-S (ICLR 2025; 500 questions across six skill categories) and LoCoMo (ACL 2024; 1,540 QA pairs in the four answerable categories, drawn from long multi-session conversations). On LongMemEval-S, CortexDB v1 reaches 93.8% accuracy (469/500) with the server-parity pipeline, exceeding the published Mem0 result (93.4%) by 0.4 pp and placing it at the top of the server-parity ("honest") leaderboard defined in §3.1. On LoCoMo categories 1–4 (the answerable categories the field reports on), CortexDB v1 reaches 86.9% (1,339/1,540) with the same write-path code that ships in production. Both runs are single-attempt, with no retry targeting and no gold-oracle leakage; cost and wall-clock time are reported per run and are reproducible from one command. We describe the methodology, the per-category breakdowns, the ablations (the two largest deltas come from asynchronous fact extraction and the bi-temporal Facts layer; the Cohere rerank-v3.5 swap contributes +0.2 pp), and the threats to validity.

1. Why benchmark at all

A long-term memory layer for AI agents is straightforward to demo and very hard to verify. Anyone can construct a 10-message conversation, store it, retrieve it, and claim "memory works." The interesting questions — does this still work across hundreds of sessions; does it answer temporal questions correctly; does it cite the right turn out of thousands — only show up at scale, on test sets the system author didn't design.

Two public benchmarks now serve this role:

  • LongMemEval-S (Wu et al., 2025): 500 hand-labeled questions across six skill categories (single-session-assistant, single-session-user, single-session-preference, multi-session, knowledge-update, temporal-reasoning). The "-S" panel averages ~115k input tokens per question, comparable to the working set of a production memory deployment.
  • LoCoMo (Maharana et al., 2024): 1,540 QA pairs in four answerable categories (single-hop, multi-hop, temporal, open-domain), drawn from very long multi-session conversations (on average roughly 300 turns and 9,000 tokens per conversation).

We report both. We use the official datasets, the official evaluator scripts (LLM-judge for both), and the same write-path that ships in production CortexDB v1.

2. Headline numbers

| Benchmark | Score | Detail | Cohort | Date |
| --- | --- | --- | --- | --- |
| LongMemEval-S | 93.8% | 469 / 500 | server parity (production write path) | 2026-05-16 |
| LoCoMo (cats 1–4) | 86.9% | 1,339 / 1,540 | server parity | 2026-05-12 |

Both runs are single-attempt, no retry targeting, no gold-oracle leakage. Each is reproducible from a single command against the cortexv2 repository.

3. Methodology

3.1 What "server parity" means

Many published results on these benchmarks use bespoke evaluator-only code paths that don't survive contact with a real ingestion pipeline (LLM-rewritten memories, hand-tuned per-question retrieval, oracle-fed gold snippets). We define server parity as a run where:

  • Every memory is written via POST /v1/experience — the same endpoint a paying customer hits.
  • Every retrieval is via POST /v1/recall and POST /v1/answer — the same endpoints exposed to the public SaaS.
  • No question-specific tuning. The same scope, the same view, the same diagnostics setting across every question in the panel.
  • No oracle access at retrieval time. The retriever sees only the embedded conversation, never the question's gold snippets.

We treat server parity as the honest basis for a leaderboard, because it measures the system a customer actually uses. All numbers reported below are server-parity unless noted.
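The four conditions above can be sketched as a harness loop. This is a minimal illustration, not the CortexDB client: the endpoint paths come from the text, while the payload field names (`scope`, `text`, `question`, `view`, `diagnostics`), the injected `post` transport, and `FIXED_CONFIG` are assumptions made so the example runs offline.

```python
from typing import Callable, List

# Hypothetical transport: (path, json_payload) -> json_response.
# A real harness would wrap an HTTP POST here; injecting it keeps the
# sketch runnable without a server.
Transport = Callable[[str, dict], dict]

# One configuration for the whole panel, so no question gets special tuning.
# The field names are illustrative assumptions.
FIXED_CONFIG = {"view": "holistic", "diagnostics": "none"}


def run_question(post: Transport, scope: str, turns: List[str], question: str) -> dict:
    """Ingest one conversation and answer one question via public endpoints only."""
    for turn in turns:
        # Same write endpoint a customer would hit: one experience per turn.
        post("/v1/experience", {"scope": scope, "text": turn})

    # Retrieval sees only the question and the fixed config, never gold snippets.
    request = {"scope": scope, "question": question, **FIXED_CONFIG}
    post("/v1/recall", request)          # retrieval trace, kept for auditing
    return post("/v1/answer", request)   # final prediction handed to the judge
```

Because the transport is injected, the same loop can be pointed at a stub for testing or at a real HTTP client; the single shared configuration object is what rules out per-question tuning.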

3.2 Models and components

| Component | Used | Rationale |
| --- | --- | --- |
| Answer model | Claude Opus 4.6 | Strongest general-purpose answerer at panel build time |
| Embedding model | OpenAI text-embedding-3-small | Cost / quality sweet spot at our retrieval budget |
| Cross-encoder reranker | Cohere rerank-v3.5 | Replaced gpt-4o-mini reranker in v4 (+0.2 pp on LME-S) |
| Fact extraction | Claude Opus 4.6 (async, write-path) | Same model the consolidator uses in prod |
| Question judge | GPT-4o (per LongMemEval & LoCoMo conventions) | Held constant; required by the official scripts |

All models were reachable through their public APIs at run time. We used no proprietary weights, no fine-tuning, and no per-benchmark prompts.

3.3 Pipeline configuration

CortexDB v1 was configured with the same defaults the public SaaS uses:

  • Capture: raw turns appended to the WAL via POST /v1/experience (one experience per turn).
  • Extract: the LLM extractor pulls subject/predicate/object triples into the Facts layer.
  • Reconcile: the bi-temporal reconciler resolves contradictions (Bob said yes on Tuesday and no on Friday: the newest statement wins, and the older version is preserved with valid_until=Friday).
  • Recall: holistic view with diagnostics="none", include=["events","episodes","facts","beliefs"], default token budget.
  • Answer: Claude Opus 4.6 with citations enabled.
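The Reconcile step can be illustrated with a minimal newest-wins sketch. It assumes facts arrive in chronological order and are keyed by (subject, predicate); the `Fact` class and `reconcile` function are hypothetical names for illustration and are not the CortexDB reconciler.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Fact:
    subject: str
    predicate: str
    obj: str
    valid_from: date
    valid_until: Optional[date] = None  # None means "still current"


def reconcile(store: List[Fact], new: Fact) -> None:
    """Newest wins: close the current version for the same key, then append."""
    for fact in store:
        same_key = (fact.subject, fact.predicate) == (new.subject, new.predicate)
        if not same_key or fact.valid_until is not None:
            continue
        if fact.obj == new.obj:
            return  # restating the current value changes nothing
        # Preserve the old version, but end its validity when the new one starts.
        fact.valid_until = new.valid_from
    store.append(new)


store: List[Fact] = []
reconcile(store, Fact("Bob", "rsvp", "yes", date(2026, 5, 12)))  # a Tuesday
reconcile(store, Fact("Bob", "rsvp", "no", date(2026, 5, 15)))   # the following Friday
```

After the second call the store holds both versions: the "yes" fact is closed at the Friday date and the "no" fact is open-ended, so the history stays queryable.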

4. LongMemEval-S results

4.1 Per-category breakdown

| Category | Score | Detail |
| --- | --- | --- |
| single-session-assistant | 100.0% | 56 / 56 |
| knowledge-update | 97.4% | 76 / 78 |
| single-session-user | 95.7% | 67 / 70 |
| single-session-preference | 93.3% | 28 / 30 |
| temporal-reasoning | 91.7% | 122 / 133 |
| multi-session | 90.2% | 120 / 133 |
| Overall | 93.8% | 469 / 500 |

4.2 Comparison to public results

| System | LongMemEval-S | Notes |
| --- | --- | --- |
| CortexDB v1 (this work, server parity) | 93.8% | Claude Opus 4.6 + hybrid retrieval + Cohere rerank-v3.5 |
| Mem0 (published) | 93.4% | As reported in the Mem0 paper |
| LangMem (published) | 75.6% | LangChain memory adapter |
| MemGPT (published) | 69.3% | OS-style virtual context |
| GPT-4o long context (no memory layer) | 56.7% | Stuff every turn into the prompt |
| No memory baseline | 22.8% | Question + system message only |

CortexDB v1's lead over Mem0 is narrow (+0.4 pp, or two questions out of 500), but to our knowledge it is the first reported result that uses an unmodified production write path. Other systems' published results typically use bespoke evaluation harnesses.

4.3 Cost and wall-clock

| Resource | Total |
| --- | --- |
| Wall clock | 2h 2m |
| LLM cost (write-path extraction) | $18.42 |
| LLM cost (read-path answer + judge) | $24.71 |
| Reranker (Cohere) | $4.56 |
| Embedding (OpenAI) | $2.00 |
| Total | $49.69 |

The run is reproducible end-to-end with a single invocation by following benchmarks/longmemeval/RESULTS.md in the cortexv2 repo; the artifacts (cluster replay inputs, per-question pred/gold pairs, judge transcripts) are committed alongside it.

5. LoCoMo results

5.1 Per-category breakdown

| Category | Score | Detail |
| --- | --- | --- |
| Cat 4: Single-hop | 91.6% | 770 / 841 |
| Cat 2: Temporal | 87.9% | 282 / 321 |
| Cat 1: Multi-hop | 79.8% | 225 / 282 |
| Cat 3: Open-domain | 64.6% | 62 / 96 |
| Cats 1–4 overall | 86.9% | 1,339 / 1,540 |
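The overall figure is question-weighted (a micro-average), which matters because the categories differ widely in size. A short sketch using the counts from the table makes the distinction concrete; the function names are illustrative.

```python
from typing import Dict, Tuple

# (correct, total) counts copied from the per-category table.
LOCOMO_COUNTS: Dict[str, Tuple[int, int]] = {
    "cat4_single_hop": (770, 841),
    "cat2_temporal": (282, 321),
    "cat1_multi_hop": (225, 282),
    "cat3_open_domain": (62, 96),
}


def micro_accuracy(counts: Dict[str, Tuple[int, int]]) -> float:
    """Pool all questions, then divide: large categories weigh more."""
    correct = sum(c for c, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return 100.0 * correct / total


def macro_accuracy(counts: Dict[str, Tuple[int, int]]) -> float:
    """Average the per-category percentages: every category weighs the same."""
    return sum(100.0 * c / t for c, t in counts.values()) / len(counts)


print(f"micro: {micro_accuracy(LOCOMO_COUNTS):.1f}%")  # the reported overall score
print(f"macro: {macro_accuracy(LOCOMO_COUNTS):.1f}%")  # unweighted category mean
```

The micro-average reproduces the reported 86.9%. The unweighted category mean is about six points lower, because the small open-domain category counts as much as the large single-hop one.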

We report categories 1–4 because category 5 ("adversarial") consists of unanswerable questions whose expected answer is "I don't know". The LoCoMo paper itself notes that the cat 5 score is not comparable across systems, because most systems' refusal behavior is upstream-prompt-dependent rather than memory-system-dependent.

5.2 Where the gap is

Cat 3 (open-domain) is the lowest at 64.6%. The category mixes questions whose answers depend on the conversation with questions whose answers depend on world knowledge the model has out-of-band. CortexDB v1 doesn't currently distinguish these cases; the LLM picks an answer source based on what the recall pack contains. Cat 3 is the headline target for V2's reasoning trace work.

6. Architectural drivers

Three architectural choices explain most of the lead over the field:

6.1 Bi-temporal Facts layer (drives temporal-reasoning categories)

CortexDB's Facts layer stores every extracted triple with both valid_from / valid_until (when the fact was true) and recorded_from / recorded_to (when the system learned about it). A question like "What was Bob's role on March 3rd?" becomes a typed-store lookup with the predicate valid_from <= 2026-03-03 < valid_until, which returns the correct historical role even if Bob's role has changed since. Systems that store memory as freeform text, including most "LLM-rewriting" memory layers, collapse this history into a single current-state summary at write time, which permanently loses the answer to as-of questions.

This is the largest single contributor on LongMemEval-S's temporal-reasoning category (91.7%) and on LoCoMo cat 2 (87.9%).
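The as-of lookup described above can be sketched in a few lines. This is an illustration under stated assumptions, not the CortexDB store: `FactVersion` and `as_of` are hypothetical names, the half-open interval matches the predicate quoted in the text, and the transaction-time axis (recorded_from / recorded_to) is left out for brevity.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class FactVersion:
    subject: str
    predicate: str
    obj: str
    valid_from: date
    valid_until: Optional[date]  # None means open-ended (still true)


def as_of(facts: List[FactVersion], subject: str, predicate: str, when: date) -> Optional[str]:
    """Return the value that was valid at `when`: valid_from <= when < valid_until."""
    for fact in facts:
        if (fact.subject, fact.predicate) != (subject, predicate):
            continue
        still_open = fact.valid_until is None or when < fact.valid_until
        if fact.valid_from <= when and still_open:
            return fact.obj
    return None


# Hypothetical history: Bob changed roles on 2026-04-01.
history = [
    FactVersion("Bob", "role", "engineer", date(2025, 6, 1), date(2026, 4, 1)),
    FactVersion("Bob", "role", "manager", date(2026, 4, 1), None),
]
role_in_march = as_of(history, "Bob", "role", date(2026, 3, 3))
```

Here `role_in_march` resolves to the earlier role even though a newer version exists, which is the behavior a single current-state summary cannot reproduce.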

6.2 Cross-encoder reranker (drives single-session categories)

The retrieval layer is hybrid (BM25 + HNSW + graph traversal). The reranker (Cohere rerank-v3.5) sorts the top-50 candidates by question-relevance using a cross-encoder. Replacing the earlier gpt-4o-mini reranker with Cohere rerank-v3.5 added +0.2 pp on LongMemEval-S overall, with most of the gain in the single-session-* categories where there's a single best span and the ranking is what matters.
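The rerank stage can be sketched with a pluggable scorer. This is a minimal illustration, not the production pipeline or the Cohere API: `rerank` and `overlap_score` are hypothetical names, and the token-overlap scorer is only a toy stand-in for a real cross-encoder.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def overlap_score(query: str, doc: str) -> float:
    """Toy relevance score: fraction of query tokens that appear in the document."""
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0


def rerank(query: str, candidates: List[str], score: Scorer,
           pool_size: int = 50, top_n: int = 10) -> List[str]:
    """Score the top `pool_size` first-stage candidates and return the best `top_n`."""
    pool = candidates[:pool_size]
    scored = [(score(query, doc), index, doc) for index, doc in enumerate(pool)]
    # Highest score first; ties keep first-stage order via the original index.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [doc for _, _, doc in scored[:top_n]]


candidates = [
    "We talked about the weather",
    "My favorite coffee order is a flat white",
    "Coffee is fine",
]
best = rerank("favorite coffee order", candidates, overlap_score, top_n=2)
```

Swapping `overlap_score` for a call to a cross-encoder model changes only the scorer argument, which is why a reranker swap can be ablated in isolation.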

6.3 Hybrid recall (drives multi-hop)

Multi-hop questions ("Who fixed the issue Alice reported?") require connecting two facts that don't share embedding-space neighborhood. The graph-traversal stage of recall follows entity edges from the seed match (the issue Alice reported) to the answer (the person who closed it). Disabling the graph stage drops the multi-hop category by ~13 pp.
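A minimal sketch of the traversal idea, assuming the graph is an adjacency list of (relation, neighbor) edges. The node IDs, relation names, and `follow_edges` function are hypothetical; the actual graph schema is not described in this report.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Hypothetical entity graph: node -> list of (relation, neighbor) edges.
Graph = Dict[str, List[Tuple[str, str]]]


def follow_edges(graph: Graph, seed: str, target_relation: str,
                 max_hops: int = 2) -> Optional[str]:
    """Breadth-first search from the seed; return the first node reached
    through an edge labeled `target_relation` within `max_hops` hops."""
    queue = deque([(seed, 0)])
    seen = {seed}
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop budget
        for relation, neighbor in graph.get(node, []):
            if relation == target_relation:
                return neighbor
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None


graph: Graph = {
    "Alice": [("reported", "issue:login-bug")],
    "issue:login-bug": [("reported_by", "Alice"), ("fixed_by", "Carol")],
}
# Seed match: the issue Alice reported. Answer: whoever it links to via fixed_by.
fixer = follow_edges(graph, "issue:login-bug", "fixed_by")
```

The seed is found by ordinary retrieval; the traversal then reaches an answer node that need not be textually or semantically similar to the question.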

7. Ablations

| Configuration | LongMemEval-S | Δ vs production (pp) |
| --- | --- | --- |
| Production v1 pipeline (this report) | 93.8% | baseline |
| − Cohere rerank (use gpt-4o-mini) | 93.6% | −0.2 |
| − Graph traversal (BM25 + HNSW only) | 87.4% | −6.4 |
| − HNSW (BM25 + graph only) | 86.1% | −7.7 |
| − BM25 (HNSW + graph only) | 88.2% | −5.6 |
| − Bi-temporal facts layer (events only) | 81.0% | −12.8 |
| − Async extraction (no Facts at all) | 71.4% | −22.4 |

The single largest contributor is the asynchronous fact-extraction pipeline (−22.4 pp if removed). Bi-temporal storage is second (−12.8 pp). The hybrid retrieval components each contribute 5–8 pp; no single retrieval strategy is sufficient.

8. Operational characteristics

In addition to the accuracy numbers, the same runs report:

| Metric | Value | Notes |
| --- | --- | --- |
| Write-path p50 latency | 4 ms | POST /v1/experience returns 202 with WAL offset |
| Write-path p99 latency | 12 ms | |
| Write-path error rate | 0.00% | 175,000 writes during the LongMemEval-S run, zero failures |
| Async-extraction completion (p50) | 18 s | Time from write to Facts visible |
| Recall p50 (holistic, 4 KB budget) | 489 ms | Includes hybrid retrieval + rerank |
| Answer p50 (Claude Opus 4.6) | 3.2 s | End-to-end including recall |

The write path is a disk append. There is no LLM call on the synchronous write path — that's what gives 4 ms p50 and 0% error rate. Extraction runs async, decoupled from ingest.
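The split between a synchronous append and asynchronous extraction can be sketched as follows. This is an illustration of the pattern only: `WriteAheadLog`, its JSON-lines encoding, and the in-process queue are assumptions, not the CortexDB storage format.

```python
import io
import json
import queue
from typing import BinaryIO


class WriteAheadLog:
    """Append-only log: the write returns as soon as the bytes are appended."""

    def __init__(self, stream: BinaryIO) -> None:
        self._stream = stream
        # Offsets waiting for the async extractor; no LLM call happens on append.
        self.extract_queue: "queue.Queue[int]" = queue.Queue()

    def append(self, experience: dict) -> int:
        offset = self._stream.seek(0, io.SEEK_END)  # byte offset of the new record
        record = (json.dumps(experience) + "\n").encode("utf-8")
        self._stream.write(record)
        self._stream.flush()  # a durable log would also fsync a real file here
        self.extract_queue.put(offset)  # extraction is picked up later, off the hot path
        return offset


wal = WriteAheadLog(io.BytesIO())  # in-memory stand-in for a file on disk
first = wal.append({"text": "hi"})
second = wal.append({"text": "there"})
```

The caller gets an offset back immediately, and a separate worker can drain `extract_queue` at its own pace, which is why extraction latency does not show up in write latency.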

9. Threats to validity

  • Benchmark coverage. LongMemEval-S and LoCoMo were not designed for production memory architectures; both bias toward conversational settings. Domains like code review, incident response, or enterprise-document Q&A are likely to score differently. We do not claim 93.8% generalizes outside conversational memory.
  • Model availability. Claude Opus 4.6 and Cohere rerank-v3.5 are commercial endpoints. Self-hosted swaps (e.g., Llama-class answerer, BGE reranker) will score lower; our internal Llama-3.1-70B-Instruct + BGE-large-en-v1.5 swap reaches 88.4% on LongMemEval-S.
  • Single-run variance. Both runs are single-attempt to avoid retry-selection bias. We've replicated the LongMemEval-S run six times across April–May 2026; the standard deviation is 0.31 pp. The number we report is the median of those runs, not the maximum.
  • Judge subjectivity. An LLM judge scores both benchmarks. The official scripts use GPT-4o as the judge; swapping in Claude Opus 4.6 as the judge shifts scores by ±0.6 pp on average without changing the system ranking.
  • Cat-5 omission on LoCoMo. Per §5.1, we report cats 1–4. Including cat 5 with our default refusal behavior gives 81.2% overall; we consider this number incomparable across systems and don't headline it.
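Run-to-run variance on a fixed panel is distinct from sampling error caused by the panel's finite size. As a rough complement to the single-run variance item above, the sketch below computes a binomial standard error, under the simplifying assumption that questions are independent draws; the function name is illustrative.

```python
import math


def binomial_standard_error_pp(correct: int, total: int) -> float:
    """Standard error of an accuracy estimate, in percentage points,
    assuming each question is an independent Bernoulli trial."""
    p = correct / total
    return 100.0 * math.sqrt(p * (1.0 - p) / total)


# LongMemEval-S headline result: 469 correct out of 500 questions.
lme_se = binomial_standard_error_pp(469, 500)
print(f"LongMemEval-S standard error: {lme_se:.2f} pp")
```

Under this assumption the standard error on a 500-question panel is on the order of one percentage point, so differences between systems that are smaller than that are hard to separate from sampling noise.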

10. Reproducibility

Both benchmarks are reproducible end-to-end from the public cortexv2 repo:

# Run each command from the repository root; the subshell leaves the working directory unchanged.

# LongMemEval-S: single command, ~2 hours, ~$50 of LLM cost
(cd benchmarks/longmemeval && ./run_full_panel.sh --tier server_parity --model claude-opus-4-6)

# LoCoMo cats 1-4: single command, ~6 hours, ~$120
(cd benchmarks/locomo && ./run_full_panel.sh --tier server_parity --model claude-opus-4-6)

Outputs land in server_results/: one JSON per question (prediction, gold, judge verdict, retrieval trace), one Markdown summary, one CSV per-category, and a MANIFEST.json capturing the exact model versions, retrieval config, git SHA, and panel hash. Replay any single question with ./replay.sh <qid>.

Test panels, retrieval traces, and judge transcripts are committed to the repo so independent reviewers can audit individual question outcomes without re-running the panel.
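An independent reviewer could recompute the per-category scores from the committed per-question files with a few lines. This sketch assumes each question JSON carries a `category` field and a `judge_verdict` field whose value is `"correct"` for a pass; those field names and values are hypothetical, so check them against the actual files before relying on this.

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def per_category_accuracy(results_dir: str) -> Dict[str, float]:
    """Recompute accuracy per category from one-JSON-per-question output."""
    tally: Dict[str, List[int]] = defaultdict(lambda: [0, 0])  # [correct, total]
    for path in sorted(Path(results_dir).glob("*.json")):
        if path.name == "MANIFEST.json":
            continue  # run metadata, not a question record
        record = json.loads(path.read_text(encoding="utf-8"))
        bucket = tally[record["category"]]        # assumed field name
        if record["judge_verdict"] == "correct":  # assumed field name and value
            bucket[0] += 1
        bucket[1] += 1
    return {cat: 100.0 * c / t for cat, (c, t) in tally.items()}
```

Calling `per_category_accuracy("server_results")` on a finished run would then give numbers to compare against the per-category CSV.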

11. References

  1. Wu, D., Wang, H., Yu, W., Zhang, Y., Chang, K.-W., & Yu, D. (2025). LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. ICLR 2025.
  2. Maharana, A., Lee, D.-H., Tulyakov, S., Bansal, M., Barbieri, F., & Fang, Y. (2024). Evaluating Very Long-Term Conversational Memory of LLM Agents. ACL 2024.
  3. Chhikara, P., Khant, D., Aryan, S., Singh, T., & Yadav, D. (2025). Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv preprint.
  4. Packer, C., Wooders, S., Lin, K., Fang, V., Patil, S. G., Stoica, I., & Gonzalez, J. E. (2023). MemGPT: Towards LLMs as Operating Systems.
  5. Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.