CortexDB vs the Competition: A Production-Scale Memory Benchmark

In a controlled, production-scale internal benchmark, CortexDB achieved 84.2% retrieval accuracy against 31.8% for an LLM-rewrite-on-write baseline, at 1/337th the daily write-path cost and with a 0.0% write-path error rate (baseline: 4.2%). CortexDB, the long-term memory layer for AI agents built by Apache Cassandra co-creator Prashant Malik, attributes this gap to its lossless event-sourced memory architecture. This post presents an internal comparison across five simulated production-scale scenarios in which identical LLMs, identical embedding models, and identical test data isolate the memory architecture as the sole variable.

Note

Methodology note. This post describes an internal, synthetic benchmark against a baseline configuration: a vector-database plus LLM-rewrite-on-write architecture. The numbers below come from this internal synthetic scenario, not third-party reproduction. For results on published, third-party benchmarks, CortexDB reports 93.8% on LongMemEval-S and 86.9% on LoCoMo categories 1-4. See the LongMemEval blog post for details.

Methodology

Controlled variables

The memory architecture is the sole independent variable. Every other component is held constant.

| Variable | Value |
|---|---|
| LLM | GPT-4o |
| Embedding model | OpenAI text-embedding-3-small (1536 dim) for the benchmark; production CortexDB defaults to Cohere embed-english-v3.0 (1024 dim) |
| Test data | 10,000 episodes across 5 scenarios |
| Query set | 500 queries, hand-labelled with ground truth |
| Hardware | AWS r6g.2xlarge (8 vCPU, 64 GB RAM, gp3 SSD) |
| Evaluation | Automated scoring plus human review of disagreements |

Evaluation criteria

Every query was scored binary correct or incorrect, where correct means the system returned the ground-truth answer in its top-5 results. Additional metrics:

Retrieval accuracy. Percentage of queries where the correct answer appeared in the top-5.
Mean Reciprocal Rank (MRR). How highly the correct answer was ranked.
Ingest throughput. Episodes per second sustained over 10 minutes.
Write-path error rate. Percentage of writes that failed or produced errors.
Write-path latency. p50 and p99 latency for episode ingestion.
Cost per day. Estimated cost for continuous ingestion at 100 episodes per minute.
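The two retrieval metrics above can be sketched in a few lines. This is a minimal illustration rather than the benchmark's scoring harness: the function names and the toy data are hypothetical, and cutting the reciprocal rank off at the top 5 is an assumption, since the post does not say how deep the ranking is scored for MRR.

```python
from typing import Sequence


def top_k_hit(ranked_ids: Sequence[str], truth_id: str, k: int = 5) -> bool:
    """Binary score: True if the ground-truth item is among the first k results."""
    return truth_id in ranked_ids[:k]


def reciprocal_rank(ranked_ids: Sequence[str], truth_id: str, k: int = 5) -> float:
    """1/rank of the ground-truth item within the top k, or 0.0 if it is absent."""
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item == truth_id:
            return 1.0 / rank
    return 0.0


def score(results, k: int = 5):
    """Return (retrieval accuracy, MRR) over a list of (ranked_ids, truth_id) pairs."""
    n = len(results)
    accuracy = sum(top_k_hit(r, t, k) for r, t in results) / n
    mrr = sum(reciprocal_rank(r, t, k) for r, t in results) / n
    return accuracy, mrr


# Toy data for illustration only, not benchmark data.
toy = [
    (["e7", "e2", "e9"], "e7"),  # hit at rank 1
    (["e4", "e1", "e3"], "e1"),  # hit at rank 2
    (["e5", "e6", "e8"], "e0"),  # miss
]
accuracy, mrr = score(toy)
```

On the toy data, two of three queries hit, and the reciprocal ranks are 1, 1/2, and 0.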

What was measured and what was not

The benchmark measures retrieval quality and operational characteristics (throughput, latency, error rate, and cost). It does not measure model quality, feature breadth, or integration surface area. The question it asks is narrow: does the memory system itself affect how well an AI agent can remember and retrieve information?

Test scenarios

Scenario 1: Engineering incident response

Setup. 2,000 episodes from a simulated 6-month engineering organisation. Slack messages, PagerDuty alerts, GitHub PRs, Jira tickets, and post-mortem documents related to 15 distinct incidents.

Queries. "What was the root cause of the payments outage on March 3rd", "which incidents were related to the Redis cluster", "what remediation was done after the last database failover".

Why this matters. Incident knowledge accumulates over time, involves many entities, and requires temporal reasoning.

Scenario 2: Deployment tracking

Setup. 1,500 episodes tracking deployments across 8 microservices over 4 months. Deploy events, rollback events, configuration changes, performance regressions, and related Slack discussions.

Queries. "When was the last time we deployed the auth service", "what caused the rollback on February 15th", "which services had config changes in the last week".

Why this matters. Deployment history is factual and temporal. Getting the wrong deploy date or confusing two deployments has real operational consequences.

Scenario 3: Knowledge synthesis

Setup. 3,000 episodes from a product team's daily operations: standups, sprint retrospectives, product requirement documents, design decisions, and customer feedback.

Queries. "What features did we decide not to build in Q4", "what was the rationale for choosing gRPC over REST", "summarise the customer feedback about the dashboard".

Why this matters. Synthesis queries require combining information from multiple episodes spread over weeks or months. Lossy summarisation approaches struggle most here.

Scenario 4: People and relationships

Setup. 2,000 episodes across a 50-person engineering organisation. Team assignments, project ownership, mentorship relationships, meeting notes, and organisational announcements.

Queries. "Who owns the payments service", "what projects has Alice worked on", "who reported to the VP of engineering before the reorg".

Why this matters. People queries require entity tracking and relationship awareness. Answering "who reported to whom before the reorg" requires temporal reasoning about relationships.

Scenario 5: Multi-session continuity

Setup. 1,500 episodes from an AI assistant serving a single user across 100 distinct sessions over 6 months. Preferences, past decisions, project context, and personal information.

Queries. "What are my notification preferences", "what did I decide about the vacation policy last month", "when did I last update my portfolio allocation".

Why this matters. This is the core AI companion use case. The agent must maintain continuity across sessions without losing or confusing personal information.

Results

Overall accuracy

System	Retrieval accuracy	MRR	p50 write latency	p99 write latency
CortexDB	84.2%	0.78	4 ms	12 ms
Baseline	31.8%	0.24	680 ms	2,400 ms

The 52.4-point accuracy gap is structural, not implementation-specific. The 170x p50 latency gap follows from the architectural difference between an LLM call on every write and a disk append.

Accuracy by scenario

Scenario	CortexDB	Baseline	Delta
Incident response	86%	28%	+58
Deployment tracking	91%	35%	+56
Knowledge synthesis	78%	29%	+49
People and relationships	82%	34%	+48
Multi-session continuity	84%	33%	+51

CortexDB reports higher accuracy across all five scenarios. The largest deltas come from scenarios with the most temporal reasoning (incident response, deployment tracking).
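As a quick arithmetic check, the per-scenario figures reproduce the headline numbers if the overall accuracy is the unweighted mean of the five scenarios. That weighting is an assumption (the post does not state how the 500 queries are split across scenarios); the values below are copied from the table above.

```python
# (CortexDB %, baseline %) per scenario, copied from the table above.
scenarios = {
    "Incident response": (86, 28),
    "Deployment tracking": (91, 35),
    "Knowledge synthesis": (78, 29),
    "People and relationships": (82, 34),
    "Multi-session continuity": (84, 33),
}

# Assumption: each scenario carries equal weight in the overall figure.
cortexdb_mean = sum(c for c, _ in scenarios.values()) / len(scenarios)
baseline_mean = sum(b for _, b in scenarios.values()) / len(scenarios)

deltas = {name: c - b for name, (c, b) in scenarios.items()}
gap_points = cortexdb_mean - baseline_mean
accuracy_ratio = cortexdb_mean / baseline_mean
```

Under that assumption the means come out at 84.2 and 31.8, a gap of 52.4 points and a ratio of roughly 2.6x.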

Ingest performance

Metric	CortexDB	Baseline
Sustained throughput	850 episodes/s	340 episodes/s
p50 write latency	4 ms	680 ms
p99 write latency	12 ms	2,400 ms
Write-path error rate	0.0%	4.2%
Write-path LLM calls	0	10,000

CortexDB does not invoke an LLM on the write path. The baseline calls an LLM for every episode to merge it with existing memories, which introduces latency, cost, and a non-zero failure rate.
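The shape of such a write path can be sketched as a synchronous append plus a hand-off to a background worker. This is an illustrative toy, not CortexDB's implementation: the `AppendOnlyStore` class, the JSON-lines log format, and the capitalised-word "entity extraction" are all hypothetical stand-ins, and a single thread stands in for the worker pool.

```python
import json
import queue
import tempfile
import threading
from pathlib import Path


class AppendOnlyStore:
    """Hypothetical sketch: disk append on the write path, enrichment off it."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.pending = queue.Queue()  # episodes awaiting enrichment
        self.entities = {}            # entity name -> list of episode ids
        self._next_id = 0
        threading.Thread(target=self._enrich_loop, daemon=True).start()

    def write(self, text):
        """Write path: one append to disk, then enqueue for async enrichment."""
        episode = {"id": self._next_id, "text": text}
        self._next_id += 1
        with self.log_path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(episode) + "\n")
        self.pending.put(episode)
        return episode["id"]

    def _enrich_loop(self):
        """Background worker: index entities without blocking writers."""
        while True:
            episode = self.pending.get()
            try:
                # Stand-in for real extraction: treat capitalised words as entities.
                for word in episode["text"].split():
                    if word.istitle():
                        name = word.strip(".,")
                        self.entities.setdefault(name, []).append(episode["id"])
            finally:
                self.pending.task_done()


with tempfile.TemporaryDirectory() as tmp:
    store = AppendOnlyStore(Path(tmp) / "episodes.log")
    store.write("Alice works on the payments team and owns the checkout service.")
    store.write("Alice moved from the payments team to the platform team last week.")
    store.pending.join()  # wait for enrichment; only needed for this demo
    logged = (Path(tmp) / "episodes.log").read_text(encoding="utf-8").splitlines()
```

The point of the design is that `write` returns after the append, so a slow or failing enrichment step cannot add latency or errors to ingestion.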

Write-path cost analysis

At a sustained ingestion rate of 100 episodes per minute (a moderate production workload for a team of 50 engineers):

Cost component	CortexDB	Baseline
LLM calls on write path	$0 per day	$9,600 per day
Embedding generation	$18 per day	$18 per day
Compute (`r6g.2xlarge`)	$8.20 per day	$8.20 per day
Storage (`gp3` SSD)	$2.40 per day	$1.80 per day
Total	$28.60 per day	$9,628 per day

The cost difference is 337x. The baseline must call the LLM for every write. CortexDB builds the knowledge graph via asynchronous extraction, keeping the write path LLM-free.

Note

The baseline's write-path LLM cost is calculated from GPT-4o pricing at approximately 1,000 tokens per memory operation. Actual costs vary with episode length and LLM pricing changes.
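The totals and the 337x ratio follow directly from the table. The sketch below reproduces that arithmetic using only figures from this section; the implied per-write LLM cost is derived here by division and is not a number the post states.

```python
EPISODES_PER_MINUTE = 100
episodes_per_day = EPISODES_PER_MINUTE * 60 * 24  # 144,000 writes per day

# Daily cost components in USD, copied from the table above.
cortexdb = {"llm": 0.00, "embedding": 18.00, "compute": 8.20, "storage": 2.40}
baseline = {"llm": 9600.00, "embedding": 18.00, "compute": 8.20, "storage": 1.80}

cortexdb_total = round(sum(cortexdb.values()), 2)
baseline_total = round(sum(baseline.values()), 2)
cost_ratio = baseline_total / cortexdb_total

# Derived, not stated in the post: average LLM spend per baseline write.
implied_llm_cost_per_write = baseline["llm"] / episodes_per_day
```

This gives $28.60 and $9,628 per day, a ratio of about 336.6 (rounded to 337x), and an implied LLM cost of roughly $0.067 per baseline write.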

Why CortexDB preserves more information

The accuracy gap is a direct consequence of how each system handles incoming information.

The baseline: merge and lose

When the baseline receives a new episode, it calls the LLM to merge it with existing memories.

New episode: "Alice moved from the payments team to the platform team last week."

Existing memory: "Alice works on the payments team and owns the checkout service."

LLM merge result: "Alice works on the platform team."

The merge loses Alice's history on the payments team, the fact that she owned the checkout service, the timing of the transition ("last week"), and the fact that a transition happened at all. When a later query asks "who used to own the checkout service", the answer is gone.

CortexDB: preserve and connect

CortexDB preserves the original episode and builds connected context through asynchronous extraction. When a later query asks "who used to own the checkout service", the original episode is still on the log, attached to Alice's entity node, with the original timestamp preserved.
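The contrast between the two approaches can be made concrete with the Alice example. The sketch below is a toy model under stated assumptions: `Episode` and `EventLog` are hypothetical names, the dates are invented for illustration, and entities are tagged by hand where a real system would extract them asynchronously.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Episode:
    recorded_on: date
    text: str
    entities: tuple


@dataclass
class EventLog:
    """Append-only log with an entity index; nothing is ever overwritten."""
    episodes: list = field(default_factory=list)
    by_entity: dict = field(default_factory=dict)

    def append(self, episode):
        idx = len(self.episodes)
        self.episodes.append(episode)
        for entity in episode.entities:
            self.by_entity.setdefault(entity, []).append(idx)

    def about(self, entity):
        """All original episodes that mention the entity, in write order."""
        return [self.episodes[i] for i in self.by_entity.get(entity, [])]


# Merge-and-lose: one mutable memory per entity, replaced by the merge result.
merged_memory = {"Alice": "Alice works on the payments team and owns the checkout service."}
merged_memory["Alice"] = "Alice works on the platform team."

# Preserve-and-connect: both episodes stay on the log (dates are hypothetical).
log = EventLog()
log.append(Episode(date(2025, 1, 6),
                   "Alice works on the payments team and owns the checkout service.",
                   ("Alice", "checkout service")))
log.append(Episode(date(2025, 3, 10),
                   "Alice moved from the payments team to the platform team last week.",
                   ("Alice",)))

checkout_history = log.about("checkout service")
```

After the merge, the string "checkout" no longer appears anywhere in `merged_memory`, while `checkout_history` still returns the original episode with its timestamp.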

Detailed scenario breakdowns

Incident response

The incident response scenario shows the largest gap (58 points). Incidents unfold over hours or days, involving dozens of messages, alerts, and code changes. Temporal ordering is critical to root-cause analysis.

Example query. "What was the root cause of the March 3rd payments outage."

CortexDB retrieves the post-mortem with the exact root cause, plus the original PagerDuty alert and the Slack thread where the team debugged the issue. The original episodes are preserved and connected through entity relationships in the knowledge graph.

The baseline returns a generic summary about payments issues, missing the specific root cause and timeline. The details were lost through repeated LLM merges.

Deployment tracking

Deployment queries are factual. "When was the last deploy" has one correct answer. There is no room for approximate semantic matching.

CortexDB supports time-range queries natively and retrieves specific deployment episodes with exact dates and details through 4-channel hybrid retrieval (BM25 + HNSW vectors + graph traversal + cross-encoder reranking).
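To illustrate why preserved timestamps make these queries exact, here is a minimal sketch of a time-range lookup over a timestamp-ordered log using binary search. It is not CortexDB's query API: the `in_range` and `last_event` helpers, the event tuples, and the version strings are hypothetical, and the retrieval channels named above are not modelled.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime

# Hypothetical deployment episodes, kept sorted by timestamp.
deploys = [
    (datetime(2025, 2, 10, 9, 0), "auth", "deploy v1.4.0"),
    (datetime(2025, 2, 15, 14, 30), "auth", "rollback to v1.3.9"),
    (datetime(2025, 2, 20, 11, 15), "payments", "config change"),
    (datetime(2025, 3, 1, 16, 45), "auth", "deploy v1.4.1"),
]
timestamps = [ts for ts, _, _ in deploys]


def in_range(start, end):
    """Episodes with start <= timestamp <= end, found by binary search."""
    lo = bisect_left(timestamps, start)
    hi = bisect_right(timestamps, end)
    return deploys[lo:hi]


def last_event(service, kind_prefix="deploy"):
    """Most recent episode for a service whose description starts with kind_prefix."""
    for ts, svc, what in reversed(deploys):
        if svc == service and what.startswith(kind_prefix):
            return ts, svc, what
    return None


mid_february = in_range(datetime(2025, 2, 14), datetime(2025, 2, 21))
last_auth_deploy = last_event("auth")
```

Because each episode keeps its own timestamp, "the rollback on February 15th" and "the last auth deploy" resolve to single records instead of a merged summary.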

The baseline's merged memories collapsed multiple configuration changes into summaries like "various config changes were made to the auth and payments services recently". The specific dates, the specific changes, and the specific services were lost.

Knowledge synthesis

Synthesis queries require aggregating information from many episodes. "What features did we decide not to build in Q4" might require information from 15 different meeting notes and Slack conversations.

CortexDB retrieves all relevant episodes through Cognitive Recall and lets the calling LLM synthesise them. Because the original episodes are preserved on the lossless event-sourced log, no detail is lost before synthesis.

The baseline's merged memories had already synthesised the information through repeated lossy LLM merges. Features discussed briefly in one meeting and decided against in another had been dropped entirely from the merged memory.

Methodology notes

Alignment with LoCoMo

The evaluation methodology is informed by the LoCoMo benchmark (Maharana et al., 2024). Key alignments:

Multi-hop reasoning queries. Questions that require connecting information across multiple episodes.
Temporal reasoning queries. Questions about when events occurred relative to each other.
Factual precision scoring. Binary correct or incorrect rather than subjective quality ratings.

The benchmark extends LoCoMo in two ways:

Production-scale data volumes. 10,000 episodes against LoCoMo's conversation-scale data.
Operational metrics. Write latency, error rates, and cost, which are critical for production deployment but not covered by LoCoMo.

Limitations

The baseline was tested via its cloud API. Self-hosted performance may differ.
Both systems were configured with default settings. Tuning may improve either system's results.
The test scenarios weight toward engineering and DevOps use cases. Results may vary for other domains.
LLM pricing changes over time. Cost figures are based on March 2025 pricing.

Frequently asked questions

What does the production-scale benchmark measure?

The internal benchmark measures retrieval accuracy, MRR, ingest throughput, write-path latency, write-path error rate, and daily operational cost across five production-scale scenarios (10,000 episodes total, 500 hand-labelled queries). CortexDB reports 84.2% retrieval accuracy against the LLM-rewrite-on-write baseline's 31.8%, with a 337:1 daily cost ratio at moderate scale.

Why is the write-path cost difference 337x?

The baseline runs an LLM call on every write to merge new content with existing memories, so its cost scales linearly with write volume. CortexDB runs no LLM on the write path; the write path is a disk append followed by async knowledge graph enrichment on a separate worker pool. At 100 episodes per minute, the baseline's write-path LLM calls alone cost approximately $9,600 per day, for a total of $9,628. CortexDB's total daily cost is $28.60.

How does this compare to LongMemEval and LoCoMo?

This benchmark is internal and synthetic. It measures CortexDB against a generic LLM-rewrite-on-write baseline, not a specific named system. For published, third-party benchmarks, CortexDB reports 93.8% on LongMemEval-S (469 of 500) and 86.9% on LoCoMo (cats 1 to 4). See the LongMemEval blog post for details.

Why does CortexDB outperform on incident response?

Incident response queries require temporal ordering, multi-source aggregation, and root-cause traceability. CortexDB preserves every original episode with its timestamp on the lossless event-sourced log, connects entities across PagerDuty alerts and Slack threads and GitHub PRs through async knowledge graph enrichment, and retrieves the connected context through 4-channel hybrid retrieval. The baseline's LLM-rewriting collapses temporal detail into summaries.

Are the benchmark results reproducible?

CortexDB reports these internal results in public docs. A public reproduction repository for this specific internal benchmark suite is not yet available. For results on published, third-party benchmarks, see our LongMemEval and LoCoMo results.

What hardware was used?

AWS r6g.2xlarge instances (8 vCPU, 64 GB RAM, gp3 SSD with 3,000 IOPS baseline). Both systems ran on identical hardware to isolate the architectural variable.

Which LLM was used for evaluation?

Both systems used GPT-4o. CortexDB's write path runs no LLM; the LLM is invoked only during Cognitive Recall and during async knowledge graph enrichment. The baseline invoked GPT-4o on every write to merge new episodes with existing memories.

Conclusion

The memory system matters more than the model. Both systems used the same LLM and the same embedding model, yet in this internal benchmark CortexDB achieved 2.6x the retrieval accuracy at 1/337th the daily write-path cost.

CortexDB preserves information through lossless event sourcing. LLM-rewriting baselines lose information at every write. The benchmark results follow directly from this architectural difference.