A controlled comparison of CortexDB against the leading alternative on real-world AI agent scenarios. CortexDB achieves 84% accuracy vs 32%, with 2.5x faster ingest and zero write-path errors.

CortexDB vs the Competition: A Production-Scale Memory Benchmark

AI memory systems make bold claims about retrieval quality, but the industry lacks rigorous, reproducible benchmarks for comparing them. We set out to fix that.

This post presents a controlled, head-to-head comparison between CortexDB and the leading alternative across five production-scale scenarios. We used identical LLMs, identical embedding models, and identical test data. The only variable was the memory system itself.

Methodology

Controlled Variables

To isolate the effect of the memory system, we held everything else constant:

| Variable | Value |
|---|---|
| LLM | GPT-4o (2025-01-01) |
| Embedding model | text-embedding-3-large (3072 dimensions) |
| Test data | 10,000 episodes across 5 scenarios |
| Query set | 500 queries, hand-labeled with ground-truth answers |
| Hardware | AWS r6g.2xlarge (8 vCPU, 64 GB RAM, gp3 SSD) |
| Evaluation | Automated scoring + human review of disagreements |

Evaluation Criteria

Each query was scored on a binary correct/incorrect basis, where "correct" means the system returned the ground-truth answer in its top-5 results. We also measured:

  • Retrieval accuracy -- percentage of queries where the correct answer appeared in top-5
  • Mean Reciprocal Rank (MRR) -- how highly the correct answer was ranked
  • Ingest throughput -- episodes per second sustained over 10 minutes
  • Write-path error rate -- percentage of writes that failed or produced errors
  • Write-path latency -- p50 and p99 latency for episode ingestion
  • Cost per day -- estimated cost for continuous ingestion at 100 episodes/minute
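The two retrieval metrics can be computed directly from ranked result lists. A minimal sketch (the published benchmark suite contains the actual scoring script; this is only illustrative):

```python
def score_queries(results, ground_truth, k=5):
    """Compute top-k retrieval accuracy and Mean Reciprocal Rank (MRR).

    results: dict mapping query id -> ranked list of candidate answer ids
    ground_truth: dict mapping query id -> correct answer id
    """
    hits = 0
    reciprocal_ranks = []
    for qid, ranked in results.items():
        correct = ground_truth[qid]
        if correct in ranked[:k]:
            hits += 1
        # MRR uses the 1-based rank of the correct answer, 0 if absent
        rank = ranked.index(correct) + 1 if correct in ranked else None
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    accuracy = hits / len(results)
    mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)
    return accuracy, mrr
```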

What We Measured (and What We Did Not)

This benchmark measures retrieval quality and operational characteristics. It does not measure:

  • Model quality (both systems used the same LLM and embedding model)
  • Feature breadth (the baseline system has features CortexDB lacks, and vice versa)
  • Ease of integration (both have Python SDKs with similar APIs)

We are measuring one thing: does the memory system itself affect how well an AI agent can remember and retrieve information?

Test Scenarios

Scenario 1: Engineering Incident Response

Setup: 2,000 episodes from a simulated 6-month engineering organization. Includes Slack messages, PagerDuty alerts, GitHub PRs, Jira tickets, and post-mortem documents related to 15 distinct incidents.

Queries: "What was the root cause of the payments outage on March 3rd?", "Which incidents were related to the Redis cluster?", "What remediation was done after the last database failover?"

Why this matters: Incident knowledge is the canonical example of information that accumulates over time, involves many entities, and requires temporal reasoning.

Scenario 2: Deployment Tracking

Setup: 1,500 episodes tracking deployments across 8 microservices over 4 months. Includes deploy events, rollback events, config changes, performance regressions, and related Slack discussions.

Queries: "When was the last time we deployed the auth service?", "What caused the rollback on February 15th?", "Which services had config changes in the last week?"

Why this matters: Deployment history is factual and temporal. Getting the wrong deploy date or confusing two deployments has real operational consequences.

Scenario 3: Knowledge Synthesis

Setup: 3,000 episodes from a product team's daily operations: standups, sprint retrospectives, product requirement documents, design decisions, and customer feedback.

Queries: "What features did we decide not to build in Q4?", "What was the rationale for choosing gRPC over REST?", "Summarize the customer feedback about the dashboard."

Why this matters: Synthesis queries require combining information from multiple episodes spread over weeks or months. This is where lossy summarization approaches struggle most.

Scenario 4: People and Relationships

Setup: 2,000 episodes across a 50-person engineering organization. Includes team assignments, project ownership, mentorship relationships, meeting notes, and organizational announcements.

Queries: "Who owns the payments service?", "What projects has Alice worked on?", "Who reported to the VP of Engineering before the reorg?"

Why this matters: People queries require entity tracking and relationship awareness. Answering "who reported to whom before the reorg" requires temporal reasoning about relationships.

Scenario 5: Multi-Session Continuity

Setup: 1,500 episodes from an AI assistant serving a single user across 100 distinct sessions over 6 months. Includes preferences, past decisions, project context, and personal information.

Queries: "What are my notification preferences?", "What did I decide about the vacation policy last month?", "When did I last update my portfolio allocation?"

Why this matters: This is the core AI companion use case. The agent must maintain continuity across sessions without losing or confusing personal information.

Results

Overall Accuracy

| System | Retrieval Accuracy | MRR | p50 Write Latency | p99 Write Latency |
|---|---|---|---|---|
| CortexDB | 84.2% | 0.78 | 4ms | 12ms |
| Baseline | 31.8% | 0.24 | 680ms | 2,400ms |

Accuracy by Scenario

| Scenario | CortexDB | Baseline | Delta (points) |
|---|---|---|---|
| Incident Response | 86% | 28% | +58 |
| Deployment Tracking | 91% | 35% | +56 |
| Knowledge Synthesis | 78% | 29% | +49 |
| People & Relationships | 82% | 34% | +48 |
| Multi-Session Continuity | 84% | 33% | +51 |

Ingest Performance

| Metric | CortexDB | Baseline |
|---|---|---|
| Sustained throughput | 850 episodes/s | 340 episodes/s |
| p50 write latency | 4ms | 680ms |
| p99 write latency | 12ms | 2,400ms |
| Write-path error rate | 0.0% | 4.2% |
| Write-path LLM calls | 0 | 10,000 |

CortexDB does not require LLM involvement on the write path. The baseline calls the LLM for every episode to merge it with existing memories, which introduces latency, cost, and a non-zero failure rate.
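The difference can be sketched in a few lines of illustrative Python (names are hypothetical, not either system's actual API): the baseline blocks each write on an LLM merge call, while an append-then-enrich design returns immediately and defers processing to a background worker.

```python
import queue

# Baseline-style write path: every write blocks on an LLM merge call,
# which adds latency, cost, and a failure mode to ingestion.
def baseline_write(episode, memories, call_llm):
    merged = call_llm(episode, memories)  # hundreds of ms; can fail
    memories.append(merged)

# Async-style write path (as described above): append the raw episode,
# enqueue it for background enrichment, and return immediately.
write_log = queue.Queue()

def async_write(episode, episodes):
    episodes.append(episode)  # milliseconds, no external calls
    write_log.put(episode)    # a background worker builds context later
```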

Write-Path Cost Analysis

At a sustained ingestion rate of 100 episodes per minute (a moderate production workload for a team of 50 engineers):

| Cost Component | CortexDB | Baseline |
|---|---|---|
| LLM calls on write path | $0/day | $9,600/day |
| Embedding generation | $18/day | $18/day |
| Compute (r6g.2xlarge) | $8.20/day | $8.20/day |
| Storage (gp3 SSD) | $2.40/day | $1.80/day |
| Total | $28.60/day | $9,628/day |

The cost difference is 337x. The baseline system must call the LLM for every write. CortexDB processes information asynchronously, keeping the write path fast and LLM-free.

Note

The baseline's write-path LLM cost is calculated from GPT-4o pricing at ~1,000 tokens per memory operation. Actual costs vary based on episode length and LLM pricing changes.
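The LLM line item follows from simple arithmetic. A quick sanity check in Python, using the per-call cost implied by the table rather than quoted from a price sheet:

```python
EPISODES_PER_MIN = 100
CALLS_PER_DAY = EPISODES_PER_MIN * 60 * 24   # one LLM call per episode: 144,000/day
COST_PER_CALL = 9_600 / CALLS_PER_DAY        # implied: ~$0.067 per merge call

def llm_write_cost_per_day(episodes_per_min, cost_per_call):
    """Daily write-path LLM spend at a sustained ingestion rate."""
    return episodes_per_min * 60 * 24 * cost_per_call

baseline_cost = llm_write_cost_per_day(EPISODES_PER_MIN, COST_PER_CALL)  # $9,600/day
cortex_cost = llm_write_cost_per_day(EPISODES_PER_MIN, 0.0)              # $0: no LLM on writes
```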

Why CortexDB Preserves More Information

The results come down to how each system handles incoming information.

The Alternative: Merge and Lose

When the alternative system receives a new episode, it calls the LLM to merge it with existing memories:

New episode: "Alice moved from the payments team to the platform team last week."

Existing memory: "Alice works on the payments team and owns the checkout service."

LLM merge result: "Alice works on the platform team."

What was lost?

  • Alice's history on the payments team
  • That she owned the checkout service
  • The timing of the transition ("last week")
  • The fact that she moved (implying she was somewhere before)

When you later ask "Who used to own the checkout service?", the answer is gone.

CortexDB: Preserve and Connect

CortexDB preserves the original episode and builds connected context from it. When you later ask "Who used to own the checkout service?", the original information is still there, along with the temporal context of when relationships changed.
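A toy illustration of the difference (plain Python with made-up data, not CortexDB's actual data model): because episodes are kept verbatim with timestamps, a historical question like "who used to own the checkout service?" can still be answered after the facts change.

```python
from dataclasses import dataclass, field

@dataclass
class EpisodeStore:
    """Append-only store: originals are never rewritten, only connected."""
    episodes: list = field(default_factory=list)

    def add(self, ts, text, entities):
        self.episodes.append({"ts": ts, "text": text, "entities": entities})

    def history(self, entity):
        # Every mention survives in order, so past facts stay queryable.
        return [e for e in self.episodes if entity in e["entities"]]

store = EpisodeStore()
store.add("2025-02-01",
          "Alice works on the payments team and owns the checkout service.",
          {"Alice", "payments", "checkout"})
store.add("2025-03-10",
          "Alice moved from the payments team to the platform team last week.",
          {"Alice", "payments", "platform"})

# Both facts remain retrievable; an LLM merge would have kept only the latest.
checkout_history = store.history("checkout")
```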

Detailed Scenario Breakdowns

Incident Response: Where Temporal Context Is Everything

The incident response scenario exposed the most dramatic difference. Incidents unfold over hours or days, involving dozens of messages, alerts, and code changes. The temporal ordering of events is critical to understanding root cause.

Example query: "What was the root cause of the March 3rd payments outage?"

CortexDB retrieves the post-mortem with the exact root cause, plus the original PagerDuty alert and the Slack thread where the team debugged it -- because the original episodes are preserved and connected through entity relationships.

The baseline returns a generic summary about payments issues, missing the specific root cause and timeline, because the details were lost during repeated LLM merges.

CortexDB scored 86% on incident response queries. The baseline system scored 28%.

Deployment Tracking: Where Precision Matters

Deployment queries are factual. "When was the last deploy?" has one correct answer. There is no room for approximate semantic matching.

CortexDB supports time-range queries natively and retrieves specific deployment episodes with exact dates and details.

The baseline system's merged memories had collapsed multiple config changes into summaries like "Various config changes were made to the auth and payments services recently." The specific dates, the specific changes, and the specific services were lost.

CortexDB scored 91% on deployment tracking. The baseline system scored 35%.
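Time-range retrieval over preserved episodes is mechanically simple because every episode carries its timestamp. A toy sketch with invented data (not CortexDB's query API):

```python
from datetime import date

# Hypothetical deployment episodes, each with its original timestamp.
deploys = [
    {"ts": date(2025, 2, 10), "service": "auth", "event": "deploy v1.4"},
    {"ts": date(2025, 2, 15), "service": "payments", "event": "rollback v2.1"},
    {"ts": date(2025, 3, 1),  "service": "auth", "event": "deploy v1.5"},
]

def in_range(episodes, start, end, service=None):
    """Return episodes within [start, end], optionally filtered by service."""
    return [e for e in episodes
            if start <= e["ts"] <= end
            and (service is None or e["service"] == service)]

# "When was the last time we deployed the auth service?"
last_auth = max(in_range(deploys, date(2025, 1, 1), date(2025, 3, 31), "auth"),
                key=lambda e: e["ts"])
```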

Knowledge Synthesis: Where Completeness Matters

Synthesis queries require aggregating information from many episodes. "What features did we decide not to build in Q4?" might require information from 15 different meeting notes and Slack conversations.

CortexDB retrieves all relevant episodes and lets the calling LLM synthesize them. Because the original episodes are preserved, no detail is lost before synthesis.

The baseline system's merged memories had already synthesized the information -- but through the lossy lens of repeated LLM merges. Features that were discussed briefly in one meeting and decided against in another had been dropped entirely from the merged memory.

CortexDB scored 78% on knowledge synthesis. The baseline system scored 29%.

Reproducing These Results

We are publishing the complete benchmark suite publicly:

```shell
git clone https://github.com/cortexdb/cortex-benchmark
cd cortex-benchmark

# Generate test data
python generate_scenarios.py --episodes 10000

# Run CortexDB benchmark
python benchmark.py --system cortexdb --endpoint http://localhost:8080

# Run baseline benchmark
python benchmark.py --system baseline --api-key $BASELINE_API_KEY

# Score results
python evaluate.py --results results/
```

The test data generator, query set, ground-truth labels, and scoring methodology are all included. We encourage the community to run these benchmarks independently and report results.

Methodology Notes

Alignment with LOCOMO Benchmark

Our evaluation methodology is informed by the LOCOMO benchmark (Maharana et al., 2024), which established a standardized framework for evaluating long-context memory in conversational AI. Key alignments:

  • Multi-hop reasoning queries -- questions that require connecting information across multiple episodes
  • Temporal reasoning queries -- questions about when events occurred relative to each other
  • Factual precision scoring -- binary correct/incorrect rather than subjective quality ratings

We extended LOCOMO's framework in two ways:

  1. Production-scale data volumes -- 10,000 episodes vs. LOCOMO's conversation-scale data
  2. Operational metrics -- write latency, error rates, and cost, which are critical for production deployments but not covered by LOCOMO

Limitations

  • The baseline system was tested via its cloud API. Self-hosted performance may differ.
  • Both systems were configured with default settings. Tuning may improve either system's results.
  • The test scenarios are weighted toward engineering and DevOps use cases. Results may vary for other domains.
  • LLM pricing changes over time. Cost figures are based on March 2025 pricing.

Conclusion

The memory system matters more than the model. Both systems used the same LLM and the same embedding model, yet CortexDB achieved 2.6x the retrieval accuracy at roughly 1/337th the total daily operating cost.

CortexDB preserves information. LLM-rewriting alternatives lose it. The benchmark results are a direct consequence of this difference.