A rigorous evaluation of event-sourced memory vs LLM-rewriting memory architectures across five production-scale scenarios, measuring retrieval accuracy, ingest performance, and operational cost.

CortexDB: Event-Sourced Memory for AI Systems -- A Production Benchmark

Abstract

We present a controlled benchmark comparing two architectural approaches to long-term memory for AI systems: event-sourced memory (CortexDB) and the leading LLM-rewriting system. Using identical language models, embedding models, and test data across five production-scale scenarios, we measure retrieval accuracy, ingest throughput, operational reliability, and cost. CortexDB achieves 84.2% retrieval accuracy compared to 31.8% for the LLM-rewriting approach, with 2.5x higher ingest throughput, zero write-path errors (vs. 4.2%), and estimated daily operational costs of $29 vs. $9,628 at moderate production scale. We attribute these differences to the fundamental information-preserving property of event sourcing versus the information-destroying property of LLM rewriting. We release the full benchmark suite, test data, and evaluation methodology publicly to enable independent reproduction.

1. Introduction

Long-term memory is an emerging requirement for AI systems that operate across sessions, accumulate knowledge over time, and serve users who expect continuity and personalization. The problem is well-characterized: large language models have no persistent state between invocations, and context windows, while growing, remain insufficient for encoding months or years of interaction history.

Two architectural approaches have emerged:

LLM-rewriting memory. When new information arrives, a language model reads the new information alongside existing stored memories and produces a merged, summarized version. The merged version replaces the previous memories. This approach is implemented by several commercial systems.

Event-sourced memory. When new information arrives, it is appended to an immutable log as a discrete event. The original content is never modified. Structured views (vector indexes, keyword indexes, knowledge graphs) are derived asynchronously from the event log. This approach is implemented by CortexDB.
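The contrast between the two write paths can be sketched in a few lines of Python. Class and method names here are illustrative stand-ins, not the actual APIs of CortexDB or any commercial system:

```python
import time

class EventSourcedStore:
    """Append-only write path: a log append, with no LLM involved.
    Views (vector, keyword, graph) would be derived asynchronously."""

    def __init__(self):
        self.log = []  # stand-in for an immutable on-disk event log

    def remember(self, content: str) -> None:
        self.log.append({"ts": time.time(), "content": content})


class RewritingStore:
    """Rewriting write path: every write routes through an LLM merge
    whose output replaces the previous memories (lossy by design)."""

    def __init__(self, llm):
        self.memory = ""  # single mutable, merged summary
        self.llm = llm    # callable prompt -> merged text (assumed interface)

    def remember(self, content: str) -> None:
        prompt = (f"Merge the new information into the existing "
                  f"memories:\n{self.memory}\n\nNew:\n{content}")
        self.memory = self.llm(prompt)  # original content is discarded
```

The key asymmetry is visible in the last line of each class: the event-sourced store keeps every original episode, while the rewriting store retains only whatever the merge call returns.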

This paper evaluates whether the choice of architecture meaningfully affects retrieval quality, operational characteristics, and cost at production scale.

2. Methodology

2.1 Experimental Design

We adopt a controlled experimental design where the memory architecture is the sole independent variable. All other components are held constant:

| Variable | Value | Rationale |
|---|---|---|
| Language model | GPT-4o (2025-01-01 checkpoint) | Current state-of-the-art general-purpose LLM |
| Embedding model | text-embedding-3-large (3072d) | High-dimensional embeddings for maximum discriminability |
| Test data | 10,000 episodes, 5 scenarios | Sufficient for statistical significance |
| Query set | 500 queries with ground-truth labels | Hand-labeled by domain experts |
| Hardware | AWS r6g.2xlarge (8 vCPU, 64 GB) | Representative production instance |
| Storage | gp3 SSD, 3000 IOPS baseline | Standard cloud block storage |

2.2 Test Scenarios

We designed five scenarios representative of production AI memory workloads:

Scenario 1: Engineering Incident Response (2,000 episodes). Simulated 6-month engineering organization with Slack messages, PagerDuty alerts, GitHub PRs, Jira tickets, and post-mortem documents across 15 distinct incidents. Queries require temporal reasoning, entity correlation, and multi-hop retrieval.

Scenario 2: Deployment Tracking (1,500 episodes). Deployment events, rollbacks, config changes, and related discussions across 8 microservices over 4 months. Queries require factual precision (exact dates, specific services) and temporal ordering.

Scenario 3: Knowledge Synthesis (3,000 episodes). Product team operations including standups, retrospectives, PRDs, design decisions, and customer feedback. Queries require aggregating information from multiple episodes across weeks or months.

Scenario 4: People and Relationships (2,000 episodes). Organizational data including team assignments, project ownership, mentorship, meeting notes, and announcements for a 50-person engineering organization. Queries require entity extraction, relationship tracking, and temporal graph reasoning.

Scenario 5: Multi-Session Continuity (1,500 episodes). AI assistant serving a single user across 100 sessions over 6 months. Includes preferences, decisions, project context, and personal information. Queries require cross-session continuity and preference recall.

2.3 Query Design

For each scenario, we created 100 queries (500 total) across four categories:

| Category | Count | Description |
|---|---|---|
| Factual | 150 | Single correct answer (e.g., "When was the last deploy of auth-service?") |
| Temporal | 100 | Requires time-based reasoning (e.g., "What changed between March 1 and March 15?") |
| Multi-hop | 125 | Requires connecting information across episodes (e.g., "Who fixed the issue Alice reported?") |
| Synthesis | 125 | Requires aggregating information from multiple episodes (e.g., "Summarize Q4 decisions") |

Each query has a hand-labeled ground-truth answer specifying its relevant episodes; a query is scored as correct when any of these episodes appears in the top-5 results.

2.4 Evaluation Metrics

Retrieval accuracy (primary). Binary correct/incorrect. A query is correct if any of its ground-truth episodes appear in the system's top-5 results.

Mean Reciprocal Rank (MRR). The reciprocal of the rank of the first correct result, averaged across all queries. Measures not just whether the answer is found, but how highly it is ranked.

Ingest throughput. Episodes per second, sustained over a 10-minute window.

Write-path latency. p50 and p99 latency for single-episode ingestion.

Write-path error rate. Percentage of write operations that return an error or timeout.

Operational cost. Estimated daily cost at a sustained ingestion rate of 100 episodes per minute.
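The two retrieval-quality metrics can be made precise in a short sketch; episode IDs and the data layout are illustrative:

```python
def query_correct(ranked_ids, ground_truth_ids, k=5):
    """A query is correct if any ground-truth episode is in the top-k."""
    return any(eid in ground_truth_ids for eid in ranked_ids[:k])

def mean_reciprocal_rank(results):
    """results: list of (ranked_ids, ground_truth_ids) pairs.
    Adds 1/rank for the first correct result of each query."""
    total = 0.0
    for ranked_ids, truth in results:
        for rank, eid in enumerate(ranked_ids, start=1):
            if eid in truth:
                total += 1.0 / rank
                break  # only the first correct result counts
    return total / len(results)
```

A query whose first relevant episode lands at rank 2 contributes 0.5 to MRR but still counts as fully correct for accuracy, which is why the two metrics can diverge.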

2.5 Alignment with LOCOMO

Our methodology is informed by the LOCOMO benchmark (Maharana et al., 2024), which establishes a standardized framework for evaluating long-context memory in conversational AI. We adopt LOCOMO's emphasis on multi-hop reasoning queries, temporal reasoning queries, and factual precision scoring.

We extend the LOCOMO framework in two dimensions:

  1. Scale. LOCOMO operates at conversation scale (hundreds of turns). We operate at production scale (10,000 episodes from multiple sources). This tests whether the architecture scales beyond single-conversation contexts.

  2. Operational metrics. LOCOMO measures retrieval quality only. We additionally measure ingest throughput, latency, error rates, and cost -- metrics critical for production deployment decisions.

3. Results

3.1 Overall Retrieval Accuracy

| System | Accuracy | MRR | 95% CI (Accuracy) |
|---|---|---|---|
| CortexDB | 84.2% | 0.78 | [81.0%, 87.4%] |
| Baseline | 31.8% | 0.24 | [27.7%, 35.9%] |

The difference of 52.4 percentage points is statistically significant (p < 0.001, two-proportion z-test).
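The significance test is straightforward to reproduce. The sketch below applies a pooled two-proportion z-test to the 500-query totals:

```python
import math

def two_proportion_z(p1, p2, n1, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p = 1 - math.erf(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p

z, p = two_proportion_z(0.842, 0.318, 500, 500)
# z comes out near 16.8, far beyond the ~3.29 needed for p < 0.001
```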

3.2 Accuracy by Scenario

| Scenario | CortexDB | Baseline | Delta | p-value |
|---|---|---|---|---|
| Incident Response | 86.0% | 28.0% | +58.0 | < 0.001 |
| Deployment Tracking | 91.0% | 35.0% | +56.0 | < 0.001 |
| Knowledge Synthesis | 78.0% | 29.0% | +49.0 | < 0.001 |
| People & Relationships | 82.0% | 34.0% | +48.0 | < 0.001 |
| Multi-Session Continuity | 84.0% | 33.0% | +51.0 | < 0.001 |

CortexDB outperforms across all scenarios. The largest deltas are in Incident Response (+58.0) and Deployment Tracking (+56.0), the two scenarios that involve the most temporal reasoning. The smallest delta is in People & Relationships (+48.0), where both systems benefit from entity-name keyword matching.

3.3 Accuracy by Query Category

| Category | CortexDB | Baseline | Delta |
|---|---|---|---|
| Factual | 89.3% | 38.0% | +51.3 |
| Temporal | 85.0% | 18.0% | +67.0 |
| Multi-hop | 82.4% | 32.8% | +49.6 |
| Synthesis | 78.4% | 33.6% | +44.8 |

The largest gap is in temporal queries (+67.0), where the baseline system's rewriting approach systematically destroys temporal information. The baseline scores only 18% on temporal queries -- the rewriting process merges events from different time periods, making time-specific retrieval nearly impossible.

3.4 Retrieval Strategy Contribution

CortexDB's hybrid retrieval was ablated to measure each strategy's contribution:

| Configuration | Accuracy | Delta vs. Full |
|---|---|---|
| Full hybrid retrieval | 84.2% | -- |
| Without connected context | 66.4% | -17.8 |
| Without keyword matching | 71.2% | -13.0 |
| Without semantic matching | 68.8% | -15.4 |
| Any single strategy alone | 38.6% - 44.2% | -40.0 to -45.6 |

No single retrieval strategy achieves competitive accuracy alone: any individual strategy trails the full hybrid by 40.0 to 45.6 points, and removing even one strategy from the hybrid costs 13.0 to 17.8 points. Connected context (via the knowledge graph) contributes the most unique value, consistent with its ability to find related information through entity connections.
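This section does not specify how CortexDB fuses its strategies. Reciprocal rank fusion (RRF) is one standard way to combine ranked candidate lists from keyword, semantic, and graph-based retrieval, shown here purely as an illustrative assumption:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Combine several best-first ranked lists into one ranking.

    k dampens the influence of any single list's top ranks; 60 is
    the conventional default for RRF.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion([
    ["e12", "e7", "e3"],  # keyword matches
    ["e7", "e12", "e9"],  # semantic matches
    ["e7", "e3"],         # connected context via entity graph
])
# "e7" ranks first: it appears near the top of all three lists
```

Rank-based fusion like this rewards episodes that multiple independent strategies agree on, which is one plausible mechanism behind the ablation results above.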

3.5 Ingest Performance

| Metric | CortexDB | Baseline |
|---|---|---|
| Sustained throughput | 850 eps/s | 340 eps/s |
| p50 write latency | 4ms | 680ms |
| p99 write latency | 12ms | 2,400ms |
| Write-path error rate | 0.0% | 4.2% |
| Write-path LLM calls | 0 | 10,000 |

CortexDB's write path is a disk append with no external service dependencies. The baseline system's write path requires an LLM call for every episode to merge it with existing memories, introducing two orders of magnitude more latency, a 4.2% error rate from LLM failures and timeouts, and a linear scaling of LLM costs with write volume.

3.6 Cost Analysis

At a sustained ingestion rate of 100 episodes per minute (6,000 episodes per hour, representative of a 50-person engineering team with Slack, GitHub, Jira, and PagerDuty connectors):

| Component | CortexDB | Baseline |
|---|---|---|
| LLM calls (write path) | $0.00 | $9,600.00 |
| Embedding generation | $18.00 | $18.00 |
| Compute (r6g.2xlarge) | $8.20 | $8.20 |
| Storage (gp3 SSD) | $2.40 | $1.80 |
| Total daily cost | $28.60 | $9,628.00 |

The cost ratio is 337:1 in favor of CortexDB. The dominant cost for the baseline system is LLM inference on the write path, which scales linearly with ingest volume. CortexDB's costs are dominated by embedding generation and compute, both of which scale sub-linearly.

Note

Baseline write-path LLM cost assumes GPT-4o at $5/1M input tokens and $15/1M output tokens, with approximately 1,000 input tokens and 500 output tokens per memory operation. Actual costs depend on episode length and LLM pricing.
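The note's per-operation arithmetic can be parameterized. How many LLM memory operations a single ingested episode triggers is workload-dependent, so it is left as a free parameter in this sketch:

```python
def daily_write_path_cost(episodes_per_minute, ops_per_episode,
                          input_tokens=1_000, output_tokens=500,
                          input_price=5.00, output_price=15.00):
    """Estimated daily LLM write-path cost.

    Prices are dollars per 1M tokens (GPT-4o values from the note);
    ops_per_episode is how many LLM merge operations one ingested
    episode triggers -- workload-dependent, hence a parameter.
    """
    per_op = (input_tokens * input_price
              + output_tokens * output_price) / 1_000_000
    episodes_per_day = episodes_per_minute * 60 * 24
    return per_op * episodes_per_day * ops_per_episode

daily_write_path_cost(100, ops_per_episode=1)  # ~$1,800/day at one merge call per episode
```

Each merge operation costs $0.0125 under the note's token assumptions; the daily total scales linearly with both ingest rate and the number of merge calls per episode.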

4. Analysis

4.1 Why Event Sourcing Outperforms

The retrieval accuracy gap is a direct consequence of information preservation vs. information destruction on the write path.

LLM rewriting is a lossy compression applied to every write. When the baseline system merges a new episode with existing memories, the LLM decides what to keep, what to summarize, and what to discard. This decision is irreversible. Information that the LLM deemed unimportant at write time may be exactly what a future query needs.

Qualitative analysis of the baseline system's failure cases reveals three dominant failure modes:

  1. Temporal collapse (41% of failures). Multiple events from different time periods are merged into a single memory that loses temporal specificity. A query asking "What happened on March 3rd?" cannot be answered because March 3rd's events have been merged with events from other dates.

  2. Detail elision (35% of failures). Specific details -- exact numbers, configuration values, error messages, people's names -- are dropped during summarization. A query asking "What was the error rate during the outage?" fails because the LLM summarized "error rate exceeded 5.2% for 47 minutes" as "there was a significant error rate increase."

  3. Causal chain breaking (24% of failures). The causal relationship between events is lost during merging. Event A caused Event B, but after merging, the connection is gone. Multi-hop queries that follow causal chains fail.

Event sourcing preserves all information on the write path. CortexDB appends the raw episode to an immutable log. No LLM is involved. The original content is available for any future query, regardless of what the system considered important at write time.

Information filtering in CortexDB happens at query time (retrieval, ranking, deduplication), not at write time. This filtering is:

  • Query-specific -- different queries produce different filtered views
  • Reversible -- a different query can retrieve the information that was filtered out
  • Improvable -- as retrieval algorithms improve, past data benefits retroactively

4.2 Architectural Advantages

Beyond retrieval accuracy, event sourcing provides several structural advantages for AI memory:

Temporal queries. Because events are stored with their original timestamps and never merged, CortexDB supports point-in-time queries, change detection, and temporal ordering natively. These query types are impossible in a rewriting architecture.

Rebuild capability. All derived views can be rebuilt from the source data. A corrupted index is an operational annoyance, not a data loss event. In a rewriting architecture, the stored memories are the only copy of the data; corruption is catastrophic.

Schema evolution. New retrieval strategies or extraction pipelines can be added and backfilled from historical data. If CortexDB adds a new capability in the future, it can be applied to all existing data without migration.

Crash durability. CortexDB guarantees that once a client receives an acknowledgment, the data is durable on disk. Processing happens asynchronously; a crash during processing does not lose data. In a rewriting architecture, a crash during the LLM rewrite may leave the memory store in an inconsistent state.
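The structural properties above, point-in-time queries and rebuildable views in particular, follow from a very small core. The toy below illustrates them; it is not CortexDB's actual schema or API:

```python
from datetime import datetime, timezone

class EventLog:
    """Toy append-only log illustrating the advantages above."""

    def __init__(self):
        self._events = []  # append-only; entries are never mutated

    def append(self, ts: datetime, content: str) -> None:
        self._events.append((ts, content))

    def as_of(self, ts: datetime):
        """Point-in-time query: everything known at time ts."""
        return [(t, c) for t, c in self._events if t <= ts]

    def rebuild_keyword_index(self) -> dict:
        """A derived view; a corrupted copy can always be rebuilt
        from the log, so index loss is not data loss."""
        index = {}
        for ts, content in self._events:
            for word in content.lower().split():
                index.setdefault(word, []).append(ts)
        return index
```

Because original timestamps survive every append, `as_of` needs no special machinery, whereas a rewriting store has no equivalent: the merged summary reflects only its latest state.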

4.3 Trade-Offs

Event sourcing is not without trade-offs:

Storage cost. CortexDB stores more data than a rewriting system because it preserves all episodes rather than compressing them through summarization. In our benchmark, CortexDB used approximately 1.3x more storage than the baseline system for the same dataset.

Read-path complexity. Because the "current state" is not pre-computed, CortexDB must run a more complex retrieval pipeline at query time. This adds ~10ms of read-path latency compared to a simple vector search.

View consistency. Derived views are eventually consistent with the primary data. A recall immediately after remember will find the raw content but may not yet reflect all derived information. Rewriting architectures maintain a consistent merged view.

5. Threats to Validity

External validity. Our scenarios are weighted toward engineering and DevOps use cases. Results may differ for other domains (e.g., healthcare, legal, personal assistant). We encourage the community to extend the benchmark with additional scenarios.

Baseline configuration. The baseline system was tested with default settings via its cloud API. Custom configurations or self-hosted deployments with tuned settings may produce different results.

LLM sensitivity. Both systems depend on LLMs (the baseline system for write-path rewriting, CortexDB for read-path entity extraction). Results may vary with different LLMs or model versions.

Scale limitations. The benchmark uses 10,000 episodes. Production systems may operate at orders of magnitude larger scale, where the relative performance characteristics could shift.

Evaluation methodology. Binary correct/incorrect scoring does not capture partial credit for nearly-correct answers. A more nuanced scoring method might narrow (or widen) the gap.

6. Related Work

LOCOMO (Maharana et al., 2024). A benchmark for long-context memory in conversational AI. Our methodology extends LOCOMO to production-scale data volumes and operational metrics.

MemGPT (Packer et al., 2023). An operating-system-inspired approach to LLM memory management using hierarchical memory tiers. MemGPT focuses on extending effective context within a single session rather than cross-session persistence.

RAG (Lewis et al., 2020). Retrieval-Augmented Generation established the pattern of augmenting LLM generation with retrieved documents. CortexDB's hybrid retrieval builds on RAG's foundation with multi-strategy fusion and knowledge graph integration.

Event Sourcing (Fowler, 2005; Betts et al., 2013). The event sourcing pattern originates from domain-driven design and has been applied extensively in financial systems and microservice architectures. CortexDB adapts event sourcing specifically for AI memory workloads.

7. Conclusion

The memory architecture is the dominant factor in AI memory system performance. Using identical language models and embedding models, event-sourced memory (CortexDB) achieves 84.2% retrieval accuracy compared to 31.8% for the LLM-rewriting system, a 52.4 percentage point difference.

This gap is not an implementation artifact. It is a structural consequence of the information-preserving vs. information-destroying properties of the two architectures. Event sourcing guarantees that no information is lost on the write path. LLM rewriting guarantees that information is lost on every write.

For AI systems that must operate across sessions, accumulate knowledge over time, and answer queries that require temporal reasoning, causal chain following, or multi-hop retrieval, event-sourced memory is the superior architecture.

The benchmark suite, test data, and evaluation methodology are available at github.com/pmalik/cortex-benchmark.

References

  1. Maharana, A., Lee, D., Tulyakov, S., Bansal, M., Barbieri, F., & Fang, Y. (2024). LOCOMO: Long-Context Memory Benchmark for Evaluating Conversational AI Systems.

  2. Packer, C., Wooders, S., Lin, K., Fang, V., Patil, S. G., Stoica, I., & Gonzalez, J. E. (2023). MemGPT: Towards LLMs as Operating Systems.

  3. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., ... & Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.

  4. Fowler, M. (2005). Event Sourcing. martinfowler.com.

  5. Betts, D., Dominguez, J., Melnik, G., Simonazzi, F., & Subramanian, M. (2013). Exploring CQRS and Event Sourcing. Microsoft patterns & practices.