Every recall-strategy knob — graph retrieval, HyDE, multihop, salience, reranker, and the constants that aren't (yet) configurable.
Recall Tuning
CortexDB's recall pipeline runs six retrieval channels in parallel and fuses them with reciprocal rank fusion. About a dozen env vars expose knobs into that pipeline; another two dozen are compiled constants that we tuned against LongMemEval-S and LoCoMo and didn't expose.
This page covers what's tunable and when to touch it. Most deployments shouldn't tune any of these: the defaults are the configuration behind the published 93.8% on LongMemEval-S.
The pipeline at a glance
Query
│
├─► Query routing ─────────────────────────► question_type
│                                            ('single-session-user', 'multi-session', ...)
│
├─► (optional) HyDE multiquery expansion ──► N hypothesized passages
│                                            embedded → query vectors
│
├─► (optional) Multihop query planner ─────► M follow-up queries
│                                            generated by LLM
│
├──► Run K parallel retrieval channels:
│      • Vector (HNSW)
│      • Fulltext (BM25 + WordNet)
│      • Entity-name (exact / fuzzy)
│      • Synonym
│      • Graph BFS (KG edges around seed entities)
│      • Temporal (recency window + decay)
│
├──► Reciprocal rank fusion (RRF, k=60) ──► fused candidate list
│
├──► Cross-encoder rerank (optional, ~25-40 candidates) ──► reranked top-K
│
└──► Build response pack (citations, beliefs, episodes)
Each stage has knobs. Skipping a stage saves latency at the cost of recall accuracy. The defaults keep every stage on — the published benchmark numbers depend on the full pipeline.
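To make the fusion stage concrete, here is a minimal sketch of reciprocal rank fusion with k=60 over ranked ID lists. The function name `rrf_fuse`, the optional per-channel `weights` argument, and the way a weight scales a channel's contribution are assumptions for illustration; the actual fusion code in CortexDB may combine channels differently.

```python
from collections import defaultdict

RRF_K = 60.0  # smoothing constant shown in the pipeline diagram


def rrf_fuse(channels, weights=None, k=RRF_K):
    """Fuse ranked candidate lists with reciprocal rank fusion.

    channels: dict of channel name -> list of candidate IDs, best first.
    weights:  optional dict of channel name -> multiplier (hypothetical;
              channels not listed default to 1.0).
    """
    weights = weights or {}
    scores = defaultdict(float)
    for name, ranked in channels.items():
        w = weights.get(name, 1.0)
        for rank, doc_id in enumerate(ranked, start=1):
            # Each channel contributes 1 / (k + rank) for every ID it returns.
            scores[doc_id] += w / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


fused = rrf_fuse({
    "vector": ["m1", "m2", "m3"],
    "fulltext": ["m2", "m4"],
    "graph": ["m3", "m2"],
})
print(fused)  # ['m2', 'm3', 'm1', 'm4']
```

A candidate that appears in several channels (`m2` here) outranks one that tops a single channel, which is why skipping a channel can shift the fused order even when that channel's own top hit was weak.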
Graph retrieval
The KG channel walks edges around entities mentioned in the query.
| Env var | Default | What it controls |
|---|---|---|
| `CORTEX_GRAPH_RETRIEVAL_DISABLE` | (unset) | Set to `1` to skip the graph channel entirely. Costs roughly 3-5 pp on multi-session in our A/B. |
| `CORTEX_GRAPH_RETRIEVAL_TOP_K` | 40 (single-session) / 120 (multi-session) | Number of graph-derived candidates passed to fusion. |
When to change:
- Disable graph retrieval (`CORTEX_GRAPH_RETRIEVAL_DISABLE=1`) if you're on the voice/realtime hot path and willing to trade ~3 pp of accuracy for ~150 ms of latency.
- Bump `TOP_K` to 80 (single) / 240 (multi) for query types where you expect entity-rich answers (e.g. "what did X say about Y at Z time"). Diminishing returns past these values.
Constants you cannot currently override (in cortex-coordinator/src/recall.rs):
- `GRAPH_RETRIEVAL_MAX_ENTITIES = 48` (cap on KG entities to walk from)
- `GRAPH_RETRIEVAL_MAX_EDGES = 512` (cap on edges to traverse per walk)
- `GRAPH_RETRIEVAL_MAX_EPISODES = 256` (cap on episodes pulled by walk)
- `GRAPH_WEIGHT = 0.20` (graph channel's contribution to fusion)
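The three caps can be read as bounds on a breadth-first walk. The sketch below is one such reading, not the code in `recall.rs`: the function `graph_candidates`, the adjacency-dict graph shape, and the idea that each edge carries one episode ID are all assumptions made for illustration.

```python
from collections import deque

GRAPH_RETRIEVAL_MAX_ENTITIES = 48
GRAPH_RETRIEVAL_MAX_EDGES = 512
GRAPH_RETRIEVAL_MAX_EPISODES = 256


def graph_candidates(graph, seeds,
                     max_entities=GRAPH_RETRIEVAL_MAX_ENTITIES,
                     max_edges=GRAPH_RETRIEVAL_MAX_EDGES,
                     max_episodes=GRAPH_RETRIEVAL_MAX_EPISODES):
    """Bounded BFS from seed entities, returning episode IDs in discovery order.

    graph: dict of entity -> list of (neighbour, episode_id) pairs
           (hypothetical shape; the real KG storage is not shown here).
    """
    # Deduplicate seeds while keeping order, then cap how many we walk from.
    frontier = deque(list(dict.fromkeys(seeds))[:max_entities])
    seen = set(frontier)
    episodes, episode_set = [], set()
    edges_walked = 0
    while frontier:
        entity = frontier.popleft()
        for neighbour, episode_id in graph.get(entity, []):
            # Stop as soon as either the edge cap or the episode cap is hit.
            if edges_walked >= max_edges or len(episodes) >= max_episodes:
                return episodes
            edges_walked += 1
            if episode_id not in episode_set:
                episode_set.add(episode_id)
                episodes.append(episode_id)
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append(neighbour)
    return episodes


kg = {
    "alice": [("bob", "e1"), ("carol", "e2")],
    "bob": [("dave", "e3")],
    "carol": [("alice", "e2")],
}
print(graph_candidates(kg, ["alice"]))               # ['e1', 'e2', 'e3']
print(graph_candidates(kg, ["alice"], max_edges=2))  # ['e1', 'e2']
```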
HyDE multiquery expansion
HyDE (Hypothetical Document Embeddings) asks an LLM to write a hypothetical passage that would answer the query, then embeds that passage instead of (or in addition to) the literal query. This captures meaning even when the query and the stored memory use very different words.
| Env var | Default | What it controls |
|---|---|---|
| `CORTEX_HYDE_PASSAGES_MS` | 1 | Number of hypothetical passages to generate for multi-session queries. Set to `0` to disable HyDE for multi-session. |
| `CORTEX_HYDE_MULTIQUERY_DISABLED_TYPES` | `multi-session,open-domain` | Comma-separated question types where HyDE is off. |
When to change:
- Set `CORTEX_HYDE_MULTIQUERY_DISABLED_TYPES=single-session-user,single-session-assistant,multi-session,open-domain` to disable HyDE entirely. Saves one LLM round-trip (~150-400 ms). Costs ~1-2 pp on the queries where stored phrasing doesn't match query phrasing.
- Bump `CORTEX_HYDE_PASSAGES_MS=3` to generate three hypothetical passages with temperatures `[0.3, 0.6, 0.9]`, giving wider semantic coverage. Triples the HyDE LLM cost.
Compiled constants:
- `HYDE_MS_TEMPERATURES = [0.3, 0.6, 0.9]`: the temperature schedule for multi-passage HyDE
- `DEFAULT_HYDE_MULTIQUERY_DISABLED_TYPES = ["multi-session", "open-domain"]`: overridable via the env var above
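The flow described in this section can be sketched as follows. This is an illustration under stated assumptions rather than CortexDB's implementation: `generate` and `embed` are caller-supplied stand-ins for the LLM and embedding model, the prompt text is made up, and the sketch assumes the hypothetical passages are used in addition to the literal query.

```python
import os

HYDE_MS_TEMPERATURES = [0.3, 0.6, 0.9]
DEFAULT_DISABLED = "multi-session,open-domain"


def disabled_types(env=os.environ):
    """Parse the comma-separated list of question types where HyDE is off."""
    raw = env.get("CORTEX_HYDE_MULTIQUERY_DISABLED_TYPES", DEFAULT_DISABLED)
    return {t.strip() for t in raw.split(",") if t.strip()}


def hyde_query_vectors(query, question_type, generate, embed,
                       passages=1, env=os.environ):
    """Return the literal query vector plus one vector per hypothetical passage.

    generate(prompt, temperature) -> str and embed(text) -> vector are
    hypothetical callables supplied by the caller.
    """
    vectors = [embed(query)]
    if passages <= 0 or question_type in disabled_types(env):
        return vectors  # HyDE skipped: only the literal query is embedded
    for temperature in HYDE_MS_TEMPERATURES[:passages]:
        passage = generate(f"Write a short passage that answers: {query}",
                           temperature)
        vectors.append(embed(passage))
    return vectors


# Stub model calls so the sketch runs without any external service.
fake_generate = lambda prompt, temperature: f"passage at T={temperature}"
fake_embed = lambda text: [float(len(text))]

vecs = hyde_query_vectors("where does Ana work?", "single-session-user",
                          fake_generate, fake_embed, passages=3, env={})
print(len(vecs))  # 4
```

Each extra passage costs one more generation call, which is where the tripled LLM cost at three passages comes from.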
Multihop query planner
For complex queries, an LLM plans 2-N follow-up queries that explore related angles, then runs all of them through retrieval.
| Env var | Default | What it controls |
|---|---|---|
| `CORTEX_MULTIHOP_QUERY_PLANNER_DISABLE` | (unset) | Set to `1` to disable multihop entirely. |
| `CORTEX_MULTIHOP_QUERY_PLANNER_TYPES` | `multi-session,open-domain` | Comma-separated types where multihop runs. |
| `CORTEX_MULTIHOP_QUERY_COUNT` | 4 | Number of follow-up queries the planner generates. |
| `CORTEX_MULTIHOP_MAX_QUERY_FANOUT` | 5 | Cap on simultaneously executing planned queries. |
| `CORTEX_MULTIHOP_COVERAGE_ORDER_DISABLE` | (unset) | Set to `1` to use the original LLM-emitted order instead of coverage-optimal reordering. |
When to change:
- Set `CORTEX_MULTIHOP_QUERY_PLANNER_TYPES=` (empty) for latency-sensitive deployments. Saves 1-3 LLM round-trips per recall. Costs ~1-3 pp on complex multi-session queries.
- Lower `CORTEX_MULTIHOP_QUERY_COUNT=2` for a middle ground: keeps the planner but reduces its fanout.
- Raise `CORTEX_MULTIHOP_QUERY_COUNT=6` and `CORTEX_MULTIHOP_MAX_QUERY_FANOUT=8` for offline question-answering where wall-clock doesn't matter.
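One way to picture how the query count and the fanout cap interact: the count limits how many planned queries exist, and the fanout limits how many of them run at once. The sketch below assumes that reading; `run_planned_queries` and the `retrieve` callable are hypothetical names, and a thread pool stands in for whatever concurrency mechanism the coordinator actually uses.

```python
from concurrent.futures import ThreadPoolExecutor


def run_planned_queries(queries, retrieve, query_count=4, max_fanout=5):
    """Run up to query_count planned queries, at most max_fanout at a time.

    retrieve(query) -> result is a caller-supplied stand-in for one full
    pass through the retrieval channels.
    """
    planned = queries[:query_count]
    if not planned:
        return []
    with ThreadPoolExecutor(max_workers=max_fanout) as pool:
        # map() yields results in the planner's order, not completion order.
        return list(pool.map(retrieve, planned))


followups = ["q1", "q2", "q3", "q4", "q5", "q6"]
print(run_planned_queries(followups, str.upper))                 # ['Q1', 'Q2', 'Q3', 'Q4']
print(run_planned_queries(followups, str.upper, query_count=2))  # ['Q1', 'Q2']
```

Under this reading the fanout cap only binds once the query count exceeds it, which may be why the offline suggestion above raises both values together.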
Salience prior
Salience is a per-memory importance score, updated by access patterns over time. The recall ranker can incorporate it as a prior — recently-accessed memories get a small boost.
| Env var | Default | What it controls |
|---|---|---|
| `CORTEX_SALIENCE_WEIGHT` | 0.10 | Weight of salience score in final ranking. Range [0.0, 1.0]. |
| `CORTEX_AUTO_ROUTE` | (unset) | Set to `1` to let the router pick per-query-type salience weights automatically. |
When to change:
- For "what's recently relevant" agents (companion bots, daily assistant), bump `CORTEX_SALIENCE_WEIGHT=0.20` to lean harder on recency.
- For historical-archive workloads (legal discovery, CRM search), set `CORTEX_SALIENCE_WEIGHT=0.0`, because the right answer might be from years ago, not last week.
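To see why a small change in the weight can matter, consider a toy ranking. The linear blend below is an assumption chosen for illustration; this page does not specify how the ranker combines salience with relevance, and the names `blend`, `rank`, and the example scores are all made up.

```python
def blend(relevance, salience, weight=0.10):
    """Linear blend of relevance and salience (assumed form, not confirmed)."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0.0, 1.0]")
    return (1.0 - weight) * relevance + weight * salience


def rank(candidates, weight=0.10):
    """candidates: list of (memory_id, relevance, salience), scores in [0, 1]."""
    ordered = sorted(candidates, key=lambda c: blend(c[1], c[2], weight),
                     reverse=True)
    return [memory_id for memory_id, _, _ in ordered]


candidates = [
    ("old_contract", 0.80, 0.05),    # strong match, rarely accessed
    ("last_week_chat", 0.70, 0.90),  # weaker match, accessed recently
]
print(rank(candidates, weight=0.0))   # ['old_contract', 'last_week_chat']
print(rank(candidates, weight=0.10))  # ['old_contract', 'last_week_chat']
print(rank(candidates, weight=0.20))  # ['last_week_chat', 'old_contract']
```

In this toy case the default weight leaves the stronger match on top, while 0.20 lets the recently accessed memory overtake it, which is the behavior an archival workload wants to avoid.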
Entity-vector seeding
Hybrid signal: take the query's entity mentions, look up their canonical vectors, use those as additional query vectors.
| Env var | Default | What it controls |
|---|---|---|
| `CORTEX_ENTITY_VECTOR_SEED_ENABLE` | (unset) | Set to `1` to enable. Off by default. |
| `CORTEX_VECTOR_TENANT_OVERFETCH` | 1.25 | Tenant-aware overfetch multiplier for the vector channel: fetch 25% more candidates so the post-filter on scope still hits the target K. |
Compiled constants for entity-vector seeding:
- `ENTITY_VECTOR_SEED_TOP_K = 10` (vectors per entity)
- `ENTITY_VECTOR_SEED_MIN_SIMILARITY = 0.40` (cosine floor)
- `ENTITY_VECTOR_SPAN_LIMIT = 5` (entities per query)
- `ENTITY_VECTOR_PER_SPAN_TOP_K = 5` (vectors per entity span)
When to enable: queries with named entities that don't match stored phrasing exactly ("the customer in Boston" vs stored "Acme HQ in MA"). Adds one HNSW lookup per entity span.
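The overfetch multiplier in the table above is simple arithmetic, sketched here with hypothetical helper names (`overfetch_k`, `scoped_top_k`) and a caller-supplied `search` function standing in for the HNSW lookup. Rounding up with `ceil` is an assumption.

```python
import math


def overfetch_k(target_k, multiplier=1.25):
    """Number of candidates to request so a post-filter can still fill target_k."""
    return math.ceil(target_k * multiplier)


def scoped_top_k(search, in_scope, target_k, multiplier=1.25):
    """Fetch extra candidates, drop the out-of-scope ones, keep target_k.

    search(k) -> ranked candidates and in_scope(candidate) -> bool are
    caller-supplied stand-ins.
    """
    candidates = search(overfetch_k(target_k, multiplier))
    return [c for c in candidates if in_scope(c)][:target_k]


print(overfetch_k(40))   # 50
print(overfetch_k(160))  # 200
```

If more than a fifth of the overfetched candidates fall outside the scope, the filtered list still comes up short of the target K, so heavily shared indexes may need a larger multiplier.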
Reranker
A cross-encoder model that takes the top ~25-40 fused candidates and re-scores each (query, candidate) pair as a unit, producing a final ranking.
| Env var | Default | What it controls |
|---|---|---|
| `CORTEX_RERANKER_PROVIDER` | (empty = disabled) | `cohere` or `local`. Empty disables the reranker. |
| `CORTEX_RERANKER_MODEL` | `rerank-v3.5` (Cohere) | Model name. |
| `CORTEX_RERANKER_MODEL_PATH` | (none) | Path to a local ONNX model when `PROVIDER=local`. |
When to change:
- Enable `CORTEX_RERANKER_PROVIDER=cohere` + `COHERE_API_KEY=...` for ~+2 pp on noisy recall sets (mixed-corpus, conversational queries). Costs ~$0.001 per recall. Adds ~80-200 ms.
- Use `CORTEX_RERANKER_PROVIDER=local` + `CORTEX_RERANKER_MODEL_PATH=/models/bge-reranker.onnx` for self-contained deployments. Slower (~200-500 ms on CPU) but free.
- Leave disabled for voice / sub-100ms paths.
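The rerank step itself is small once a scoring model is available. In this sketch `score_pair` is a caller-supplied stand-in for the cross-encoder (Cohere or a local ONNX model), the pool size of 25 mirrors the lower end of the range above, and appending the unscored tail in its fused order is an assumption.

```python
def rerank(query, fused, score_pair, pool=25):
    """Re-score the top `pool` fused candidates; keep the tail in fused order.

    score_pair(query, candidate) -> float is a hypothetical callable that
    scores one (query, candidate) pair as a unit.
    """
    head, tail = fused[:pool], fused[pool:]
    # sorted() is stable, so equal scores keep their fused order.
    rescored = sorted(head, key=lambda text: score_pair(query, text),
                      reverse=True)
    return rescored + tail


# Toy scorer: counts query words that appear in the candidate text.
overlap = lambda query, text: sum(word in text for word in query.split())

fused = ["cats purr", "dogs bark loudly", "the dog barked", "unrelated"]
print(rerank("dog bark", fused, overlap, pool=3))
# ['dogs bark loudly', 'the dog barked', 'cats purr', 'unrelated']
```

Cost and latency scale with the pool size rather than with the full candidate list, since only the head is sent to the model.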
Question-type executor switches
The /v1/answer endpoint routes queries through type-specific executors. These flags disable specific paths for A/B testing or to work around bugs.
| Env var | Default | What it controls |
|---|---|---|
| `CORTEX_MS_EXECUTOR_DISABLE` | (unset) | Disable the multi-session-optimized executor. Falls back to single-session path. |
| `CORTEX_MS_EVIDENCE_PACK_DISABLE` | (unset) | Skip evidence-pack assembly for multi-session. Smaller answers, less context. |
| `CORTEX_MS_COUNT_RELEVANCE_ENABLE` | (unset) | Enable count-based relevance scoring for multi-session (experimental). |
| `CORTEX_MS_STAGE_C_USE_VERIFIER` | (unset) | Use the verifier model in stage-C answer formatting for multi-session. |
| `CORTEX_MS_RETRY_ON_ABSTAIN` | (unset) | Retry recall when multi-session executor abstains. |
| `CORTEX_DIRECT_LOOKUP_RETRY_ON_ABSTAIN` | (unset) | Retry direct-lookup queries on abstain. |
| `CORTEX_TEMPORAL_EXTRACT_DISABLE` | (unset) | Disable temporal-phrase extraction (relative dates, durations). |
| `CORTEX_COMPOSITIONAL_ENABLE` | (unset) | Enable typed-arithmetic compositional answers (experimental). |
| `CORTEX_ENUMERATE_COUNT_ENABLE` | (unset) | Enable count-enumeration answers ("how many X are there"). |
| `CORTEX_SESSION_LEVEL_EXPANSION_ENABLE` | (unset) | Expand recall context at session level rather than per-message. |
| `CORTEX_ANSWER_SHAPE_EXECUTOR_USE_VERIFIER` | (unset) | Use verifier in shape-aware executor. |
| `CORTEX_FACT_EVENT_PROMOTION_ENABLE` | (unset) | Promote facts to event-level relevance. |
| `CORTEX_FACT_VALIDITY_FILTER` | (unset) | Filter recalled facts by validity windows (bi-temporal). |
Default operator stance: don't touch any of these. They exist for our benchmark tooling to validate routing decisions. The compiled defaults are the configuration that produced 93.8% on LongMemEval-S.
If you're debugging a specific recall failure ("multi-session is hallucinating") and have an A/B harness, flipping CORTEX_MS_EVIDENCE_PACK_DISABLE=1 is a reasonable diagnostic to confirm the evidence pack is the culprit.
Synchronous-write kill switches
These don't affect recall directly — they affect the write path, which in turn affects when recall has the data available.
| Env var | Default | What it controls |
|---|---|---|
| `CORTEX_SYNC_FACT_EXTRACT_DISABLE` | (unset) | Skip synchronous fact extraction on write. All extraction goes async. Faster writes; recall lags. |
| `CORTEX_SYNC_FACT_EXTRACT_MAX_SESSIONS` | (no cap) | Cap on sessions processed sync per batch. |
| `CORTEX_SYNC_GRAPH_SEED_DISABLE` | (unset) | Skip synchronous graph seeding. Same tradeoff. |
| `CORTEX_SYNC_GRAPH_SEED_MAX_BULK_MEMORIES` | (no cap) | Cap on memories processed in sync graph seeding. |
| `CORTEX_SYNC_GRAPH_SEED_MAX_ENTITIES` | (no cap) | Cap on entities processed in sync graph seeding. |
When to disable: batch ingest (see the Batch profile) or any workload where writes outnumber reads and the recall layer can lag by 30-60 s without breaking the user experience.
Memory evolution (methylation + consolidation)
Background-scheduled jobs that prune low-utility memories and consolidate related ones.
| Env var | Default | What it controls |
|---|---|---|
| `CORTEX_METHYLATION_INACTIVITY_HOURS` | 168 (7 days) | Memories unaccessed for this long are eligible for pruning. |
| `CORTEX_METHYLATION_MIN_ACCESS` | 10 | Min access count before a memory becomes pruning-eligible. |
| `CORTEX_METHYLATION_MIN_UTIL_RATIO` | 0.30 | Min utility-to-access ratio. Below this = pruning candidate. |
| `CORTEX_METHYLATION_MIN_ACCESSES_FOR_RATIO` | 5 | Min accesses required before the ratio is even evaluated. |
| `CORTEX_CONSOLIDATION_MIN_MEMORIES` | 2 | Min memories about the same entity to trigger consolidation. |
| `CORTEX_CONSOLIDATION_MAX_BATCH` | 10 | Max consolidations per scheduler tick. |
| `CORTEX_CONSOLIDATION_MIN_AGE_HOURS` | 24 | Memories younger than this aren't consolidated (let them stabilize first). |
| `CORTEX_CONSOLIDATION_MAX_SURPRISE` | 0.5 | Don't consolidate memories above this surprise score (they're outliers worth keeping atomic). |
When to change:
- For dense / chatty agents (lots of low-value chitchat events), tighten methylation: `CORTEX_METHYLATION_INACTIVITY_HOURS=72` (3 days). Prunes more aggressively.
- For archival workloads, loosen: `CORTEX_METHYLATION_INACTIVITY_HOURS=720` (30 days) so quarterly-relevant memories don't get pruned during off-quarters.
- For benchmark runs, set `CORTEX_SCHEDULER_DISABLE=1` to disable the whole scheduler. Methylation and consolidation will not run, which keeps recall stable over long evals.
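The four consolidation knobs can be read as filters applied on each scheduler tick. The sketch below is one plausible reading and not the scheduler's actual logic: the function `consolidation_batch`, the dict-shaped memory records, and the use of hours as the time unit are assumptions for illustration.

```python
from collections import defaultdict


def consolidation_batch(memories, now_hours, min_memories=2, max_batch=10,
                        min_age_hours=24, max_surprise=0.5):
    """Pick groups of memory IDs to consolidate on one scheduler tick.

    memories: list of dicts with hypothetical keys "id", "entity",
              "created_hours", and "surprise".
    """
    groups = defaultdict(list)
    for memory in memories:
        old_enough = now_hours - memory["created_hours"] >= min_age_hours
        # Young or high-surprise memories are left atomic.
        if old_enough and memory["surprise"] <= max_surprise:
            groups[memory["entity"]].append(memory["id"])
    # Only entities with enough eligible memories are consolidated.
    ready = [ids for ids in groups.values() if len(ids) >= min_memories]
    return ready[:max_batch]


memories = [
    {"id": "a", "entity": "acme", "created_hours": 0, "surprise": 0.1},
    {"id": "b", "entity": "acme", "created_hours": 10, "surprise": 0.2},
    {"id": "c", "entity": "acme", "created_hours": 90, "surprise": 0.1},  # too young
    {"id": "d", "entity": "bob", "created_hours": 0, "surprise": 0.9},    # too surprising
    {"id": "e", "entity": "bob", "created_hours": 0, "surprise": 0.1},
]
print(consolidation_batch(memories, now_hours=100))  # [['a', 'b']]
```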
Compiled constants you can't (yet) override
These live in cortex-coordinator/src/recall.rs as const. They were tuned against LongMemEval-S + LoCoMo. If you need to override one for an unusual workload, file an issue — we may promote it to an env var.
| Constant | Value | What it controls |
|---|---|---|
| `RETRIEVAL_TOP_K` | 40 | Candidates from each retrieval channel for single-session queries. |
| `RETRIEVAL_TOP_K_MS` | 160 | Same, for multi-session queries (wider pool). |
| `RERANK_POOL` | 25 | Top-N passed to the reranker (single-session). |
| `RERANK_POOL_MS` | 40 | Same, multi-session. |
| `RRF_K` | 60.0 | Smoothing constant in the reciprocal-rank-fusion formula. |
| `GRAPH_WEIGHT` | 0.20 | Graph channel's weight in the fused score. |
| `GRAPH_RETRIEVAL_MAX_ENTITIES` | 48 | Cap on KG entities to walk from per query. |
| `GRAPH_RETRIEVAL_MAX_EDGES` | 512 | Cap on edges per BFS walk. |
| `GRAPH_RETRIEVAL_MAX_EPISODES` | 256 | Cap on episodes pulled. |
| `ENTITY_VECTOR_SEED_TOP_K` | 10 | Vectors per entity for entity-vector seeding. |
| `ENTITY_VECTOR_SEED_MIN_SIMILARITY` | 0.40 | Cosine floor for entity-vector seed candidates. |
| `ENTITY_VECTOR_SPAN_LIMIT` | 5 | Max entities considered per query. |
| `ENTITY_VECTOR_PER_SPAN_TOP_K` | 5 | Vectors per entity span. |
| `HYDE_PASSAGES_MS_DEFAULT` | 1 | Default HyDE passages for multi-session (matches env default). |
| `SESSION_BALANCE_ENABLED_TYPES` | `["multi-session"]` | Question types where per-session candidate balancing is on. |
Latency budget breakdown (default config)
Rough p50 budget for a single /v1/recall against a ~100K event scope:
| Stage | p50 latency | Optional? |
|---|---|---|
| Query embedding | ~50 ms | No |
| HyDE multiquery (if enabled) | ~250 ms | Yes — disable per-type |
| Multihop planner (if enabled) | ~400 ms | Yes — disable per-type |
| Vector + fulltext + KG retrieval (parallel) | ~80 ms | No |
| RRF fusion | ~2 ms | No |
| Cross-encoder rerank (if enabled) | ~150 ms | Yes |
| Response pack assembly | ~30 ms | No |
| Total (default config, multi-session query) | ~900 ms | |
| Total (voice profile, single-session query) | ~180 ms | |
The HyDE and multihop costs dominate for any query type where they run. Disabling them is the single highest-leverage latency win for voice/realtime.
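The budget table can be turned into a quick calculator for estimating what a given combination of optional stages costs. The stage values are the p50 figures from the table; the stage keys, the `budget_ms` helper, and the assumption that stages simply add up sequentially are illustrative.

```python
# p50 figures from the table above, in milliseconds.
STAGES_MS = {
    "query_embedding": 50,
    "hyde": 250,
    "multihop": 400,
    "retrieval": 80,
    "rrf": 2,
    "rerank": 150,
    "response_pack": 30,
}
OPTIONAL = {"hyde", "multihop", "rerank"}


def budget_ms(enabled_optional=frozenset()):
    """Sum mandatory stages plus whichever optional stages are enabled."""
    return sum(
        ms for stage, ms in STAGES_MS.items()
        if stage not in OPTIONAL or stage in enabled_optional
    )


print(budget_ms())                      # 162  (mandatory stages only)
print(budget_ms({"hyde", "multihop"}))  # 812
print(budget_ms(OPTIONAL))              # 962  (every stage on)
```

The straight sums land in the same range as the rounded totals in the table, and they make the point numerically: HyDE and multihop together account for 650 ms of the 962 ms all-stages figure.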
Next steps
- Profiles & Presets — see the Voice profile for a complete sub-100ms config
- Embeddings — vector dim and model choice that feeds this pipeline
- Benchmarking — how the LongMemEval-S number was produced with all of these defaults