Every recall-strategy knob — graph retrieval, HyDE, multihop, salience, reranker, and the constants that aren't (yet) configurable.

Recall Tuning

CortexDB's recall pipeline runs six retrieval channels in parallel and fuses them with reciprocal rank fusion. About a dozen env vars expose knobs into that pipeline; another two dozen are compiled constants that we tuned against LongMemEval-S and LoCoMo and didn't expose.

This page covers what's tunable and when to touch it. Most deployments shouldn't tune any of these: the defaults are the configuration behind the published 93.8% LongMemEval-S result.

The pipeline at a glance

Query
  │
  ├─► Query routing  ─────────────────────────► question_type
  │     ('single-session-user', 'multi-session', ...)
  │
  ├─► (optional) HyDE multiquery expansion ──► N hypothesized passages
  │                                              embedded → query vectors
  │
  ├─► (optional) Multihop query planner ─────► M follow-up queries
  │                                              generated by LLM
  │
  ├──► Run K parallel retrieval channels:
  │      • Vector (HNSW)
  │      • Fulltext (BM25 + WordNet)
  │      • Entity-name (exact / fuzzy)
  │      • Synonym
  │      • Graph BFS (KG edges around seed entities)
  │      • Temporal (recency window + decay)
  │
  ├──► Reciprocal rank fusion (RRF, k=60)  ──► fused candidate list
  │
  ├──► Cross-encoder rerank (optional, ~25-40 candidates) ──► reranked top-K
  │
  └──► Build response pack (citations, beliefs, episodes)

Each stage has knobs. Skipping a stage saves latency at the cost of recall accuracy. The published benchmark numbers depend on the default configuration, in which the stages marked optional above run only where the sections below say they are enabled.
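The fusion step can be illustrated with a minimal sketch of standard reciprocal rank fusion. This is not CortexDB's implementation: the function name rrf_fuse, the per-channel weights argument, and the sample channel outputs are assumptions made for illustration. Only the k=60 constant comes from this page.

```python
from collections import defaultdict

RRF_K = 60.0  # smoothing constant, as listed on this page


def rrf_fuse(channels, k=RRF_K, weights=None):
    """Fuse ranked candidate lists with reciprocal rank fusion.

    channels: dict mapping channel name -> list of candidate ids, best first.
    weights:  optional per-channel multipliers (hypothetical; default 1.0).
    """
    weights = weights or {}
    scores = defaultdict(float)
    for name, ranked in channels.items():
        w = weights.get(name, 1.0)
        for rank, cand in enumerate(ranked, start=1):
            # Each channel contributes 1 / (k + rank) for every candidate it returns.
            scores[cand] += w / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Made-up channel outputs: "m2" appears in all three lists, so it wins.
channels = {
    "vector": ["m1", "m2", "m3"],
    "fulltext": ["m2", "m4"],
    "graph": ["m3", "m2"],
}
fused = rrf_fuse(channels)
print(fused)
```

A candidate that several channels agree on outranks one that a single channel ranks first, which is why dropping a channel can change the fused order even for candidates that channel never returned.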

Graph retrieval

The KG channel walks edges around entities mentioned in the query.

| Env var | Default | What it controls |
|---|---|---|
| CORTEX_GRAPH_RETRIEVAL_DISABLE | (unset) | Set =1 to skip the graph channel entirely. Costs roughly 3-5 pp on multi-session in our A/B. |
| CORTEX_GRAPH_RETRIEVAL_TOP_K | 40 (single-session) / 120 (multi-session) | Number of graph-derived candidates passed to fusion. |

When to change:

  • Disable graph retrieval (CORTEX_GRAPH_RETRIEVAL_DISABLE=1) if you're on the voice/realtime hot path and willing to trade roughly 3 pp of accuracy for ~150 ms of latency.
  • Bump TOP_K to 80 (single) / 240 (multi) for query types where you expect entity-rich answers (e.g. "what did X say about Y at Z time"). Diminishing returns past these values.

Constants you cannot currently override (in cortex-coordinator/src/recall.rs):

  • GRAPH_RETRIEVAL_MAX_ENTITIES = 48 (cap on KG entities to walk from)
  • GRAPH_RETRIEVAL_MAX_EDGES = 512 (cap on edges to traverse per walk)
  • GRAPH_RETRIEVAL_MAX_EPISODES = 256 (cap on episodes pulled by walk)
  • GRAPH_WEIGHT = 0.20 (graph channel's contribution to fusion)
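To show how the three caps above could interact, here is a minimal sketch of a bounded breadth-first walk. It is an illustration under assumptions, not the code in recall.rs: the function bounded_graph_walk, the dict-based graph, and the sample data are hypothetical, and the exact order in which the real walk applies its caps is not documented here.

```python
from collections import deque

# Values as listed on this page.
GRAPH_RETRIEVAL_MAX_ENTITIES = 48
GRAPH_RETRIEVAL_MAX_EDGES = 512
GRAPH_RETRIEVAL_MAX_EPISODES = 256


def bounded_graph_walk(seeds, edges, episodes_of,
                       max_entities=GRAPH_RETRIEVAL_MAX_ENTITIES,
                       max_edges=GRAPH_RETRIEVAL_MAX_EDGES,
                       max_episodes=GRAPH_RETRIEVAL_MAX_EPISODES):
    """BFS from seed entities, collecting episode ids until a cap is hit.

    edges:       dict entity -> list of neighbouring entities (hypothetical shape)
    episodes_of: dict entity -> list of episode ids attached to that entity
    """
    queue = deque(seeds[:max_entities])  # cap on entities to walk from
    seen = set(queue)
    episodes, episode_set = [], set()
    traversed = 0
    while queue:
        entity = queue.popleft()
        for ep in episodes_of.get(entity, []):
            if ep not in episode_set:
                episode_set.add(ep)
                episodes.append(ep)
                if len(episodes) >= max_episodes:  # cap on episodes pulled
                    return episodes
        for neighbour in edges.get(entity, []):
            if traversed >= max_edges:  # cap on edges traversed
                break
            traversed += 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return episodes


edges = {"alice": ["bob", "acme"], "bob": ["acme"], "acme": []}
episodes_of = {"alice": ["e1"], "bob": ["e2", "e1"], "acme": ["e3"]}
print(bounded_graph_walk(["alice"], edges, episodes_of))
```

In this sketch, exhausting the edge budget stops further expansion but still drains entities already queued, so a tight edge cap shrinks the result rather than emptying it.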

HyDE multiquery expansion

HyDE (Hypothetical Document Embeddings) asks an LLM to write a hypothetical passage that would answer the query, then embeds that passage instead of (or in addition to) the literal query. This captures meaning even when the query and the stored memory use very different words.

| Env var | Default | What it controls |
|---|---|---|
| CORTEX_HYDE_PASSAGES_MS | 1 | Number of hypothetical passages to generate for multi-session queries. Set 0 to disable HyDE for multi-session. |
| CORTEX_HYDE_MULTIQUERY_DISABLED_TYPES | multi-session,open-domain | Comma-separated question types where HyDE is off. |

When to change:

  • Set CORTEX_HYDE_MULTIQUERY_DISABLED_TYPES=single-session-user,single-session-assistant,multi-session,open-domain to disable HyDE entirely. This saves one LLM round-trip (~150-400 ms) and costs ~1-2 pp on the queries where stored phrasing doesn't match query phrasing.
  • Set CORTEX_HYDE_PASSAGES_MS=3 to generate three hypothetical passages with temperatures [0.3, 0.6, 0.9], giving wider semantic coverage. This triples the HyDE LLM cost.

Compiled constants:

  • HYDE_MS_TEMPERATURES = [0.3, 0.6, 0.9] — the temperature schedule for multi-passage HyDE
  • DEFAULT_HYDE_MULTIQUERY_DISABLED_TYPES = ["multi-session", "open-domain"] — overridable via the env var above
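The interaction between the two HyDE env vars and the temperature schedule can be sketched as follows. This is a hedged illustration, not the coordinator's logic: the helper names are hypothetical, and three behaviours are assumptions rather than documented facts (passage i uses the i-th temperature in the schedule, non-multi-session types generate one passage, and an empty env var means no type is disabled).

```python
import os

# Values as listed on this page.
HYDE_MS_TEMPERATURES = [0.3, 0.6, 0.9]
DEFAULT_HYDE_MULTIQUERY_DISABLED_TYPES = ["multi-session", "open-domain"]


def hyde_disabled_types(env=os.environ):
    """Parse the comma-separated list of question types where HyDE is off."""
    raw = env.get("CORTEX_HYDE_MULTIQUERY_DISABLED_TYPES")
    if raw is None:
        return set(DEFAULT_HYDE_MULTIQUERY_DISABLED_TYPES)
    # Assumption: an empty value disables HyDE for no type at all.
    return {t.strip() for t in raw.split(",") if t.strip()}


def hyde_temperatures(question_type, env=os.environ):
    """Return one sampling temperature per HyDE passage, or [] if HyDE is off."""
    if question_type in hyde_disabled_types(env):
        return []
    if question_type == "multi-session":
        n = int(env.get("CORTEX_HYDE_PASSAGES_MS", "1"))
    else:
        n = 1  # assumption: other types always generate a single passage
    # Assumption: the first n entries of the schedule are used, in order.
    return HYDE_MS_TEMPERATURES[:max(n, 0)]


# With defaults, multi-session gets no HyDE passages at all.
print(hyde_temperatures("multi-session", {}))
# Re-enabling multi-session and asking for three passages uses the full schedule.
print(hyde_temperatures("multi-session", {
    "CORTEX_HYDE_MULTIQUERY_DISABLED_TYPES": "open-domain",
    "CORTEX_HYDE_PASSAGES_MS": "3",
}))
```

Note that, read this way, CORTEX_HYDE_PASSAGES_MS only has an effect once multi-session is removed from the disabled-types list.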

Multihop query planner

For complex queries, an LLM plans 2-N follow-up queries that explore related angles, then runs all of them through retrieval.

| Env var | Default | What it controls |
|---|---|---|
| CORTEX_MULTIHOP_QUERY_PLANNER_DISABLE | (unset) | Set =1 to disable multihop entirely. |
| CORTEX_MULTIHOP_QUERY_PLANNER_TYPES | multi-session,open-domain | Comma-separated types where multihop runs. |
| CORTEX_MULTIHOP_QUERY_COUNT | 4 | Number of follow-up queries the planner generates. |
| CORTEX_MULTIHOP_MAX_QUERY_FANOUT | 5 | Cap on simultaneously executing planned queries. |
| CORTEX_MULTIHOP_COVERAGE_ORDER_DISABLE | (unset) | Set =1 to use original LLM-emitted order instead of coverage-optimal reordering. |

When to change:

  • Set CORTEX_MULTIHOP_QUERY_PLANNER_TYPES= (empty) for latency-sensitive deployments. This saves 1-3 LLM round-trips per recall and loses ~1-3 pp on complex multi-session queries.
  • Lower CORTEX_MULTIHOP_QUERY_COUNT to 2 for a middle ground: it keeps the planner but reduces its fanout.
  • Raise CORTEX_MULTIHOP_QUERY_COUNT to 6 and CORTEX_MULTIHOP_MAX_QUERY_FANOUT to 8 for offline question-answering where wall-clock time doesn't matter.
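The difference between the query count and the fanout cap is easiest to see in code. The sketch below is an assumption-laden illustration using Python's asyncio.Semaphore, not the planner's actual executor: run_planned_queries, fake_retrieve, and the sample queries are hypothetical names.

```python
import asyncio


async def run_planned_queries(planned, retrieve, query_count=4, max_fanout=5):
    """Run at most query_count planned queries, max_fanout of them at a time."""
    limiter = asyncio.Semaphore(max_fanout)

    async def run_one(query):
        async with limiter:  # blocks while max_fanout queries are in flight
            return await retrieve(query)

    # query_count truncates the plan; max_fanout only limits concurrency.
    return await asyncio.gather(*(run_one(q) for q in planned[:query_count]))


async def _demo():
    in_flight = peak = 0

    async def fake_retrieve(query):
        # Stand-in for a retrieval call; records peak concurrency.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return f"results for {query}"

    planned = [f"q{i}" for i in range(8)]
    results = await run_planned_queries(planned, fake_retrieve,
                                        query_count=6, max_fanout=2)
    return results, peak


results, peak = asyncio.run(_demo())
print(len(results), peak)
```

Under this reading, raising the count without raising the fanout cap adds work that runs in extra waves, which increases wall-clock time roughly in proportion.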

Salience prior

Salience is a per-memory importance score, updated by access patterns over time. The recall ranker can incorporate it as a prior — recently-accessed memories get a small boost.

| Env var | Default | What it controls |
|---|---|---|
| CORTEX_SALIENCE_WEIGHT | 0.10 | Weight of salience score in final ranking. Range [0.0, 1.0]. |
| CORTEX_AUTO_ROUTE | (unset) | Set =1 to let the router pick per-query-type salience weights automatically. |

When to change:

  • For "what's recently relevant" agents (companion bots, daily assistant), bump CORTEX_SALIENCE_WEIGHT=0.20 to lean harder on recency.
  • For historical-archive workloads (legal discovery, CRM search), set CORTEX_SALIENCE_WEIGHT=0.0 — the right answer might be from years ago, not last week.
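To make the effect of the weight concrete, here is a small sketch that blends a relevance score with a salience score. The blend formula is an assumption: this page only says salience is incorporated as a prior with a weight in [0.0, 1.0], so the convex combination below, the function names, and the sample scores are all hypothetical.

```python
def blended_score(relevance, salience, weight=0.10):
    """Convex blend of relevance and salience (assumed formula, not CortexDB's)."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("salience weight must be in [0.0, 1.0]")
    return (1.0 - weight) * relevance + weight * salience


def rank(candidates, weight=0.10):
    """candidates: list of (memory_id, relevance, salience); best first."""
    return [mid for mid, rel, sal in
            sorted(candidates,
                   key=lambda c: blended_score(c[1], c[2], weight),
                   reverse=True)]


# Made-up scores: an old but highly relevant memory vs a recent, less relevant one.
candidates = [
    ("old_contract", 0.80, 0.10),
    ("yesterday_chat", 0.70, 0.90),
]
print(rank(candidates, weight=0.0))   # relevance only
print(rank(candidates, weight=0.10))  # default weight
print(rank(candidates, weight=0.20))  # leaning on recency
```

With these sample numbers the older memory stays on top at the default weight and is overtaken at 0.20, which is the kind of flip to look for when evaluating a change on your own data.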

Entity-vector seeding

Hybrid signal: take the query's entity mentions, look up their canonical vectors, use those as additional query vectors.

| Env var | Default | What it controls |
|---|---|---|
| CORTEX_ENTITY_VECTOR_SEED_ENABLE | (unset) | Set =1 to enable. Off by default. |
| CORTEX_VECTOR_TENANT_OVERFETCH | 1.25 | Tenant-aware overfetch multiplier for the vector channel: fetch 25% more candidates so post-filter on scope still hits the target K. |

Compiled constants for entity-vector seeding:

  • ENTITY_VECTOR_SEED_TOP_K = 10 (vectors per entity)
  • ENTITY_VECTOR_SEED_MIN_SIMILARITY = 0.40 (cosine floor)
  • ENTITY_VECTOR_SPAN_LIMIT = 5 (entities per query)
  • ENTITY_VECTOR_PER_SPAN_TOP_K = 5 (vectors per entity span)

When to enable: queries with named entities that don't match stored phrasing exactly ("the customer in Boston" vs stored "Acme HQ in MA"). Adds one HNSW lookup per entity span.
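The overfetch multiplier and the cosine floor are both simple arithmetic, sketched below. The helper names and the rounding rule (round up) are assumptions. The numeric defaults are the ones listed in this section.

```python
import math

# Values as listed on this page.
ENTITY_VECTOR_SEED_MIN_SIMILARITY = 0.40
ENTITY_VECTOR_PER_SPAN_TOP_K = 5


def overfetch_k(target_k, multiplier=1.25):
    """How many vector hits to request so that scope filtering can still
    leave target_k survivors. Rounding up is an assumption."""
    return math.ceil(target_k * multiplier)


def select_seed_vectors(scored,
                        min_similarity=ENTITY_VECTOR_SEED_MIN_SIMILARITY,
                        per_span_top_k=ENTITY_VECTOR_PER_SPAN_TOP_K):
    """scored: list of (vector_id, cosine_similarity) for one entity span.
    Drops hits below the cosine floor, then keeps the best per_span_top_k."""
    kept = [(vid, sim) for vid, sim in scored if sim >= min_similarity]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:per_span_top_k]


print(overfetch_k(40))  # 40 * 1.25
print(select_seed_vectors([("a", 0.90), ("b", 0.39), ("c", 0.40)]))
```

A multiplier of 1.25 only helps when fewer than about one in five hits is filtered out by scope; in a heavily multi-tenant index a larger value may be needed to reach the target K.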

Reranker

A cross-encoder model that takes the top ~25-40 fused candidates and re-scores each (query, candidate) pair as a unit, producing a final ranking.

| Env var | Default | What it controls |
|---|---|---|
| CORTEX_RERANKER_PROVIDER | (empty = disabled) | cohere or local. Empty disables the reranker. |
| CORTEX_RERANKER_MODEL | rerank-v3.5 (Cohere) | Model name. |
| CORTEX_RERANKER_MODEL_PATH | (none) | Path to a local ONNX model when PROVIDER=local. |

When to change:

  • Enable CORTEX_RERANKER_PROVIDER=cohere + COHERE_API_KEY=... for ~+2 pp on noisy recall sets (mixed-corpus, conversational queries). Costs ~$0.001 per recall. Adds ~80-200 ms.
  • Use CORTEX_RERANKER_PROVIDER=local + CORTEX_RERANKER_MODEL_PATH=/models/bge-reranker.onnx for self-contained deployments. Slower (~200-500 ms on CPU) but free.
  • Leave disabled for voice / sub-100ms paths.
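The three reranker variables depend on each other, which a deployment script can check before starting the service. The following is a hypothetical pre-flight check, not part of CortexDB: reranker_config and its error messages are invented for illustration, and how CortexDB itself reacts to a missing key or path is not documented here.

```python
import os


def reranker_config(env=os.environ):
    """Return None when the reranker is disabled, else a validated settings dict."""
    provider = env.get("CORTEX_RERANKER_PROVIDER", "").strip().lower()
    if not provider:
        return None  # empty provider means the reranker is disabled
    if provider == "cohere":
        if not env.get("COHERE_API_KEY"):
            raise ValueError("provider=cohere requires COHERE_API_KEY")
        return {
            "provider": "cohere",
            "model": env.get("CORTEX_RERANKER_MODEL", "rerank-v3.5"),
        }
    if provider == "local":
        path = env.get("CORTEX_RERANKER_MODEL_PATH")
        if not path:
            raise ValueError("provider=local requires CORTEX_RERANKER_MODEL_PATH")
        return {"provider": "local", "model_path": path}
    raise ValueError(f"unknown reranker provider: {provider!r}")


print(reranker_config({}))
print(reranker_config({
    "CORTEX_RERANKER_PROVIDER": "local",
    "CORTEX_RERANKER_MODEL_PATH": "/models/bge-reranker.onnx",
}))
```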

Question-type executor switches

The /v1/answer endpoint routes queries through type-specific executors. These flags disable specific paths for A/B testing or to work around bugs.

| Env var | Default | What it controls |
|---|---|---|
| CORTEX_MS_EXECUTOR_DISABLE | (unset) | Disable the multi-session-optimized executor. Falls back to single-session path. |
| CORTEX_MS_EVIDENCE_PACK_DISABLE | (unset) | Skip evidence-pack assembly for multi-session. Smaller answers, less context. |
| CORTEX_MS_COUNT_RELEVANCE_ENABLE | (unset) | Enable count-based relevance scoring for multi-session (experimental). |
| CORTEX_MS_STAGE_C_USE_VERIFIER | (unset) | Use the verifier model in stage-C answer formatting for multi-session. |
| CORTEX_MS_RETRY_ON_ABSTAIN | (unset) | Retry recall when multi-session executor abstains. |
| CORTEX_DIRECT_LOOKUP_RETRY_ON_ABSTAIN | (unset) | Retry direct-lookup queries on abstain. |
| CORTEX_TEMPORAL_EXTRACT_DISABLE | (unset) | Disable temporal-phrase extraction (relative dates, durations). |
| CORTEX_COMPOSITIONAL_ENABLE | (unset) | Enable typed-arithmetic compositional answers (experimental). |
| CORTEX_ENUMERATE_COUNT_ENABLE | (unset) | Enable count-enumeration answers ("how many X are there"). |
| CORTEX_SESSION_LEVEL_EXPANSION_ENABLE | (unset) | Expand recall context at session level rather than per-message. |
| CORTEX_ANSWER_SHAPE_EXECUTOR_USE_VERIFIER | (unset) | Use verifier in shape-aware executor. |
| CORTEX_FACT_EVENT_PROMOTION_ENABLE | (unset) | Promote facts to event-level relevance. |
| CORTEX_FACT_VALIDITY_FILTER | (unset) | Filter recalled facts by validity windows (bi-temporal). |

Default operator stance: don't touch any of these. They exist for our benchmark tooling to validate routing decisions. The compiled defaults are the configuration that produced 93.8% on LongMemEval-S.

If you're debugging a specific recall failure ("multi-session is hallucinating") and have an A/B harness, flipping CORTEX_MS_EVIDENCE_PACK_DISABLE=1 is a reasonable diagnostic to confirm the evidence pack is the culprit.

Synchronous-write kill switches

These don't affect recall directly — they affect the write path, which in turn affects when recall has the data available.

| Env var | Default | What it controls |
|---|---|---|
| CORTEX_SYNC_FACT_EXTRACT_DISABLE | (unset) | Skip synchronous fact extraction on write. All extraction goes async. Faster writes; recall lags. |
| CORTEX_SYNC_FACT_EXTRACT_MAX_SESSIONS | (no cap) | Cap on sessions processed sync per batch. |
| CORTEX_SYNC_GRAPH_SEED_DISABLE | (unset) | Skip synchronous graph seeding. Same tradeoff. |
| CORTEX_SYNC_GRAPH_SEED_MAX_BULK_MEMORIES | (no cap) | Cap on memories processed in sync graph seeding. |
| CORTEX_SYNC_GRAPH_SEED_MAX_ENTITIES | (no cap) | Cap on entities processed in sync graph seeding. |

When to disable: batch ingest (see the Batch profile) or any workload where writes outnumber reads and the recall layer can lag by 30-60 s without breaking the user experience.

Memory evolution (methylation + consolidation)

Background-scheduled jobs that prune low-utility memories and consolidate related ones.

| Env var | Default | What it controls |
|---|---|---|
| CORTEX_METHYLATION_INACTIVITY_HOURS | 168 (7 days) | Memories unaccessed for this long are eligible for pruning. |
| CORTEX_METHYLATION_MIN_ACCESS | 10 | Min access count before a memory becomes pruning-eligible. |
| CORTEX_METHYLATION_MIN_UTIL_RATIO | 0.30 | Min utility-to-access ratio. Below this = pruning candidate. |
| CORTEX_METHYLATION_MIN_ACCESSES_FOR_RATIO | 5 | Min accesses required before the ratio is even evaluated. |
| CORTEX_CONSOLIDATION_MIN_MEMORIES | 2 | Min memories about the same entity to trigger consolidation. |
| CORTEX_CONSOLIDATION_MAX_BATCH | 10 | Max consolidations per scheduler tick. |
| CORTEX_CONSOLIDATION_MIN_AGE_HOURS | 24 | Memories younger than this aren't consolidated (let them stabilize first). |
| CORTEX_CONSOLIDATION_MAX_SURPRISE | 0.5 | Don't consolidate memories above this surprise score (they're outliers worth keeping atomic). |

When to change:

  • For dense / chatty agents (lots of low-value chitchat events), tighten methylation: CORTEX_METHYLATION_INACTIVITY_HOURS=72 (3 days). Prunes more aggressively.
  • For archival workloads, loosen: CORTEX_METHYLATION_INACTIVITY_HOURS=720 (30 days) so quarterly-relevant memories don't get pruned during off-quarters.
  • For benchmark runs, set CORTEX_SCHEDULER_DISABLE=1 to disable the whole scheduler — methylation and consolidation will not run, ensuring stable recall over long evals.
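One plausible reading of how the four consolidation settings combine is sketched below. How the scheduler actually applies them is not documented on this page, so treat the function consolidation_batch, the memory record shape, and the order of the checks as assumptions.

```python
from collections import defaultdict


def consolidation_batch(memories, min_memories=2, max_batch=10,
                        min_age_hours=24, max_surprise=0.5):
    """Pick entity groups to consolidate in one scheduler tick.

    memories: list of dicts with 'id', 'entity', 'age_hours', 'surprise'
              (hypothetical record shape).
    """
    by_entity = defaultdict(list)
    for m in memories:
        # Too-young memories wait; high-surprise outliers stay atomic.
        if m["age_hours"] >= min_age_hours and m["surprise"] <= max_surprise:
            by_entity[m["entity"]].append(m["id"])
    # An entity qualifies once enough eligible memories refer to it.
    groups = [ids for ids in by_entity.values() if len(ids) >= min_memories]
    return groups[:max_batch]  # cap on consolidations per tick


memories = [
    {"id": "m1", "entity": "acme", "age_hours": 48, "surprise": 0.1},
    {"id": "m2", "entity": "acme", "age_hours": 30, "surprise": 0.2},
    {"id": "m3", "entity": "acme", "age_hours": 2, "surprise": 0.1},   # too young
    {"id": "m4", "entity": "bob", "age_hours": 90, "surprise": 0.9},   # outlier
    {"id": "m5", "entity": "bob", "age_hours": 90, "surprise": 0.3},
]
print(consolidation_batch(memories))
```

In this example only the two older acme memories are grouped: the third is below the age floor, and bob has a single eligible memory once the high-surprise one is excluded.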

Compiled constants you can't (yet) override

These live in cortex-coordinator/src/recall.rs as const. They were tuned against LongMemEval-S + LoCoMo. If you need to override one for an unusual workload, file an issue — we may promote it to an env var.

| Constant | Value | What it controls |
|---|---|---|
| RETRIEVAL_TOP_K | 40 | Candidates from each retrieval channel for single-session queries. |
| RETRIEVAL_TOP_K_MS | 160 | Same, for multi-session queries (wider pool). |
| RERANK_POOL | 25 | Top-N passed to the reranker (single-session). |
| RERANK_POOL_MS | 40 | Same, multi-session. |
| RRF_K | 60.0 | Smoothing constant in the reciprocal-rank-fusion formula. |
| GRAPH_WEIGHT | 0.20 | Graph channel's weight in the fused score. |
| GRAPH_RETRIEVAL_MAX_ENTITIES | 48 | Cap on KG entities to walk from per query. |
| GRAPH_RETRIEVAL_MAX_EDGES | 512 | Cap on edges per BFS walk. |
| GRAPH_RETRIEVAL_MAX_EPISODES | 256 | Cap on episodes pulled. |
| ENTITY_VECTOR_SEED_TOP_K | 10 | Vectors per entity for entity-vector seeding. |
| ENTITY_VECTOR_SEED_MIN_SIMILARITY | 0.40 | Cosine floor for entity-vector seed candidates. |
| ENTITY_VECTOR_SPAN_LIMIT | 5 | Max entities considered per query. |
| ENTITY_VECTOR_PER_SPAN_TOP_K | 5 | Vectors per entity span. |
| HYDE_PASSAGES_MS_DEFAULT | 1 | Default HyDE passages for multi-session (matches env default). |
| SESSION_BALANCE_ENABLED_TYPES | ["multi-session"] | Question types where per-session candidate balancing is on. |

Latency budget breakdown (default config)

Rough p50 budget for a single /v1/recall against a ~100K event scope:

| Stage | p50 latency | Optional? |
|---|---|---|
| Query embedding | ~50 ms | No |
| HyDE multiquery (if enabled) | ~250 ms | Yes (disable per-type) |
| Multihop planner (if enabled) | ~400 ms | Yes (disable per-type) |
| Vector + fulltext + KG retrieval (parallel) | ~80 ms | No |
| RRF fusion | ~2 ms | No |
| Cross-encoder rerank (if enabled) | ~150 ms | Yes |
| Response pack assembly | ~30 ms | No |
| Total (default config, multi-session query) | ~900 ms | |
| Total (voice profile, single-session query) | ~180 ms | |

The HyDE and multihop costs dominate for any query type where they run. Disabling them is the single highest-leverage latency win for voice/realtime.
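For a quick what-if on your own configuration, the per-stage figures can be summed directly. This is back-of-the-envelope arithmetic using the rough p50 values from this section, with the simplifying assumption that the stages run strictly one after another. The function name p50_budget is hypothetical, and per-stage p50s do not add up exactly to an end-to-end p50, so expect the results to differ somewhat from the rounded totals quoted above.

```python
# (stage, rough p50 in ms, optional?) taken from the latency table in this section.
STAGES = [
    ("Query embedding", 50, False),
    ("HyDE multiquery", 250, True),
    ("Multihop planner", 400, True),
    ("Vector + fulltext + KG retrieval", 80, False),
    ("RRF fusion", 2, False),
    ("Cross-encoder rerank", 150, True),
    ("Response pack assembly", 30, False),
]


def p50_budget(enabled_optional=()):
    """Sum the mandatory stages plus whichever optional stages are enabled."""
    return sum(ms for name, ms, optional in STAGES
               if not optional or name in enabled_optional)


required_only = p50_budget()
with_llm_stages = p50_budget(["HyDE multiquery", "Multihop planner"])
everything_on = p50_budget([name for name, _, optional in STAGES if optional])
print(required_only, with_llm_stages, everything_on)
```

The mandatory stages alone come to roughly 160 ms in this sum, so the two LLM-backed stages account for most of the difference between a voice-style configuration and one with every stage enabled.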

Next steps

  • Profiles & Presets: see the Voice profile for a complete low-latency config
  • Embeddings — vector dim and model choice that feeds this pipeline
  • Benchmarking — how the LongMemEval-S number was produced with all of these defaults