All 18 LLM-related env vars — entity extraction, async enrichment, answer generation, verifier — plus how the fallback chain works.
LLM & Answer Generation
CortexDB calls LLMs in four distinct places, each independently configurable. Mixing them is the norm — for example: GPT-4o-mini for cheap entity extraction on every write, Claude Opus for the actual answer the user sees, an optional GPT-4.1 verifier on top.
The four LLM call sites
| Call site | When | Default model | Why this one |
|---|---|---|---|
| Entity extraction | On every write (sync) and during async fact emission | gpt-4o-mini | Cheap, fast, structured-output-reliable |
| Async enrichment | Background job, only if explicitly enabled | (none — disabled) | Heavyweight KG enrichment; opt-in |
| Answer generation | /v1/answer endpoint | claude-opus-4-6 | Highest score on multi-session in our A/B |
| Verifier | Optional, post-answer cross-check | gpt-4.1 | Different family from the answer model → catches model-specific failure modes |
Each of these has its own model, URL, API key, and disable switch — so you can route them to entirely different providers if you need to.
1. Entity extraction LLM
Runs synchronously on /v1/experience to build the knowledge-graph seed. If disabled, fact/belief layers degrade to text-only matching.
| Env var | Default | What it controls |
|---|---|---|
| CORTEX_LLM_DISABLE | (unset) | Set to 1 / true to skip extraction entirely. Faster writes; weaker recall. |
| CORTEX_LLM_URL | https://api.openai.com/v1 | Endpoint. Use any OpenAI-compatible API. |
| CORTEX_LLM_MODEL | gpt-4o-mini | Model name passed to the provider. |
| CORTEX_ENTITY_API_KEY | (falls back to OPENAI_API_KEY) | Separate key, useful if you want extraction billed to a different budget. |
The extraction job is wrapped in a fallback chain (see LlmConfig in the fallback chain section below), so a failure of the primary provider fails over automatically to your declared fallback model.
When to change:
- CORTEX_LLM_DISABLE=1 for benchmarks where you want pure-text recall behavior.
- CORTEX_LLM_MODEL=gpt-4o (not -mini) if you're seeing systematic extraction errors on your domain text. ~10× cost; usually unnecessary.
- CORTEX_LLM_URL=http://localhost:11434/v1 + CORTEX_LLM_MODEL=qwen2.5:7b to keep extraction local via Ollama.
2. Async enrichment LLM (optional)
A heavyweight, slower pipeline that does deeper KG enrichment (multi-hop fact linking, entity disambiguation across sessions). Off by default — set the model env var to turn it on.
| Env var | Default | What it controls |
|---|---|---|
| CORTEX_ENRICHMENT_MODEL | (empty = disabled) | Set to a model name to enable. |
| CORTEX_ENRICHMENT_URL | (falls back to CORTEX_LLM_URL) | Endpoint. |
| CORTEX_ENRICHMENT_API_KEY | (falls back to CORTEX_ENTITY_API_KEY) | Separate key. |
When to enable: if you have multi-session workloads where entity resolution across conversations matters more than ingest cost (CRM, customer history). The async pipeline works through its backlog at the scheduler's enrichment_drain_interval_secs interval (default 30 s).
# Enable enrichment with a stronger model than the sync extractor
export CORTEX_ENRICHMENT_MODEL=gpt-4o
export CORTEX_ENRICHMENT_URL=https://api.openai.com/v1
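The fallbacks in the enrichment table compose into a chain: the enrichment key falls back to the extraction key, which in turn falls back to OPENAI_API_KEY. The Python sketch below illustrates that lookup order under the documented defaults. It is not CortexDB's implementation; `first_set` and `resolve_enrichment` are hypothetical names introduced for illustration.

```python
import os


def first_set(env, *names, default=None):
    """Return the first non-empty value among the named variables."""
    for name in names:
        value = env.get(name, "")
        if value:
            return value
    return default


def resolve_enrichment(env=None):
    """Return enrichment settings, or None when enrichment is disabled."""
    env = os.environ if env is None else env
    model = env.get("CORTEX_ENRICHMENT_MODEL", "")
    if not model:
        return None  # empty or unset model: the pipeline stays off
    return {
        "model": model,
        # Enrichment URL, then the extraction URL, then the extraction default.
        "url": first_set(env, "CORTEX_ENRICHMENT_URL", "CORTEX_LLM_URL",
                         default="https://api.openai.com/v1"),
        # Enrichment key, then the extraction key, then the shared OpenAI key.
        "api_key": first_set(env, "CORTEX_ENRICHMENT_API_KEY",
                             "CORTEX_ENTITY_API_KEY", "OPENAI_API_KEY"),
    }
```

Calling `resolve_enrichment({})` returns `None`, which mirrors the "empty = disabled" default in the table.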
3. Answer generation LLM
Used by POST /v1/answer to turn a recall pack into a cited natural-language answer.
| Env var | Default | What it controls |
|---|---|---|
| CORTEX_ANSWER_PROVIDER | anthropic | anthropic, openai, google, bedrock, ollama |
| CORTEX_ANSWER_MODEL | claude-opus-4-6 | Model name. Provider-specific. |
| CORTEX_ANSWER_URL | (provider default) | Override endpoint (proxies, gateways). |
| CORTEX_ANSWER_API_KEY | (falls back to provider-specific env) | ANTHROPIC_API_KEY for Anthropic, OPENAI_API_KEY for OpenAI, etc. |
| CORTEX_ANSWER_MAX_TOKENS | 1500 | Generation budget. Higher = longer answers, more cost. |
| ANTHROPIC_API_KEY | (none) | Fallback when provider is anthropic and CORTEX_ANSWER_API_KEY unset. |
The 93.8% LongMemEval-S number was produced with claude-opus-4-6. That benchmark is the strongest evidence we have on model choice — switching answer models is the highest-leverage change you can make.
| Model | Provider | Per-query cost | LongMemEval-S delta vs Opus 4.6 |
|---|---|---|---|
| claude-opus-4-6 | Anthropic | ~$0.03 | 0 (baseline, 93.8%) |
| claude-sonnet-4-6 | Anthropic | ~$0.006 | ~-2 pp |
| gpt-4o | OpenAI | ~$0.015 | ~-3 pp |
| gpt-4o-mini | OpenAI | ~$0.001 | ~-8 pp |
| gemini-2.0-flash | Google | ~$0.002 | ~-5 pp (estimated; not formally A/B'd) |
The deltas are our internal A/B numbers on a 150-question slice; treat them as directional, not definitive. If Opus is too expensive, Sonnet is the natural step down: about one fifth of the per-query cost for roughly 2 pp lower accuracy.
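One way to read the model table is as a lookup: given a per-query budget, pick the model with the smallest accuracy loss that fits. The Python sketch below does that with the approximate figures from the table. `MODELS` and `best_under_budget` are hypothetical names, and the result is only as reliable as the directional numbers it is built on.

```python
# (model, per-query cost in USD, accuracy delta in pp vs. Opus 4.6),
# copied from the table above; all figures are approximate.
MODELS = [
    ("claude-opus-4-6", 0.03, 0.0),
    ("claude-sonnet-4-6", 0.006, -2.0),
    ("gpt-4o", 0.015, -3.0),
    ("gpt-4o-mini", 0.001, -8.0),
    ("gemini-2.0-flash", 0.002, -5.0),  # estimated, not formally A/B'd
]


def best_under_budget(max_cost_per_query):
    """Return the model with the smallest accuracy loss whose cost fits the budget."""
    affordable = [m for m in MODELS if m[1] <= max_cost_per_query]
    if not affordable:
        return None
    return max(affordable, key=lambda m: m[2])[0]
```

With a budget of $0.01 per query this selects claude-sonnet-4-6. Note that gpt-4o is never selected: by these figures it is both more expensive and less accurate than Sonnet.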
Provider-specific notes
Anthropic: Defaults to claude-opus-4-6. Set CORTEX_ANSWER_URL to a Bedrock or Vertex AI endpoint to route through a cloud provider's hosted Claude.
OpenAI: Set CORTEX_ANSWER_PROVIDER=openai, CORTEX_ANSWER_MODEL=gpt-4o. The default CORTEX_ANSWER_URL becomes https://api.openai.com/v1.
Bedrock / Google: Set CORTEX_ANSWER_URL to the regional endpoint. Authentication uses the provider's native env vars (AWS_ACCESS_KEY_ID etc. for Bedrock).
Ollama: Set CORTEX_ANSWER_PROVIDER=ollama, CORTEX_ANSWER_URL=http://localhost:11434, CORTEX_ANSWER_MODEL=qwen2.5:14b. Local inference; ~3-15 s / answer depending on hardware and model size.
4. Verifier LLM (optional, off by default)
A second LLM call that critiques the answer for hallucination against the citation pack. Catches "model said X but the cited source said Y" failures. Doubles latency and cost when enabled.
| Env var | Default | What it controls |
|---|---|---|
| CORTEX_ANSWER_USE_VERIFIER | (unset) | Legacy enable flag. Overridden by CORTEX_ANSWER_VERIFIER_TYPES if both are set. |
| CORTEX_ANSWER_VERIFIER_TYPES | (empty) | Comma-separated question types where the verifier runs. |
| CORTEX_VERIFIER_MODEL | gpt-4.1 | Verifier model. Use a different family from the answer model. |
| CORTEX_VERIFIER_URL | (falls back to CORTEX_ANSWER_URL) | Endpoint. |
| CORTEX_VERIFIER_API_KEY | (falls back to CORTEX_ANSWER_API_KEY) | Key. |
| CORTEX_VERIFIER_MAX_TOKENS | 16384 | Verifier output budget. Generous default, since the verifier can be verbose. |
Question types are: single-session-user, single-session-assistant, multi-session, open-domain. Enable the verifier selectively, where false confidence is most expensive:
# Verify only multi-session and open-domain answers (where the failure rate is highest)
export CORTEX_ANSWER_VERIFIER_TYPES=multi-session,open-domain
export CORTEX_VERIFIER_MODEL=gpt-4.1
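The interaction between the two enable variables can be sketched as a small predicate. This is an illustration in Python, not the server's code: `verifier_enabled_for` is a hypothetical name, and two points are assumptions rather than documented behavior: that the legacy flag accepts the same 1 / true values as CORTEX_LLM_DISABLE, and that it enables the verifier for every question type.

```python
import os


def verifier_enabled_for(question_type, env=None):
    """Decide whether the verifier should run for one question type."""
    env = os.environ if env is None else env
    raw_types = env.get("CORTEX_ANSWER_VERIFIER_TYPES", "")
    selected = {t.strip() for t in raw_types.split(",") if t.strip()}
    if selected:
        # The per-type list overrides the legacy flag when both are set.
        return question_type in selected
    # Assumption: legacy flag uses 1 / true and applies to all question types.
    legacy = env.get("CORTEX_ANSWER_USE_VERIFIER", "").strip().lower()
    return legacy in {"1", "true"}
```

With `CORTEX_ANSWER_VERIFIER_TYPES=multi-session,open-domain`, the predicate is true for multi-session questions and false for single-session-user questions, regardless of the legacy flag.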
Why a different family: if the verifier is the same model as the generator, it tends to share the generator's blind spots. We default to GPT-4.1 as the verifier for Claude Opus answers, and recommend the reverse pairing (a Claude verifier) when a GPT model generates the answer.
The fallback chain (in cortex.toml)
The [llm] section of cortex.toml declares a fallback chain that any LLM call (extraction or answer) can use when the primary provider is unreachable:
[llm]
provider = "openai"
endpoint = "" # empty → resolve from provider default
api_key = "" # empty → resolve from OPENAI_API_KEY env
model = "gpt-4o-mini"
fallback_provider = "anthropic"
fallback_endpoint = "" # empty → https://api.anthropic.com/v1
fallback_api_key = "" # empty → ANTHROPIC_API_KEY env
fallback_model = "claude-sonnet-4-6"
fallback_chain = ["openai", "anthropic", "google"] # tried in order on cascading failures
max_extraction_batch_size = 8 # entity-extraction items per LLM call
extraction_timeout_ms = 30000 # per-call timeout
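The "tried in order" behavior of fallback_chain can be sketched as a simple loop. The Python below is a generic ordered-failover pattern, not CortexDB's implementation: `ProviderUnavailable`, `call_with_fallback`, and `fake_call` are hypothetical names, and how the chain interacts with fallback_provider and the per-call timeout is not specified here.

```python
class ProviderUnavailable(Exception):
    """Raised by a provider call when the provider cannot be reached."""


def call_with_fallback(chain, call_provider, prompt):
    """Try each provider in order; return (provider, result) from the first success."""
    errors = {}
    for provider in chain:
        try:
            return provider, call_provider(provider, prompt)
        except ProviderUnavailable as exc:
            errors[provider] = exc  # remember why this provider was skipped
    raise RuntimeError(f"all providers failed: {list(errors)}")


def fake_call(provider, prompt):
    """Stand-in for a real client: simulates an outage at the first provider."""
    if provider == "openai":
        raise ProviderUnavailable("connection refused")
    return f"{provider} handled: {prompt}"


print(call_with_fallback(["openai", "anthropic", "google"], fake_call, "extract entities"))
```

In this simulated outage the call is served by the second entry in the chain, and a failure of every entry surfaces as a single error listing the providers that were tried.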
Resolution semantics:
- If api_key is set in TOML, use it. Else, look up the env var for the named provider (OPENAI_API_KEY for openai, ANTHROPIC_API_KEY for anthropic, LLM_API_KEY for anything else).
- Same pattern for endpoint: TOML field wins, else use the provider's canonical default.
Call-site CORTEX_* env vars take precedence over the TOML values: cortex.toml supplies the defaults, and any CORTEX_* variable that is set overrides them for its call site.
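The resolution rules above can be condensed into a few lines. The Python sketch below is a minimal illustration under stated assumptions, not the actual resolver: `resolve_llm` is a hypothetical name, the OpenAI endpoint default is taken from the CORTEX_LLM_URL default, and using CORTEX_LLM_MODEL as the call-site override for the extraction model is an assumption about how the two layers line up.

```python
import os

# Provider-specific key variables named in the resolution rules.
PROVIDER_KEY_ENV = {"openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}

# Canonical endpoints mentioned in this document; other providers are omitted.
PROVIDER_ENDPOINT = {
    "openai": "https://api.openai.com/v1",
    "anthropic": "https://api.anthropic.com/v1",
}


def resolve_llm(toml_llm, env=None):
    """Resolve the primary [llm] settings for the extraction call site."""
    env = os.environ if env is None else env
    provider = toml_llm["provider"]
    # TOML value wins; otherwise use the provider's env var, or LLM_API_KEY.
    api_key = toml_llm.get("api_key") or env.get(
        PROVIDER_KEY_ENV.get(provider, "LLM_API_KEY"))
    # Same pattern for the endpoint: TOML field, else the canonical default.
    endpoint = toml_llm.get("endpoint") or PROVIDER_ENDPOINT.get(provider)
    # Assumption: the call-site env var overrides the TOML model.
    model = env.get("CORTEX_LLM_MODEL") or toml_llm.get("model")
    return {"provider": provider, "api_key": api_key,
            "endpoint": endpoint, "model": model}
```

With the sample [llm] section and only OPENAI_API_KEY set, this resolves to the OpenAI endpoint, the env-provided key, and gpt-4o-mini; setting CORTEX_LLM_MODEL changes the model without touching the file.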
Cost ladder
Sample monthly cost for an agent that writes 10 K events/day and answers 1 K queries/day:
| Configuration | Embedding | Extraction | Answer | Verifier | Total |
|---|---|---|---|---|---|
| Cost-Optimized (all GPT-4o-mini) | $5 | $8 | $10 | — | $23 |
| Default (sm embed + 4o-mini ext + Opus ans) | $5 | $8 | $90 | — | $103 |
| With verifier on multi-session | $5 | $8 | $90 | $30 | $133 |
| Premium (lg embed + 4o ext + Opus + Verifier) | $20 | $80 | $90 | $30 | $220 |
The Opus answer call dominates; cutting it is the largest available cost lever. Sonnet (~$18/mo at the same volume) is the right swap for cost-sensitive deployments that still want a strong answer model.
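To see how much of each bill the answer call accounts for, the table rows can be summed directly. The Python sketch below uses only the monthly figures from the table and the ~$18 Sonnet estimate above; `LADDER`, `total`, and `answer_share` are hypothetical names.

```python
# Monthly component costs in USD, copied from the table above (None = not used).
LADDER = {
    "Cost-Optimized": {"embedding": 5, "extraction": 8, "answer": 10, "verifier": None},
    "Default": {"embedding": 5, "extraction": 8, "answer": 90, "verifier": None},
    "With verifier": {"embedding": 5, "extraction": 8, "answer": 90, "verifier": 30},
    "Premium": {"embedding": 20, "extraction": 80, "answer": 90, "verifier": 30},
}


def total(cfg):
    """Sum the components that are in use."""
    return sum(v for v in cfg.values() if v is not None)


def answer_share(cfg):
    """Fraction of the monthly bill spent on the answer call."""
    return cfg["answer"] / total(cfg)


for name, cfg in LADDER.items():
    print(f"{name}: ${total(cfg)}/mo, answer call = {answer_share(cfg):.0%}")

# Swap Opus for Sonnet in the Default row, using the ~$18/mo estimate.
sonnet_default = dict(LADDER["Default"], answer=18)
print(f"Default with Sonnet: ${total(sonnet_default)}/mo")
```

For the Default row this gives $103 per month with the answer call at roughly 87% of the bill; with Sonnet at ~$18 the same row totals $31.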
Diagnostics
Every authenticated /v1/answer response includes the model that produced it:
{
  "answer": "...",
  "citations": [...],
  "model_used": "claude-opus-4-6"
}
If model_used doesn't match what you set, one of the following is the likely cause:
- Your env var has a typo and the binary fell back to the compiled default.
- The fallback chain triggered (primary provider was unreachable).
- A scope-level policy override is forcing a different model. Check /v1/policy/effective?actor=....
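A client-side check for this mismatch might look like the Python sketch below. It assumes only the response shape shown above; `check_model_used` is a hypothetical helper, and fetching the response body is left to whatever HTTP client you already use.

```python
import json


def check_model_used(response_body, expected_model):
    """Return None when the expected model answered, else a short warning."""
    payload = json.loads(response_body)
    actual = payload.get("model_used")
    if actual == expected_model:
        return None
    return (f"expected {expected_model!r}, got {actual!r}: check for an env-var "
            "typo, a triggered fallback, or a policy override")
```

Logging the returned warning on every mismatch makes silent fallbacks visible before they show up as a quality or cost surprise.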
Next steps
- Embeddings — the vector side of the inference bill
- Recall Tuning — how to feed the answer model better context
- Profiles & Presets — see the Cost-Optimized profile for an all-mini deployment