Use a locally-hosted Ollama model as the answer model and entity-extraction LLM for CortexDB.

Ollama Integration

CortexDB routes its LLM calls (entity extraction, episode synthesis, /v1/answer) through the configured LLM router. Point that router at an Ollama instance to keep all inference local — no data leaves your network.

Deployment configuration

Set these on the cortex server (typically in /root/cortex/.env for self-hosted):

CORTEX_LLM_URL=http://localhost:11434/v1
CORTEX_LLM_MODEL=llama3.1:70b
CORTEX_LLM_API_KEY=ollama                     # placeholder; Ollama ignores it
CORTEX_EMBEDDING_PROVIDER=ollama
CORTEX_EMBEDDING_MODEL=nomic-embed-text
CORTEX_EMBEDDING_DIMS=768

Restart the cortex service to pick up the new config:

systemctl restart cortexdb.service

Per-request answer-model override

Once the router knows about Ollama, you can pin individual /v1/answer calls to specific local models:

client.answer(
    scope="org:acme/user:alice",
    question="What did we decide about the launch?",
    answer_model="ollama/llama3.1:70b",
)

The answer_model field accepts any <provider>/<model-name> the configured router supports.

Pulling models

ollama pull llama3.1:70b
ollama pull nomic-embed-text

For benchmark-grade accuracy on LongMemEval-S we recommend the largest model your GPU fits (Llama 3.1 70B or Qwen 2.5 72B). Smaller models will run faster but lose 5–15 pp on multi-session reasoning.

Ollama Integration

Deployment configuration

Per-request answer-model override

Pulling models

See also