Every embedding-related knob: provider, model, dimensions, batch size, retries, and the constraints that link them to the HNSW index.

Embeddings

CortexDB calls an embedding service on every event ingest (one vector per chunk) and on every recall query (one vector for the query, plus N for HyDE-expanded queries). Embeddings are the single largest line item in your inference bill for most workloads.

The core constraint

The embedding service's output dimension must match the engine's vector index dimension. They are configured separately and CortexDB does not check them against each other at startup — a mismatch produces silent recall failures (everything returns 0 results).

Setting	Where	Default
Embedding output dim	`CORTEX_EMBEDDING_DIMS` env var	`1536`
Index storage dim	`cortex.toml` → `[engine] vector_dimensions`	`3072`

The defaults don't match each other. This is a known pitfall: the env-var default targets text-embedding-3-small (1536), and the TOML default targets text-embedding-3-large (3072). Pick one model and set both consistently. The vector_dimensions field accepts only {256, 384, 512, 768, 1024, 1536, 3072} — anything else fails schema validation at startup.

A correct minimal config for text-embedding-3-small:

# cortex.toml
[engine]
vector_dimensions = 1536

# shell environment (must agree with vector_dimensions above)
export CORTEX_EMBEDDING_MODEL=text-embedding-3-small
export CORTEX_EMBEDDING_DIMS=1536

Provider selection

export CORTEX_EMBEDDING_PROVIDER=cohere   # or: mock, ollama; leave unset for the default HTTP service

Value	Behavior	Required env
(empty / unset)	HTTP service against `CORTEX_EMBEDDING_URL`. Default.	`OPENAI_API_KEY` or `LLM_API_KEY`
`mock`	Deterministic mock embeddings (384-d). Development only.	—
`cohere`	Cohere's embed API.	`COHERE_API_KEY`
`ollama` (also: any URL containing `:11434`)	Local Ollama daemon. No auth required.	`CORTEX_EMBEDDING_URL=http://localhost:11434`

The "any URL containing :11434" detection is a convenience — if you set CORTEX_EMBEDDING_URL=http://localhost:11434 without explicitly setting the provider, you get Ollama mode automatically.
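
The selection rules in the table can be sketched as a small function. This is an illustration only: `resolve_provider` is a hypothetical name, and the precedence it uses (explicit provider, then the `:11434` URL check, then API keys, then the mock fallback described under "When the API key is missing") is an assumption, not CortexDB's actual code.

```python
def resolve_provider(env: dict[str, str]) -> str:
    """Guess which embedding mode a given environment selects.

    Assumed precedence: explicit provider > ':11434' in the URL >
    OpenAI-style API key > mock fallback.
    """
    explicit = env.get("CORTEX_EMBEDDING_PROVIDER", "").strip().lower()
    if explicit in {"mock", "cohere", "ollama"}:
        return explicit
    if ":11434" in env.get("CORTEX_EMBEDDING_URL", ""):
        return "ollama"  # convenience detection on the default Ollama port
    if env.get("OPENAI_API_KEY") or env.get("LLM_API_KEY"):
        return "http"  # OpenAI-compatible HTTP service, the default mode
    return "mock"


print(resolve_provider({"CORTEX_EMBEDDING_URL": "http://localhost:11434"}))
print(resolve_provider({"OPENAI_API_KEY": "sk-example"}))
print(resolve_provider({}))
```

How a lone `COHERE_API_KEY` without an explicit provider is treated is not specified in this page; the sketch treats that case as mock.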

The provider pin

On first boot, the server writes an embedding_provider.pin file (recording provider:model:dims) into the data directory. On every later start, the configured provider is checked against the pin — a mismatch is startup-fatal, with guidance to re-index. This means a missing API key can no longer silently downgrade a real corpus to mock vectors.

To deliberately migrate a data directory to a different embedding provider or model, set CORTEX_EMBEDDING_ALLOW_REPIN=1 for one restart (and plan to re-embed the corpus — vectors from different models are not comparable).
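
The pin logic amounts to comparing a stored triple against the configured one. Here is a minimal sketch under stated assumptions: the on-disk format is taken to be the literal string `provider:model:dims`, and `Pin`, `parse_pin`, and `check_pin` are hypothetical names rather than CortexDB internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pin:
    provider: str
    model: str
    dims: int


def parse_pin(text: str) -> Pin:
    """Parse 'provider:model:dims' (assumed format).

    Splitting from both ends keeps model names that contain ':' intact,
    for example an Ollama tag such as 'nomic-embed-text:latest'.
    """
    provider, rest = text.strip().split(":", 1)
    model, dims = rest.rsplit(":", 1)
    return Pin(provider, model, int(dims))


def check_pin(pinned: Pin, configured: Pin, allow_repin: bool) -> Pin:
    """Return the pin to use, or raise if the config drifted from the pin."""
    if pinned == configured:
        return pinned
    if allow_repin:
        # Deliberate migration: adopt the new pin. The corpus still has to
        # be re-embedded, because old and new vectors are not comparable.
        return configured
    raise RuntimeError(
        f"embedding pin mismatch: data dir is {pinned}, config is {configured}; "
        "re-index or set CORTEX_EMBEDDING_ALLOW_REPIN=1 for one restart"
    )


old = parse_pin("mock:mock:384")
new = Pin("openai", "text-embedding-3-small", 1536)
print(check_pin(old, new, allow_repin=True))
```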

When the API key is missing

If none of OPENAI_API_KEY, LLM_API_KEY, or COHERE_API_KEY is set and Ollama isn't detected, the binary boots on mock embeddings (384-d vectors) with a loud warning, and the data directory is pinned as mock. The readiness endpoint (GET /v1/admin/ready) reports degraded: true and the pinned provider, so orchestrators and smoke tests can catch it.

Mock embeddings are a development convenience and produce meaningless recall. CORTEX_EMBEDDING_PROVIDER=mock is the explicit way to request them; if you didn't ask for mock, check your startup logs and readiness output. Because of the pin, adding a real API key later to a mock-pinned data dir will refuse to start until you repin (see above) or point at a fresh data dir.
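
A smoke test can poll the readiness endpoint and refuse to proceed when the server is degraded. The sketch below is hedged: it assumes the response is JSON with a top-level boolean field named `degraded` (this page only says the endpoint "reports degraded: true"), and both function names are hypothetical.

```python
import json
import urllib.request


def is_degraded(ready_payload: dict) -> bool:
    """True when the readiness body reports degraded (mock) embeddings.

    Assumes a top-level boolean field named "degraded".
    """
    return bool(ready_payload.get("degraded", False))


def fetch_ready(base_url: str, timeout_s: float = 5.0) -> dict:
    """GET {base_url}/v1/admin/ready and decode the JSON body.

    urllib raises HTTPError for non-2xx responses, so a server that signals
    "not ready" through its status code surfaces here as an exception.
    """
    url = base_url.rstrip("/") + "/v1/admin/ready"
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return json.load(resp)
```

A deploy script could then call `is_degraded(fetch_ready(base_url))`, with `base_url` set to wherever your CortexDB instance listens, and abort when it returns `True`.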

The full env-var surface

Env var	Default	What it controls
`CORTEX_EMBEDDING_URL`	`https://api.openai.com/v1`	Base URL of the embedding HTTP API.
`CORTEX_EMBEDDING_MODEL`	`text-embedding-3-small`	Model name passed to the provider.
`CORTEX_EMBEDDING_DIMS`	`1536`	Output dimension. Must match `engine.vector_dimensions`.
`CORTEX_EMBEDDING_PROVIDER`	(empty)	`mock`, `cohere`, `ollama`, or empty for OpenAI-compatible HTTP.
`CORTEX_EMBEDDING_ALLOW_REPIN`	(unset)	Set `=1` for one restart to deliberately change the pinned provider/model/dims.
`CORTEX_EMBEDDING_MAX_BATCH_ITEMS`	`2048`	Max items per provider call before the client splits the batch.
`CORTEX_EMBEDDING_RETRY_ATTEMPTS`	`1`	How many times to retry after a transient error before failing the request.
`CORTEX_EMBEDDING_RETRY_BASE_DELAY_MS`	`250`	Initial backoff between retries (exponential).
`OPENAI_API_KEY`	(none)	Primary key for OpenAI / HTTP mode.
`LLM_API_KEY`	(none)	Generic fallback if `OPENAI_API_KEY` not set.
`COHERE_API_KEY`	(none)	Required if `CORTEX_EMBEDDING_PROVIDER=cohere`.

Choosing a model

Model	Provider	Dims	Cost / 1M tok	When to pick it
`text-embedding-3-small`	OpenAI	1536	$0.020	Default. Used in our published 93.8% number.
`text-embedding-3-large`	OpenAI	3072	$0.130	~+0.4pp on LongMemEval-S, ~6.5× per-token price.
`embed-multilingual-v3.0`	Cohere	1024	$0.100	Strong on non-English content.
`nomic-embed-text`	Ollama (local)	768	$0	Local, free, ~5pp worse than OpenAI small.
`mxbai-embed-large`	Ollama (local)	1024	$0	Best local option; ~3pp worse than OpenAI small.
`bge-large-en-v1.5`	Ollama (local)	1024	$0	Strong English-only local; similar to mxbai.

The published LongMemEval-S configuration uses text-embedding-3-small not because it's the best, but because the +0.4pp from large didn't justify paying roughly 6.5× the per-token price ($0.130 vs. $0.020 per 1M tokens). We wanted the published config to be the one we'd recommend to most users. If your budget allows it, swap to large.
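
The trade-off is easy to check with arithmetic on the prices in the table. This is a small sketch with a hypothetical corpus size; `embedding_cost_usd` is an illustrative helper, and the figures cover a single embedding pass only (no re-embeds, no query traffic).

```python
# Prices in USD per 1M tokens, copied from the table above.
PRICE_PER_1M = {
    "text-embedding-3-small": 0.020,
    "text-embedding-3-large": 0.130,
    "embed-multilingual-v3.0": 0.100,
}


def embedding_cost_usd(model: str, tokens: int) -> float:
    """Cost of embedding `tokens` tokens once with `model`."""
    return tokens / 1_000_000 * PRICE_PER_1M[model]


corpus_tokens = 500_000_000  # hypothetical corpus size
for name in PRICE_PER_1M:
    print(f"{name}: ${embedding_cost_usd(name, corpus_tokens):,.2f}")
```

For the hypothetical 500M-token corpus this comes to about $10 for small and $65 for large.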

Batch and retry tuning

The embedding service collects pending requests and packs them into provider calls of up to CORTEX_EMBEDDING_MAX_BATCH_ITEMS items. Larger batches reduce per-request overhead and improve throughput, but increase the latency for the first request in a batch.

Workload	`MAX_BATCH_ITEMS`	`RETRY_ATTEMPTS`	Why
Realtime / voice	`256`	`1`	Requests rarely accumulate into large batches, so waiting to fill one only adds latency; fail fast
Default / mixed	`2048`	`1`	Compiled default, good for most
Batch / ingest	`4096`	`3`	Pack OpenAI calls, tolerate 429s with backoff
Cohere	`96`	`2`	Cohere's per-call cap is lower than OpenAI's

OpenAI's text-embedding-3-* models accept up to 2048 inputs per call. Setting CORTEX_EMBEDDING_MAX_BATCH_ITEMS higher than that doesn't error, because the client splits oversized batches transparently, but it also buys no further amortization of per-request overhead: each provider call still carries at most 2048 inputs.
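
To make the interaction between the batch limit, the provider cap, and the retry settings concrete, here is a minimal sketch. Both helpers are hypothetical, and two details are assumptions: that the client splits into full-size calls plus one remainder, and that "exponential" backoff means the delay doubles on each retry.

```python
def split_batches(n_items: int, max_batch_items: int, provider_cap: int) -> list[int]:
    """Sizes of the provider calls needed to embed n_items inputs."""
    per_call = min(max_batch_items, provider_cap)
    full, rest = divmod(n_items, per_call)
    return [per_call] * full + ([rest] if rest else [])


def backoff_delays_ms(retry_attempts: int, base_delay_ms: int) -> list[int]:
    """Delay before each retry, assuming the delay doubles every attempt."""
    return [base_delay_ms * 2**i for i in range(retry_attempts)]


# 5000 pending items with MAX_BATCH_ITEMS=4096 against a 2048-input cap.
print(split_batches(5000, max_batch_items=4096, provider_cap=2048))
# RETRY_ATTEMPTS=3 with the default 250 ms base delay.
print(backoff_delays_ms(retry_attempts=3, base_delay_ms=250))
```

Under these assumptions the first call prints `[2048, 2048, 904]`, the same split you would get with a limit of 2048, and the second prints `[250, 500, 1000]`.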

Local-first with Ollama

For development without an API key:

# Install Ollama and pull the model
ollama pull nomic-embed-text

# Point CortexDB at it
export CORTEX_EMBEDDING_URL=http://localhost:11434
export CORTEX_EMBEDDING_MODEL=nomic-embed-text
export CORTEX_EMBEDDING_DIMS=768

# cortex.toml — match the dim
[engine]
vector_dimensions = 768

Ollama runs on the same machine as CortexDB. Expect ~30 ms / embedding on a modern CPU, much faster on a GPU. For bulk ingest, expect throughput rather than per-request latency to be the bottleneck.
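
Before starting CortexDB, it can help to confirm that the local model really produces vectors of the dimension you configured. The sketch below calls Ollama's `/api/embeddings` endpoint directly (newer Ollama releases also offer `/api/embed`); the function names are hypothetical, and it assumes the daemon is reachable at the URL you pass in.

```python
import json
import urllib.request


def embedding_length(response_body: dict) -> int:
    """Length of the vector in an Ollama /api/embeddings response."""
    return len(response_body["embedding"])


def probe_ollama_dims(base_url: str, model: str, timeout_s: float = 30.0) -> int:
    """Embed a short string and return the vector length the model produces."""
    payload = json.dumps({"model": model, "prompt": "dimension probe"}).encode()
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/embeddings",
        data=payload,  # supplying a body makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return embedding_length(json.load(resp))
```

With the setup above, `probe_ollama_dims("http://localhost:11434", "nomic-embed-text")` should return the same number you put in `CORTEX_EMBEDDING_DIMS` and `vector_dimensions`.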

Caching

All HTTP embedding services are wrapped in LruEmbeddingCache, an in-process LRU keyed by (model, text_hash). The cache lives only as long as the process does; it's lost on restart.

The cache size isn't currently env-configurable — it sits at the compiled default (~10 K entries). For workloads that re-embed the same text repeatedly (notably reruns of the same eval set), the cache is highly effective; for cold ingest workloads it's near-useless.
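
To illustrate why reruns hit the cache and cold ingest does not, here is a minimal LRU keyed by (model, text_hash). It is a sketch of the general technique, not CortexDB's LruEmbeddingCache: the class name, the SHA-256 hash, and the hit/miss counters are all assumptions made for the example.

```python
import hashlib
from collections import OrderedDict
from typing import Callable


class LruEmbeddingCacheSketch:
    """Illustrative LRU keyed by (model, text_hash)."""

    def __init__(self, capacity: int = 10_000) -> None:
        self.capacity = capacity
        self._entries: OrderedDict[tuple[str, str], list[float]] = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, text: str) -> tuple[str, str]:
        # SHA-256 is an assumption; the text above only says "text_hash".
        return model, hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_or_embed(self, model: str, text: str,
                     embed: Callable[[str], list[float]]) -> list[float]:
        key = self._key(model, text)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        vector = embed(text)  # only called on a miss
        self._entries[key] = vector
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return vector


cache = LruEmbeddingCacheSketch(capacity=2)
fake_embed = lambda text: [float(len(text))]  # stand-in for a provider call
for text in ["a", "bb", "a", "ccc", "bb"]:
    cache.get_or_embed("demo-model", text, fake_embed)
print(cache.hits, cache.misses)
```

Repeated texts are served from memory, while a stream of distinct texts (cold ingest) misses every time and only churns the cache.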

Diagnostics

On startup, the binary logs which embedding service it picked:

info  Using HTTP embedding service model=text-embedding-3-small dims=1536
info  Using Ollama local embedding service model=nomic-embed-text dims=768 url=http://localhost:11434
info  Using Cohere embedding service model=embed-multilingual-v3.0 dims=1024
warn  No OPENAI_API_KEY or LLM_API_KEY set -- falling back to mock embeddings.

Always grep for "embedding" in your startup logs after changing config (the mock-fallback warning says "mock embeddings", so searching for "embedding service" alone would miss it), and check GET /v1/admin/ready, which reports the pinned embedding provider and degraded: true when the corpus is on mock. Recall failures are usually caused by a missing API key or a dim mismatch; provider/model changes against an existing data dir fail fast at startup thanks to the provider pin.
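
The log check can be automated in a smoke test. The sketch below parses only the four line shapes shown above; `detect_embedding_service` is a hypothetical helper, and any other log format (for example, whatever an explicitly requested mock provider logs) falls through to "unknown".

```python
import re

# Matches the "Using ... embedding service" lines shown above.
SERVICE_RE = re.compile(
    r"Using (?P<kind>.+?) embedding service model=(?P<model>\S+) dims=(?P<dims>\d+)"
)


def detect_embedding_service(log_text: str) -> dict:
    """Summarize which embedding service a startup log reports."""
    for line in log_text.splitlines():
        if "falling back to mock embeddings" in line:
            return {"kind": "mock", "degraded": True}
        match = SERVICE_RE.search(line)
        if match:
            return {
                "kind": match["kind"],
                "model": match["model"],
                "dims": int(match["dims"]),
                "degraded": False,
            }
    return {"kind": "unknown", "degraded": False}


sample = "info  Using HTTP embedding service model=text-embedding-3-small dims=1536"
print(detect_embedding_service(sample))
```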

Next steps

LLM & Answer Generation — the entity-extraction and answer-generation LLMs
Recall Tuning — how embeddings feed into the recall pipeline
Storage & Cluster — the HNSW index that consumes these vectors