Route /v1/answer through a self-hosted vLLM server.

vLLM Integration

vLLM serves LLMs locally with high throughput and an OpenAI-compatible API. Pair it with CortexDB for fully self-hosted inference.

Run vLLM

vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --host 0.0.0.0 --port 8000 \
  --api-key my-vllm-key

CortexDB deployment configuration

CORTEX_LLM_URL=http://vllm-host:8000/v1
CORTEX_LLM_API_KEY=my-vllm-key
CORTEX_LLM_MODEL=meta-llama/Llama-3.3-70B-Instruct

Per-request override

client.answer(
    scope="org:acme/user:alice",
    question="What did we decide?",
    answer_model="vllm/meta-llama/Llama-3.3-70B-Instruct",
)

Together with Ollama for embeddings, this gives you a 100% self-hosted CortexDB stack with no third-party LLM calls.

See also