Route /v1/answer through a self-hosted vLLM server.
vLLM Integration
vLLM serves LLMs locally with high throughput and an OpenAI-compatible API. Pair it with CortexDB for fully self-hosted inference.
Run vLLM
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--host 0.0.0.0 --port 8000 \
--api-key my-vllm-key
CortexDB deployment configuration
CORTEX_LLM_URL=http://vllm-host:8000/v1
CORTEX_LLM_API_KEY=my-vllm-key
CORTEX_LLM_MODEL=meta-llama/Llama-3.3-70B-Instruct
Per-request override
client.answer(
scope="org:acme/user:alice",
question="What did we decide?",
answer_model="vllm/meta-llama/Llama-3.3-70B-Instruct",
)
Together with Ollama for embeddings, this gives you a 100% self-hosted CortexDB stack with no third-party LLM calls.