# vLLM Provider

Use a self-hosted vLLM server for CortexDB embeddings and entity extraction: run CortexDB's embedding and entity-extraction pipelines on your own vLLM inference server, with full control over models, hardware, and data residency.
## Overview
vLLM is a high-throughput inference engine for large language models. Its OpenAI-compatible server lets you serve any supported model. This integration configures CortexDB to use vLLM for:
- Embedding generation — serve any embedding model via vLLM
- Entity extraction — serve any chat model for relationship extraction
## Installation

```bash
pip install cortexdb-vllm
```
## Setting Up vLLM

```bash
# Install vLLM
pip install vllm

# Serve an embedding model
python -m vllm.entrypoints.openai.api_server \
  --model BAAI/bge-large-en-v1.5 \
  --port 8000

# Or serve a chat model for entity extraction
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --port 8001
```
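Once a server is up, you can sanity-check it directly over HTTP before wiring in CortexDB. The sketch below builds a request against the `/v1/embeddings` endpoint of vLLM's OpenAI-compatible API; the port and model name match the embedding server started above.

```python
import json
import urllib.request

def build_embed_request(base_url: str, model: str, texts: list[str]) -> urllib.request.Request:
    """Build a POST to the OpenAI-compatible embeddings endpoint vLLM serves."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    req = build_embed_request(
        "http://localhost:8000", "BAAI/bge-large-en-v1.5", ["What is event sourcing?"]
    )
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            result = json.load(resp)
        # Each entry in result["data"] carries an "embedding" vector
        print("embedding length:", len(result["data"][0]["embedding"]))
    except OSError as exc:  # server not running or unreachable
        print("vLLM server not reachable:", exc)
```

If the server is healthy, the embedding length printed should match your model's output dimension (1024 for `BAAI/bge-large-en-v1.5`).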
## Configuration
| Environment Variable | Default | Description |
|---|---|---|
| CORTEX_VLLM_URL | http://localhost:8000 | vLLM server URL |
| CORTEX_VLLM_EMBED_MODEL | (none, required) | Embedding model served by vLLM |
| CORTEX_VLLM_CHAT_MODEL | (none) | Chat model for entity extraction |
| CORTEX_VLLM_API_KEY | (none, optional) | API key if vLLM server requires auth |
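To make the table concrete, here is a hypothetical sketch (not the package's actual code) of how these variables resolve: `CORTEX_VLLM_EMBED_MODEL` is required, while the others fall back to their defaults or remain unset. The helper name is illustrative only.

```python
import os

def resolve_vllm_settings(env=None):
    """Illustrative resolution of the environment variables in the table above."""
    env = os.environ if env is None else env
    embed_model = env.get("CORTEX_VLLM_EMBED_MODEL")
    if not embed_model:
        raise ValueError("CORTEX_VLLM_EMBED_MODEL must be set")
    return {
        "base_url": env.get("CORTEX_VLLM_URL", "http://localhost:8000"),
        "embed_model": embed_model,
        "chat_model": env.get("CORTEX_VLLM_CHAT_MODEL"),  # optional
        "api_key": env.get("CORTEX_VLLM_API_KEY"),        # optional
    }
```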
## Usage Example

```python
import asyncio

from cortexdb_vllm import VLLMEmbeddingProvider, VLLMConfig

config = VLLMConfig(
    base_url="http://localhost:8000",
    embed_model="BAAI/bge-large-en-v1.5",
)

async def main():
    async with VLLMEmbeddingProvider(config=config, dimension=1024) as provider:
        # Embed a single query string
        embedding = await provider.embed_query("What is event sourcing?")
        print(f"Dimension: {provider.dimension}")  # 1024

        # Embed a batch of documents
        embeddings = await provider.embed([
            "Event sourcing stores all changes as events.",
            "CQRS separates reads from writes.",
        ])

asyncio.run(main())
```
## Self-Hosted Setup

vLLM is self-hosted by design. For production deployments:

```bash
# Serve with multiple GPUs
python -m vllm.entrypoints.openai.api_server \
  --model BAAI/bge-large-en-v1.5 \
  --tensor-parallel-size 2 \
  --port 8000 \
  --host 0.0.0.0

# Optional: add API key authentication
python -m vllm.entrypoints.openai.api_server \
  --model BAAI/bge-large-en-v1.5 \
  --api-key your-secret-key \
  --port 8000
```
Point CortexDB at your vLLM instance:

```bash
export CORTEX_VLLM_URL=http://gpu-server:8000
export CORTEX_VLLM_EMBED_MODEL=BAAI/bge-large-en-v1.5
export CORTEX_VLLM_API_KEY=your-secret-key  # if auth is enabled
```
## Switching Providers

To switch CortexDB from the default OpenAI embeddings to vLLM:

```python
from cortexdb_vllm import VLLMEmbeddingProvider

provider = VLLMEmbeddingProvider()  # reads configuration from environment variables
```

All CortexDB embedding providers implement the same interface (`embed`, `embed_query`, `dimension`, `model_name`), so switching is a one-line change.
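That shared interface can be sketched as a structural type. The member names come from this page; the exact signatures below are assumptions for illustration.

```python
from typing import Protocol

class EmbeddingProviderLike(Protocol):
    """Illustrative sketch of the interface shared by CortexDB embedding providers."""
    dimension: int
    model_name: str

    async def embed(self, texts: list[str]) -> list[list[float]]:
        """Embed a batch of documents."""
        ...

    async def embed_query(self, text: str) -> list[float]:
        """Embed a single query string."""
        ...
```

Any object satisfying this shape can be dropped in wherever CortexDB expects an embedding provider, which is what makes the swap a one-line change.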