# vLLM Provider

Use a self-hosted vLLM server for CortexDB embeddings and entity extraction.

Run CortexDB's embedding and entity-extraction pipelines on your own vLLM inference server, with full control over models, hardware, and data residency.
## Overview

vLLM is a high-throughput inference engine for large language models. Its OpenAI-compatible server lets you serve any supported model. This integration configures CortexDB to use vLLM for:

- **Embedding generation** — serve any embedding model via vLLM
- **Entity extraction** — serve any chat model for relationship extraction
## Installation

```bash
pip install "cortexdbai[vllm]"
```
## Setting Up vLLM

```bash
# Install vLLM
pip install vllm

# Serve an embedding model
python -m vllm.entrypoints.openai.api_server \
  --model BAAI/bge-large-en-v1.5 \
  --port 8000

# Or serve a chat model for entity extraction
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --port 8001
```
## Configuration

| Environment Variable | Default | Description |
|---|---|---|
| `CORTEX_VLLM_URL` | `http://localhost:8000` | vLLM server URL |
| `CORTEX_VLLM_EMBED_MODEL` | (none, required) | Embedding model served by vLLM |
| `CORTEX_VLLM_CHAT_MODEL` | (none) | Chat model for entity extraction |
| `CORTEX_VLLM_API_KEY` | (none, optional) | API key, if the vLLM server requires authentication |
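As a sketch of how these variables might be consumed, the helper below mirrors the table's defaults and required fields. Note that `load_vllm_settings` is purely illustrative; the SDK reads these variables internally.

```python
import os


def load_vllm_settings() -> dict:
    """Collect vLLM-related settings from the environment.

    Illustrative helper only: mirrors the defaults and
    required/optional markers in the configuration table.
    """
    embed_model = os.environ.get("CORTEX_VLLM_EMBED_MODEL")
    if not embed_model:
        raise ValueError("CORTEX_VLLM_EMBED_MODEL is required")
    return {
        "base_url": os.environ.get("CORTEX_VLLM_URL", "http://localhost:8000"),
        "embed_model": embed_model,
        "chat_model": os.environ.get("CORTEX_VLLM_CHAT_MODEL"),  # optional
        "api_key": os.environ.get("CORTEX_VLLM_API_KEY"),        # optional
    }
```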
## Usage Example

`async with` is only valid inside a coroutine, so the provider is used within an `async` function:

```python
import asyncio

from cortexdb_vllm import VLLMConfig, VLLMEmbeddingProvider

config = VLLMConfig(
    base_url="http://localhost:8000",
    embed_model="BAAI/bge-large-en-v1.5",
)


async def main() -> None:
    async with VLLMEmbeddingProvider(config=config, dimension=1024) as provider:
        # Embed a single query string
        embedding = await provider.embed_query("What is event sourcing?")
        print(f"Dimension: {provider.dimension}")  # 1024

        # Embed a batch of documents
        embeddings = await provider.embed([
            "Event sourcing stores all changes as events.",
            "CQRS separates reads from writes.",
        ])


asyncio.run(main())
```
## Self-Hosted Setup

vLLM is self-hosted by design. For production deployments:

```bash
# Serve with multiple GPUs
python -m vllm.entrypoints.openai.api_server \
  --model BAAI/bge-large-en-v1.5 \
  --tensor-parallel-size 2 \
  --port 8000 \
  --host 0.0.0.0

# Optional: add API key authentication
python -m vllm.entrypoints.openai.api_server \
  --model BAAI/bge-large-en-v1.5 \
  --api-key your-secret-key \
  --port 8000
```

Point CortexDB at your vLLM instance:

```bash
export CORTEX_VLLM_URL=http://gpu-server:8000
export CORTEX_VLLM_EMBED_MODEL=BAAI/bge-large-en-v1.5
export CORTEX_VLLM_API_KEY=your-secret-key  # if auth is enabled
```
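Before pointing CortexDB at the server, it can help to confirm vLLM is up and serving the expected model. The sketch below queries vLLM's OpenAI-compatible `/v1/models` endpoint; the `served_model_ids` and `check_vllm` helpers are illustrative, not part of the SDK.

```python
import json
import urllib.request


def served_model_ids(models_response: dict) -> list:
    """Extract model ids from an OpenAI-style /v1/models response."""
    return [m["id"] for m in models_response.get("data", [])]


def check_vllm(base_url: str, api_key: str = None) -> list:
    """Return the model ids served by a vLLM instance at base_url."""
    req = urllib.request.Request(f"{base_url}/v1/models")
    if api_key:
        req.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(req, timeout=5) as resp:
        return served_model_ids(json.load(resp))


# Requires a running server, e.g.:
# check_vllm("http://gpu-server:8000", api_key="your-secret-key")
```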
## Switching Providers

To switch CortexDB from the default OpenAI embeddings to vLLM:

```python
from cortexdb_vllm import VLLMEmbeddingProvider

provider = VLLMEmbeddingProvider()  # reads configuration from environment variables
```

All CortexDB embedding providers implement the same interface (`embed`, `embed_query`, `dimension`, `model_name`), so switching is a one-line change.
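That shared interface can be pictured as a structural protocol along these lines. This is a sketch for orientation only; the SDK's actual base class or protocol may differ in details.

```python
from typing import Protocol, Sequence


class EmbeddingProviderProtocol(Protocol):
    """Structural sketch of the common embedding-provider interface."""

    @property
    def dimension(self) -> int:
        """Length of the vectors this provider produces."""
        ...

    @property
    def model_name(self) -> str:
        """Identifier of the model being served."""
        ...

    async def embed_query(self, text: str) -> list:
        """Embed a single query string."""
        ...

    async def embed(self, texts: Sequence) -> list:
        """Embed a batch of documents."""
        ...
```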
## Under the Hood

When using the vLLM provider, the SDK translates your calls into REST API requests against the CortexDB and vLLM endpoints.
### Storing a memory (`remember`)

```bash
# SDK: cortex.remember("We use mTLS for all inter-service communication.")
curl -X POST https://api.cortexdb.ai/v1/remember \
  -H "Authorization: Bearer $CORTEX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "We use mTLS for all inter-service communication.",
    "tenant_id": "my-app"
  }'
# Returns: { "event_id": "evt_abc123" }
```
### Retrieving context (`recall`)

```bash
# SDK: result = cortex.recall("How do services authenticate with each other?")
# result.context, result.confidence, result.latency_ms
curl -X POST https://api.cortexdb.ai/v1/recall \
  -H "Authorization: Bearer $CORTEX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How do services authenticate with each other?",
    "tenant_id": "my-app"
  }'
# Returns: { "context": "...", "confidence": 0.89, "latency_ms": 55 }
```
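The same two calls can be issued from plain Python with only the standard library. The `build_request` helper below is hypothetical; it simply assembles a POST request with the headers and JSON bodies shown in the curl examples above.

```python
import json
import urllib.request

API_BASE = "https://api.cortexdb.ai/v1"


def build_request(endpoint: str, body: dict, api_key: str) -> urllib.request.Request:
    """Assemble a POST request matching the curl examples above."""
    return urllib.request.Request(
        f"{API_BASE}/{endpoint}",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# remember: store a memory
remember_req = build_request(
    "remember",
    {"content": "We use mTLS for all inter-service communication.",
     "tenant_id": "my-app"},
    api_key="$CORTEX_API_KEY",
)

# recall: retrieve context; send with urllib.request.urlopen(remember_req)
```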
### Generating embeddings (vLLM)

```bash
# The provider calls vLLM's OpenAI-compatible embedding endpoint
curl -X POST http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "BAAI/bge-large-en-v1.5",
    "input": "How do services authenticate with each other?"
  }'
# Returns: { "data": [{ "embedding": [0.123, -0.456, ...] }] }
```
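For reference, the same embedding request from Python. The `build_embed_payload` and `request_embedding` helpers are illustrative wrappers around vLLM's OpenAI-compatible endpoint, not SDK functions.

```python
import json
import urllib.request


def build_embed_payload(text: str, model: str) -> dict:
    """Request body for vLLM's OpenAI-compatible /v1/embeddings endpoint."""
    return {"model": model, "input": text}


def request_embedding(text: str, model: str,
                      base_url: str = "http://localhost:8000") -> list:
    """Send the request and unpack the first embedding vector."""
    req = urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=json.dumps(build_embed_payload(text, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["data"][0]["embedding"]
```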