Use a self-hosted vLLM server for CortexDB embeddings and entity extraction.

vLLM Provider

Run CortexDB's embedding and entity-extraction pipelines on your own vLLM inference server. Full control over models, hardware, and data residency.

Overview

vLLM is a high-throughput inference engine for large language models that exposes an OpenAI-compatible HTTP server, so it can serve any model it supports behind a standard API. This integration configures CortexDB to use vLLM for:

  • Embedding generation — serve any embedding model via vLLM
  • Entity extraction — serve any chat model for entity and relationship extraction

Installation

pip install cortexdb-vllm

Setting Up vLLM

# Install vLLM
pip install vllm

# Serve an embedding model
python -m vllm.entrypoints.openai.api_server \
    --model BAAI/bge-large-en-v1.5 \
    --port 8000

# Or serve a chat model for entity extraction
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --port 8001
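Once a server is running, you can sanity-check it over its OpenAI-compatible HTTP API before wiring up CortexDB. The sketch below uses only the standard library; the endpoint path and payload shape follow the OpenAI embeddings API that vLLM mirrors, while the helper names and defaults are illustrative, not part of cortexdb-vllm:

```python
import json
import urllib.request

def build_embed_payload(texts, model):
    # Request body for the OpenAI-style /v1/embeddings endpoint.
    return {"model": model, "input": texts}

def embed_via_vllm(texts, base_url="http://localhost:8000",
                   model="BAAI/bge-large-en-v1.5"):
    """Send texts to a running vLLM server and return their embedding vectors."""
    req = urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=json.dumps(build_embed_payload(texts, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # The response carries one {"embedding": [...]} item per input text.
    return [item["embedding"] for item in body["data"]]
```

Calling `embed_via_vllm(["hello"])` against the server started above should return a single 1024-dimensional vector for bge-large-en-v1.5.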

Configuration

| Environment Variable | Default | Description |
|---|---|---|
| CORTEX_VLLM_URL | http://localhost:8000 | vLLM server URL |
| CORTEX_VLLM_EMBED_MODEL | (none, required) | Embedding model served by vLLM |
| CORTEX_VLLM_CHAT_MODEL | (none) | Chat model for entity extraction |
| CORTEX_VLLM_API_KEY | (none, optional) | API key if vLLM server requires auth |
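The same settings can be collected from the environment in code, applying the documented defaults. The load_vllm_settings helper below is a hypothetical sketch, not part of cortexdb-vllm, which reads its own environment variables internally:

```python
import os

def load_vllm_settings(env=os.environ):
    """Collect CortexDB's vLLM settings, applying the documented defaults."""
    embed_model = env.get("CORTEX_VLLM_EMBED_MODEL")
    if not embed_model:
        # The embedding model has no default and must be set explicitly.
        raise RuntimeError("CORTEX_VLLM_EMBED_MODEL is required")
    return {
        "base_url": env.get("CORTEX_VLLM_URL", "http://localhost:8000"),
        "embed_model": embed_model,
        "chat_model": env.get("CORTEX_VLLM_CHAT_MODEL"),  # optional
        "api_key": env.get("CORTEX_VLLM_API_KEY"),        # optional
    }
```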

Usage Example

from cortexdb_vllm import VLLMEmbeddingProvider, VLLMConfig

config = VLLMConfig(
    base_url="http://localhost:8000",
    embed_model="BAAI/bge-large-en-v1.5",
)

async with VLLMEmbeddingProvider(config=config, dimension=1024) as provider:
    embedding = await provider.embed_query("What is event sourcing?")
    print(f"Dimension: {provider.dimension}")  # 1024

    embeddings = await provider.embed([
        "Event sourcing stores all changes as events.",
        "CQRS separates reads from writes.",
    ])
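The vectors returned by embed and embed_query can then be compared with cosine similarity, the usual metric for embedding search. A minimal stdlib-only sketch (this helper is illustrative; CortexDB handles similarity internally):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

For example, scoring a query embedding against each document embedding and sorting by the result gives a simple nearest-neighbor ranking.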

Self-Hosted Setup

vLLM is self-hosted by design. For production deployments:

# Serve with multiple GPUs
python -m vllm.entrypoints.openai.api_server \
    --model BAAI/bge-large-en-v1.5 \
    --tensor-parallel-size 2 \
    --port 8000 \
    --host 0.0.0.0

# Optional: add API key authentication
python -m vllm.entrypoints.openai.api_server \
    --model BAAI/bge-large-en-v1.5 \
    --api-key your-secret-key \
    --port 8000

Point CortexDB at your vLLM instance:

export CORTEX_VLLM_URL=http://gpu-server:8000
export CORTEX_VLLM_EMBED_MODEL=BAAI/bge-large-en-v1.5
export CORTEX_VLLM_API_KEY=your-secret-key  # if auth is enabled

Switching Providers

To switch CortexDB from the default OpenAI embeddings to vLLM:

from cortexdb_vllm import VLLMEmbeddingProvider

provider = VLLMEmbeddingProvider()  # reads from env vars

All CortexDB embedding providers implement the same interface (embed, embed_query, dimension, model_name), so switching is a one-line change.