Tutorial · April 6, 2026 · 15 min read

Building an AI Companion with Perfect Memory

The difference between a useful AI assistant and a genuinely compelling AI companion is memory. This tutorial walks through building one with OpenAI's GPT-4o and CortexDB.


Prashant Malik

Co-creator of Apache Cassandra and among the first 30 engineers at Facebook.

What makes a good AI companion

Three properties separate a companion from a chatbot:

Personalization

The companion knows that you prefer concise answers, that you are working on a startup, that you have a meeting with investors on Thursday, and that you are allergic to shellfish. It does not ask you to repeat information.

Continuity

A conversation from three months ago is as accessible as one from three minutes ago. The companion can reference past discussions, recall decisions you made, and track how your projects have evolved.

Context

When you say "How's the fundraise going?", the companion knows which fundraise, which investors you have talked to, what feedback they gave, and what your next steps are -- even if that information was spread across a dozen separate conversations.

Most AI assistants fail at all three because they have no memory layer. CortexDB provides one.

Architecture

User Interface  -->  OpenAI GPT-4o  <-->  CortexDB Memory

Flow:
1. User sends message
2. Recall relevant memories from CortexDB
3. Send message + memories to GPT-4o
4. Return response to user
5. Store the conversation turn in CortexDB

The key insight is that CortexDB sits alongside the LLM, not inside it. The LLM handles reasoning and generation. CortexDB handles storage and retrieval. Each does what it is best at.

Step 1: Setup

pip install cortexdbai openai

Step 2: Initialize the clients

from cortexdb import Cortex
from openai import OpenAI

cortex = Cortex(
    base_url="https://api.cortexdb.ai",
    api_key="your-api-key",
)

openai_client = OpenAI(api_key="your-openai-api-key")

# Each user gets their own tenant for complete isolation
USER_ID = "user-alice"

Step 3: The memory-augmented chat function

This is the core of the companion. Every message goes through three steps: recall, generate, remember.

def chat(user_message: str, user_id: str = USER_ID) -> str:
    # Step 1: Recall relevant memories
    memories = cortex.recall(
        query=user_message,
        tenant_id=user_id,
    )

    # Step 2: Build the memory context for the LLM
    memory_context = format_memories(memories)

    # Step 3: Generate a response with memory context
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": SYSTEM_PROMPT + "\n\n" + memory_context
            },
            {
                "role": "user",
                "content": user_message
            }
        ],
        temperature=0.7,
        max_tokens=1024
    )

    assistant_message = response.choices[0].message.content

    # Step 4: Store this conversation turn
    cortex.remember(
        content=f"User: {user_message}\n\nAssistant: {assistant_message}",
        tenant_id=user_id,
    )

    return assistant_message


def format_memories(memories) -> str:
    if not memories.context:
        return "## Memory Context\nNo relevant memories found."
    return f"## Memory Context\n\n{memories.context}"
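To sanity-check the helper without a live CortexDB connection, you can run it against a stub object. The stub below (built with `SimpleNamespace`) only imitates the `.context` attribute the function reads; a real call would pass the object returned by `cortex.recall()`. The helper is repeated here so the snippet is self-contained:

```python
from types import SimpleNamespace

def format_memories(memories) -> str:
    # Same helper as above: wrap the recalled context in a labeled section,
    # or fall back to an explicit "nothing found" marker.
    if not memories.context:
        return "## Memory Context\nNo relevant memories found."
    return f"## Memory Context\n\n{memories.context}"

# Stand-in for a recall result (illustrative content only).
stub = SimpleNamespace(
    context="- User prefers concise answers\n- User is building a recommendation engine"
)

print(format_memories(stub))
print(format_memories(SimpleNamespace(context="")))
```

The explicit "No relevant memories found." fallback matters: it tells the LLM that memory was consulted and came back empty, which discourages it from hallucinating past context.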

Step 4: Conversation loop

def main():
    print("AI Companion (type 'quit' to exit)")
    print("-" * 40)

    while True:
        user_input = input("\nYou: ").strip()
        if user_input.lower() in ("quit", "exit", "q"):
            break

        response = chat(user_input)
        print(f"\nCompanion: {response}")


if __name__ == "__main__":
    main()

Step 5: See it in action

Here is a multi-session interaction showing the companion's memory at work:

Session 1 (Monday morning)

You: I'm starting a new project at work. We're building a recommendation
     engine for our e-commerce platform. Using Python and TensorFlow.

Companion: That sounds like a great project. A recommendation engine can
have a huge impact on conversion rates. Are you looking at collaborative
filtering, content-based filtering, or a hybrid approach? And what kind
of data volume are you working with?

You: We're going with a hybrid approach. We have about 50M user
     interactions and 2M products. The tricky part is real-time inference
     -- we need sub-100ms latency at the API layer.

Companion: 50M interactions across 2M products is a solid dataset for
hybrid filtering. For sub-100ms inference, you'll want to look at
TensorFlow Serving with model warmup. Have you considered using an
approximate nearest neighbor index (like ScaNN or FAISS) for candidate
generation? That's usually where the latency budget gets eaten up.

Session 2 (Wednesday afternoon, separate session)

You: We ran into a problem with the recommendation engine.

Companion: What's happening with the hybrid recommendation engine? Is it
a latency issue with the real-time inference, or something with the model
training on the 50M interaction dataset?

You: Latency. We're at 200ms and need to get to 100ms. The candidate
     generation step is the bottleneck, just like you mentioned.

Companion: Right, candidate generation is usually the bottleneck in
hybrid systems at your scale. Since you mentioned TensorFlow earlier:

1. ScaNN (Google's ANN library) -- integrates well with TensorFlow and
   handles your 2M product catalog easily.

2. Two-tower architecture -- train separate user and item embedding
   towers, precompute item embeddings, and do ANN lookup at serving time.

3. Caching -- if your product catalog doesn't change frequently, cache
   the top-K candidates per user segment.

Session 3 (Two weeks later)

You: Update on the recommendation engine -- we went with ScaNN and got
     latency down to 45ms. Ship date is April 1st.

Companion: That's a great result -- from 200ms down to 45ms is more than
a 4x improvement, well under your 100ms target. ScaNN was a good choice
for your 2M product catalog. April 1st ship date noted. Is there
anything else you need to prepare before launch?

The companion remembered the project, the tech stack, the data scale, the latency requirement, and its own earlier suggestions. All context was retrieved from CortexDB and injected into the GPT-4o prompt.

Advanced: Preference learning

The companion can track preferences by storing them as explicit memories when detected:

def detect_and_store_preferences(user_message, assistant_response, user_id):
    detection = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """Analyze this conversation turn and extract any
user preferences. Return a JSON object with a "preferences" key holding an
array of preferences (an empty array if none are detected).
Format: {"preferences": [{"category": "...", "preference": "...", "confidence": 0.0-1.0}]}"""
            },
            {
                "role": "user",
                "content": f"User: {user_message}\nAssistant: {assistant_response}"
            }
        ],
        temperature=0.0,
        response_format={"type": "json_object"}
    )

    import json
    result = json.loads(detection.choices[0].message.content)
    preferences = result.get("preferences", [])

    for pref in preferences:
        if pref["confidence"] >= 0.7:
            cortex.remember(
                content=f"User preference: {pref['preference']}",
                tenant_id=user_id,
            )
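The filtering step above can be exercised without an API call. The snippet below feeds a hand-written payload, in the shape the detector is prompted to return, through the same confidence threshold; the specific preferences are invented for illustration:

```python
import json

# Example payload in the shape the preference detector is prompted to
# return. The entries themselves are made up for this example.
sample = json.dumps({
    "preferences": [
        {"category": "communication", "preference": "Prefers concise answers", "confidence": 0.9},
        {"category": "diet", "preference": "Might be vegetarian", "confidence": 0.4},
    ]
})

result = json.loads(sample)
# Keep only high-confidence preferences, mirroring the 0.7 cutoff above.
kept = [p for p in result.get("preferences", []) if p["confidence"] >= 0.7]

for p in kept:
    print(f"User preference: {p['preference']}")
```

Only the 0.9-confidence entry survives the cutoff; the speculative 0.4-confidence guess is discarded rather than written into long-term memory.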

Advanced: Proactive context

The most powerful feature of a memory-equipped companion is proactive context -- surfacing relevant information the user did not explicitly ask for:

def chat_with_proactive_context(user_message, user_id=USER_ID):
    # Recall memories for the direct query
    direct_memories = cortex.recall(
        query=user_message,
        tenant_id=user_id,
    )

    # Also recall potentially relevant context
    proactive_memories = cortex.recall(
        query=f"upcoming deadlines or context related to: {user_message}",
        tenant_id=user_id,
    )

    memory_context = format_memories(direct_memories)
    proactive_context = format_proactive(proactive_memories)

    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": SYSTEM_PROMPT + "\n\n" + memory_context
                    + "\n\n" + proactive_context
            },
            {"role": "user", "content": user_message}
        ],
        temperature=0.7,
        max_tokens=1024
    )

    assistant_message = response.choices[0].message.content

    cortex.remember(
        content=f"User: {user_message}\n\nAssistant: {assistant_message}",
        tenant_id=user_id,
    )

    return assistant_message

Now when the user says "I'm preparing for a meeting tomorrow", the companion can proactively mention: "By the way, you mentioned last week that you wanted to bring up the recommendation engine latency results. The ScaNN integration got you down to 45ms -- that might be worth sharing."

Production considerations

Tenant isolation

Each user gets their own tenant_id. CortexDB provides tenant-level isolation, so one user's memories are never accessible to another.
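One way to enforce this in application code is to derive the tenant ID from the authenticated user in a single place, so no code path can accidentally pass the wrong one. A minimal sketch, where the `"user-{id}"` naming scheme is an assumption rather than a CortexDB requirement (any string that is unique per user works):

```python
# Derive the tenant ID from the authenticated user in one place, so every
# recall/remember call is scoped consistently. The "user-{id}" scheme is
# just a convention for this example.
def tenant_for(user_id: str) -> str:
    return f"user-{user_id}"

# Every memory operation would then use this helper, e.g.:
#   cortex.recall(query=msg, tenant_id=tenant_for("alice"))
#   cortex.remember(content=turn, tenant_id=tenant_for("alice"))

print(tenant_for("alice"))
```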

Latency budget

The memory-augmented chat adds one network round-trip to CortexDB (typically 10-50ms) before the OpenAI API call. Relative to the time-to-first-token of the LLM, which is usually several hundred milliseconds, this overhead is rarely noticeable to the user.

Cost

CortexDB adds no LLM cost on the write path. The only additional cost is the extra tokens in the prompt from memory context (~500-2,000 tokens per query). At GPT-4o pricing, this is approximately $0.005-$0.02 per query.

Memory pruning

Over months of use, the episode count grows. CortexDB handles this at query time by selecting the most relevant results. For storage management, use the forget API to remove old memories.
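A simple pruning policy is time-based: select episodes older than some cutoff and pass them to the forget API. The sketch below implements the selection logic only; the episode shape (a dict with a `created_at` timestamp) and the commented-out `cortex.forget(...)` call are assumptions, so consult the CortexDB API reference for the actual schema and signature:

```python
from datetime import datetime, timedelta, timezone

def select_stale(episodes, max_age_days=180, now=None):
    # Return episodes older than the cutoff. The dict shape here is an
    # assumption for illustration, not the real CortexDB episode schema.
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [e for e in episodes if e["created_at"] < cutoff]

now = datetime(2026, 4, 6, tzinfo=timezone.utc)
episodes = [
    {"id": "ep-1", "created_at": datetime(2025, 9, 1, tzinfo=timezone.utc)},
    {"id": "ep-2", "created_at": datetime(2026, 3, 1, tzinfo=timezone.utc)},
]
stale = select_stale(episodes, max_age_days=180, now=now)

# The selected episodes would then be removed via the forget API, e.g.:
# for e in stale:
#     cortex.forget(e["id"], tenant_id=user_id)  # hypothetical signature
```

With a 180-day window measured from April 6, 2026, only `ep-1` (September 2025) falls past the cutoff.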

Try CortexDB today

Build your own AI companion in under 5 minutes. Free tier available -- no credit card required.