A step-by-step guide to building an AI companion that remembers everything -- preferences, past conversations, decisions, and context -- using OpenAI and CortexDB.
Building an AI Companion with Perfect Memory
The difference between a useful AI assistant and a genuinely compelling AI companion is memory. An assistant answers questions. A companion remembers your preferences, recalls past conversations, learns your communication style, and builds context over time.
This tutorial walks through building an AI companion with persistent, accurate, long-term memory using OpenAI's GPT-4o and CortexDB. By the end, you will have a companion that:
- Remembers every conversation across sessions
- Tracks user preferences and learns from corrections
- Builds connected context about people, projects, and decisions in the user's life
- Answers questions that require reasoning across months of context
What Makes a Good AI Companion
Three properties separate a companion from a chatbot:
Personalization. The companion knows that you prefer concise answers, that you are working on a startup, that you have a meeting with investors on Thursday, and that you are allergic to shellfish. It does not ask you to repeat information.
Continuity. A conversation from three months ago is as accessible as one from three minutes ago. The companion can reference past discussions, recall decisions you made, and track how your projects have evolved.
Context. When you say "How's the fundraise going?", the companion knows which fundraise, which investors you have talked to, what feedback they gave, and what your next steps are -- even if that information was spread across a dozen separate conversations.
Most AI assistants fail at all three because they have no memory layer. CortexDB provides one.
Architecture
┌──────────────────────────────────────────────────┐
│ Your Application │
│ │
│ ┌────────────┐ ┌──────────┐ ┌───────────┐ │
│ │ User │ │ OpenAI │ │ CortexDB │ │
│ │ Interface │───>│ GPT-4o │<──>│ Memory │ │
│ │ │<───│ │ │ Layer │ │
│ └────────────┘ └──────────┘ └───────────┘ │
│ │
│ Flow: │
│ 1. User sends message │
│ 2. Recall relevant memories from CortexDB │
│ 3. Send message + memories to GPT-4o │
│ 4. Return response to user │
│ 5. Store the conversation turn in CortexDB │
└──────────────────────────────────────────────────┘
The key insight is that CortexDB sits alongside the LLM, not inside it. The LLM handles reasoning and generation. CortexDB handles storage and retrieval. Each does what it is best at.
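The flow in the diagram can be sketched as a recall → generate → remember loop, independent of any particular LLM or memory backend. The `MemoryStore` and `generate` below are hypothetical stand-ins for CortexDB and GPT-4o, using naive keyword matching in place of real semantic retrieval:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    """Hypothetical stand-in for a memory layer like CortexDB."""
    episodes: list = field(default_factory=list)

    def recall(self, query: str) -> list:
        # Naive keyword overlap; a real memory layer would use
        # semantic retrieval instead.
        words = set(query.lower().split())
        return [e for e in self.episodes if words & set(e.lower().split())]

    def remember(self, content: str) -> None:
        self.episodes.append(content)

def generate(message: str, memories: list) -> str:
    # Stand-in for the LLM call; real code would call GPT-4o here.
    return f"(reply to {message!r} using {len(memories)} memories)"

def chat_turn(store: MemoryStore, message: str) -> str:
    memories = store.recall(message)                          # 1-2. recall
    reply = generate(message, memories)                       # 3-4. generate
    store.remember(f"User: {message}\nAssistant: {reply}")    # 5. remember
    return reply
```

The important structural point survives the simplification: each turn reads from memory before generation and writes back after, so context accumulates across turns without the LLM itself holding any state.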
Step 1: Setup
Install the dependencies:
pip install cortexdb openai
Start CortexDB locally (or use the cloud API):
docker run -p 8080:8080 cortexdb/cortexdb:latest
Step 2: Initialize the Clients
from cortexdb import Cortex
from openai import OpenAI
from datetime import datetime, timezone
cortex = Cortex(
    base_url="http://localhost:8080",
)
openai_client = OpenAI(api_key="your-openai-api-key")
# Each user gets their own tenant for complete isolation
USER_ID = "user-alice"
Step 3: The Memory-Augmented Chat Function
This is the core of the companion. Every message goes through three steps: recall, generate, remember.
def chat(user_message: str, user_id: str = USER_ID) -> str:
    """
    Send a message to the AI companion with full memory context.
    """
    # Step 1: Recall relevant memories
    memories = cortex.recall(
        query=user_message,
        tenant_id=user_id,
    )

    # Step 2: Build the memory context for the LLM
    memory_context = format_memories(memories)

    # Step 3: Generate a response with memory context
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": SYSTEM_PROMPT + "\n\n" + memory_context
            },
            {
                "role": "user",
                "content": user_message
            }
        ],
        temperature=0.7,
        max_tokens=1024
    )
    assistant_message = response.choices[0].message.content

    # Step 4: Store this conversation turn as a new memory
    cortex.remember(
        content=f"User: {user_message}\n\nAssistant: {assistant_message}",
        tenant_id=user_id,
        metadata={
            "source": "companion",
            "user_message": user_message,
            "assistant_message": assistant_message,
        },
    )

    return assistant_message
SYSTEM_PROMPT = """You are a personal AI companion with perfect memory. You have
access to the full history of your conversations with this user. Use the provided
memory context to personalize your responses, reference past conversations when
relevant, and maintain continuity across sessions.
Rules:
- Reference past conversations naturally, not mechanically ("As you mentioned
last week..." not "According to memory ID 4521...")
- If you remember something relevant that the user hasn't brought up, mention it
when appropriate
- Track preferences and apply them (communication style, interests, goals)
- If memories contain contradictory information, note the most recent version
but acknowledge the change
- Never fabricate memories. If you don't remember something, say so.
"""
def format_memories(memories) -> str:
    """Format recalled memories as context for the LLM."""
    if not memories.context:
        return "## Memory Context\nNo relevant memories found."
    return f"## Memory Context\n\n{memories.context}"
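Because `format_memories` only depends on a `.context` attribute, it can be exercised without a running CortexDB instance. Here `SimpleNamespace` is a hypothetical stand-in for the object `cortex.recall` returns:

```python
from types import SimpleNamespace

def format_memories(memories) -> str:
    """Format recalled memories as context for the LLM."""
    if not memories.context:
        return "## Memory Context\nNo relevant memories found."
    return f"## Memory Context\n\n{memories.context}"

empty = SimpleNamespace(context="")
hit = SimpleNamespace(context="User prefers concise answers.")

no_context = format_memories(empty)   # fallback header, no memories
with_context = format_memories(hit)   # header plus recalled text
```

Keeping the formatter decoupled from the client like this makes the memory layer easy to mock in unit tests.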
Step 4: Conversation Loop
def main():
    print("AI Companion (type 'quit' to exit)")
    print("-" * 40)

    while True:
        user_input = input("\nYou: ").strip()
        if user_input.lower() in ("quit", "exit", "q"):
            break
        response = chat(user_input)
        print(f"\nCompanion: {response}")

if __name__ == "__main__":
    main()
Step 5: See It in Action
Here is a multi-session interaction showing the companion's memory at work:
Session 1 (Monday morning)
You: I'm starting a new project at work. We're building a recommendation engine
for our e-commerce platform. Using Python and TensorFlow.
Companion: That sounds like a great project. A recommendation engine can have a
huge impact on conversion rates. Are you looking at collaborative filtering,
content-based filtering, or a hybrid approach? And what kind of data volume are
you working with -- that'll influence your TensorFlow architecture choices.
You: We're going with a hybrid approach. We have about 50M user interactions and
2M products. The tricky part is real-time inference -- we need sub-100ms
latency at the API layer.
Companion: 50M interactions across 2M products is a solid dataset for hybrid
filtering. For sub-100ms inference, you'll want to look at TensorFlow Serving
with model warmup and batched predictions. Have you considered using an
approximate nearest neighbor index (like ScaNN or FAISS) for the candidate
generation step? That's usually where the latency budget gets eaten up.
Session 2 (Wednesday afternoon, separate session)
You: We ran into a problem with the recommendation engine.
Companion: What's happening with the hybrid recommendation engine? Is it a
latency issue with the real-time inference, or something with the model training
on the 50M interaction dataset?
You: Latency. We're at 200ms and need to get to 100ms. The candidate generation
step is the bottleneck, just like you mentioned.
Companion: Right, candidate generation is usually the bottleneck in hybrid
systems at your scale. Since you mentioned TensorFlow earlier, here are a few
approaches:
1. **ScaNN** (Google's ANN library) -- integrates well with TensorFlow and
handles your 2M product catalog easily. You'd generate candidates in ~10ms
and spend the rest of the budget on the ranking model.
2. **Two-tower architecture** -- train separate user and item embedding towers,
precompute item embeddings, and do ANN lookup at serving time. This moves
most computation offline.
3. **Caching** -- if your product catalog doesn't change frequently, cache the
top-K candidates per user segment and only run the full pipeline on cache
misses.
What does your current candidate generation look like?
The companion remembered the project (recommendation engine), the tech stack (Python, TensorFlow), the data scale (50M interactions, 2M products), the latency requirement (sub-100ms), and its own earlier suggestion about ANN indexes. All of this context was retrieved from CortexDB and injected into the GPT-4o prompt.
Session 3 (Two weeks later)
You: Update on the recommendation engine -- we went with ScaNN and got latency
down to 45ms. Ship date is April 1st.
Companion: That's a great result -- from 200ms down to 45ms is more than a 4x
improvement, well under your 100ms target. ScaNN was a good choice for your 2M
product catalog. April 1st ship date noted. Is there anything else you need to
prepare before launch, like A/B testing infrastructure or a fallback strategy?
Advanced: Preference Learning
The companion can track preferences by storing them as explicit episodes when detected:
import json

def detect_and_store_preferences(user_message: str, assistant_response: str, user_id: str):
    """
    Use the LLM to detect preferences expressed in the conversation
    and store them explicitly.
    """
    detection = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """Analyze this conversation turn and extract any user
preferences expressed (communication style, interests, goals,
likes/dislikes, scheduling preferences, etc.).

Return a JSON object with a "preferences" key containing an array
(empty if none detected). Format:
{"preferences": [{"category": "...", "preference": "...", "confidence": 0.0-1.0}]}"""
            },
            {
                "role": "user",
                "content": f"User: {user_message}\nAssistant: {assistant_response}"
            }
        ],
        temperature=0.0,
        response_format={"type": "json_object"}
    )

    result = json.loads(detection.choices[0].message.content)
    preferences = result.get("preferences", [])

    for pref in preferences:
        if pref["confidence"] >= 0.7:
            cortex.remember(
                content=f"User preference: {pref['preference']}",
                tenant_id=user_id,
                metadata={
                    "source": "companion",
                    "category": pref["category"],
                    "confidence": pref["confidence"],
                },
            )
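The parsing and confidence-threshold logic can be verified against a sample model response without calling the API. The payload below is fabricated for illustration:

```python
import json

# Fabricated example of what the detection call might return.
sample = '''{"preferences": [
  {"category": "communication", "preference": "Prefers concise answers", "confidence": 0.9},
  {"category": "scheduling", "preference": "Might prefer mornings", "confidence": 0.4}
]}'''

result = json.loads(sample)
# Only store preferences the model is reasonably confident about.
kept = [p for p in result.get("preferences", []) if p["confidence"] >= 0.7]
assert [p["preference"] for p in kept] == ["Prefers concise answers"]
```

The 0.7 threshold is a tunable trade-off: lower values capture more preferences but store more noise.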
Over time, CortexDB accumulates a rich preference profile that the companion can query:
# Retrieve user preferences for a specific context
preferences = cortex.recall(
    query="user communication and scheduling preferences",
    tenant_id=user_id,
)
Advanced: Multi-Modal Memory
CortexDB episodes can store metadata pointing to external resources -- images, documents, code snippets -- alongside text descriptions:
# Store a memory about an image the user shared
cortex.remember(
content="User shared a photo of their whiteboard with the system architecture diagram. Shows three microservices: auth, payments, and recommendations. Auth talks to payments via gRPC. Recommendations has a Redis cache layer.",
tenant_id=user_id,
metadata={
"source": "companion",
"media_type": "image",
"media_url": "s3://companion-media/user-alice/whiteboard-2026-03-15.jpg",
"description": "System architecture whiteboard diagram",
},
)
When the user later asks "What did my architecture diagram look like?", the companion retrieves the episode and can describe the architecture from the stored content.
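Because each episode carries metadata, the application can also filter recalled episodes client-side, for example to surface only the media the user shared. A minimal sketch, with plain dicts standing in for recalled episodes:

```python
# Plain dicts standing in for episodes recalled from the memory layer.
episodes = [
    {"content": "Whiteboard photo: auth, payments, recommendations services.",
     "metadata": {"media_type": "image",
                  "media_url": "s3://companion-media/example.jpg"}},
    {"content": "User: How's the fundraise going?",
     "metadata": {"source": "companion"}},
]

def images_only(episodes: list) -> list:
    """Keep only episodes whose metadata marks them as images."""
    return [e for e in episodes if e["metadata"].get("media_type") == "image"]
```

This keeps retrieval semantic (via recall) while letting the application apply deterministic filters on top.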
Advanced: Proactive Context
The most powerful feature of a memory-equipped companion is proactive context -- surfacing relevant information the user did not explicitly ask for:
def chat_with_proactive_context(user_message: str, user_id: str = USER_ID) -> str:
    """Enhanced chat that proactively surfaces relevant context."""
    # Recall memories for the direct query
    direct_memories = cortex.recall(
        query=user_message,
        tenant_id=user_id,
    )

    # Also recall potentially relevant context the user hasn't asked about
    proactive_memories = cortex.recall(
        query=f"upcoming deadlines, reminders, or important context related to: {user_message}",
        tenant_id=user_id,
    )

    memory_context = format_memories(direct_memories)
    proactive_context = format_proactive(proactive_memories)

    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": SYSTEM_PROMPT + "\n\n" + memory_context + "\n\n" + proactive_context
            },
            {
                "role": "user",
                "content": user_message
            }
        ],
        temperature=0.7,
        max_tokens=1024
    )
    assistant_message = response.choices[0].message.content

    cortex.remember(
        content=f"User: {user_message}\n\nAssistant: {assistant_message}",
        tenant_id=user_id,
        metadata={"source": "companion"},
    )

    return assistant_message
def format_proactive(memories) -> str:
    """Format proactive context hints for the LLM."""
    if not memories.context:
        return ""
    return f"## Proactive Context (mention ONLY if directly relevant to the user's message)\n\n{memories.context}"
Now when the user says "I'm preparing for a meeting tomorrow", the companion can proactively mention: "By the way, you mentioned last week that you wanted to bring up the recommendation engine latency results in your next team meeting. The ScaNN integration got you down to 45ms -- that might be worth sharing."
Production Considerations
Tenant Isolation
Each user gets their own tenant_id. CortexDB provides tenant-level isolation, so one user's memories are never accessible to another.
Memory Pruning
Over months of use, the episode count grows. CortexDB handles this at query time by selecting the most relevant results. For storage management, you can use the forget API:
# Remove old memories (optional)
cortex.forget(
query="conversations older than 1 year",
tenant_id=user_id,
)
Latency Budget
The memory-augmented chat adds one network round-trip to CortexDB (typically 10-50ms) before the OpenAI API call. The recall must complete before the prompt is assembled, but at 10-50ms it is small relative to typical LLM response times, so perceived latency remains dominated by the model.
Cost
CortexDB adds no LLM cost on the write path. The only additional LLM cost is the extra tokens in the prompt from memory context (~500-2,000 tokens per query depending on the amount of context). At GPT-4o pricing, this is approximately $0.005-$0.02 per query for the additional context.
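The per-query overhead is simple arithmetic over the injected token count. A small helper makes the estimate explicit; the price used below is a placeholder that happens to match the figures above, so check current OpenAI pricing before relying on it:

```python
def context_cost(extra_tokens: int, usd_per_million_input_tokens: float) -> float:
    """Marginal cost of the memory context injected into the prompt."""
    return extra_tokens * usd_per_million_input_tokens / 1_000_000

# With an assumed $10 per 1M input tokens (placeholder, not current pricing):
low = context_cost(500, 10.0)    # minimal memory context
high = context_cost(2000, 10.0)  # heavy memory context
```

Multiplying by expected queries per user per day gives a quick budget estimate for the memory feature.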
Complete Example
The full working example is available as a cookbook:
pip install cortexdb
cd examples/ai-companion
pip install -r requirements.txt
python companion.py