gpt-realtime-1.5` model handles simultaneous STT and TTS with built-in voice activity detection (VAD). This eliminates the need for separate Whisper/TTS pipelines and reduces round-trip latency by processing audio chunks incrementally.
3. Knowledge Grounding (Supermemory): Technical documentation requires precision. Supermemory provides a managed RAG pipeline that supports hybrid search (semantic embeddings + BM25 keyword matching) and cross-encoder reranking. This combination drastically reduces hallucination rates on proprietary APIs and internal conventions.
4. Visual Presence (Anam): Voice-only agents suffer from "presence ambiguity" in group calls. Anam renders a real-time animated avatar with lip-sync and gesture mapping, converting audio output into synchronized video streams that integrate naturally into existing meeting workflows.
Implementation Structure
The following implementation replaces functional agent registration with a class-based orchestrator pattern. This improves testability, enables explicit lifecycle hooks, and isolates RAG function registration from transport logic.
import os
import asyncio
from typing import Optional
from dataclasses import dataclass
from dotenv import load_dotenv
from vision_agents.core import Agent, AgentLauncher, Runner
from vision_agents.core.edge.types import User
from vision_agents.plugins import getstream, openai
from vision_agents.plugins.anam import AnamAvatarPublisher
load_dotenv()
@dataclass
class AgentConfig:
model_id: str = "gpt-realtime-1.5"
voice_profile: str = "ash"
system_prompt: str = (
"You are a technical documentation assistant. "
"Answer questions using only retrieved context. "
"If information is unavailable, state that clearly."
)
class KnowledgeVoiceAgent:
def __init__(self, config: Optional[AgentConfig] = None):
self.config = config or AgentConfig()
self._agent: Optional[Agent] = None
async def _initialize_transport(self) -> Agent:
"""Configure WebRTC edge, speech model, and avatar processor."""
return Agent(
edge=getstream.Edge(),
agent_user=User(name="TechDocs Voice", id="voice-assistant"),
instructions=self.config.system_prompt,
llm=openai.Realtime(
model=self.config.model_id,
voice=self.config.voice_profile
),
processors=[AnamAvatarPublisher()]
)
async def _register_knowledge_function(self, agent: Agent) -> None:
"""Attach RAG retrieval capability to the Realtime session."""
@agent.function()
async def retrieve_documentation(query: str) -> dict:
"""Search indexed documentation and return top relevant passages."""
# Supermemory API call would be injected here
# In production, use async HTTP client with retry/backoff
return {
"status": "success",
"context_chunks": await self._query_memory_store(query),
"source_count": 3
}
async def _query_memory_store(self, query: str) -> list[str]:
"""Placeholder for Supermemory hybrid search + reranking pipeline."""
# Production implementation:
# 1. Encode query with embedding model
# 2. Execute BM25 + vector search
# 3. Apply Reciprocal Rank Fusion
# 4. Rerank with cross-encoder (rerank=True)
# 5. Return top-k chunks
return ["Retrieved technical passage 1", "Retrieved technical passage 2"]
async def create_agent(self, **kwargs) -> Agent:
"""Factory method for Vision Agents launcher."""
self._agent = await self._initialize_transport()
await self._register_knowledge_function(self._agent)
return self._agent
async def join_session(self, agent: Agent, session_type: str, session_id: str, **kwargs) -> None:
"""Manage WebRTC session lifecycle with automatic cleanup."""
session = await agent.create_call(session_type, session_id)
async with agent.join(session):
await agent.finish()
if __name__ == "__main__":
launcher = AgentLauncher(
create_agent=KnowledgeVoiceAgent().create_agent,
join_call=KnowledgeVoiceAgent().join_session
)
Runner(launcher).cli()
Why This Structure Works
- Explicit Lifecycle Management: The
async with agent.join(session) pattern guarantees WebRTC channel teardown, preventing orphaned media streams that consume edge resources.
- Function Isolation: RAG retrieval is registered as a discrete tool rather than baked into the system prompt. This allows the Realtime API to route queries deterministically, reducing token waste and improving response grounding.
- Processor Pipeline: Anam's avatar publisher operates as a stream processor, intercepting audio output, generating synchronized video frames, and republishing to the Stream room. This decouples visual rendering from speech synthesis.
- Configuration Separation:
AgentConfig centralizes model selection, voice profiles, and prompt templates, enabling environment-specific overrides without code changes.
Pitfall Guide
1. Ignoring WebRTC NAT Traversal Requirements
Explanation: WebRTC relies on ICE candidates for peer connectivity. Without proper TURN server configuration, agents will fail to join sessions behind corporate firewalls or carrier-grade NAT.
Fix: Stream provides managed TURN routing, but verify fallback chains. Implement explicit iceTransportPolicy: "relay" in production deployments to guarantee connectivity.
2. Relying Solely on Semantic Search for Technical Docs
Explanation: Embedding models struggle with exact matches for function names, error codes, and configuration keys. Pure vector search returns semantically similar but technically irrelevant chunks.
Fix: Implement hybrid search combining BM25 keyword matching with dense embeddings. Fuse results using Reciprocal Rank Fusion (RRF) to balance precision and recall.
3. Skipping Cross-Encoder Re-ranking
Explanation: Initial retrieval often returns 10β20 candidate chunks. Feeding all of them to the LLM wastes context window and introduces noise.
Fix: Always apply a cross-encoder reranker as a second-stage filter. Supermemory's rerank=True flag enables this automatically. Limit final context to top-3 passages to maintain latency budgets.
4. Unbounded Audio Context Windows
Explanation: Real-time voice agents accumulate conversation history. Without truncation, context windows overflow, causing latency spikes and degraded response quality.
Fix: Implement a sliding window with VAD-based segmentation. Retain only the last 3β5 turns plus retrieved RAG context. Flush stale metadata after session idle timeouts.
5. Hardcoding Knowledge Sources
Explanation: Documentation changes frequently. Static indexes become stale within days, causing the agent to return outdated deployment steps or deprecated API signatures.
Fix: Build a webhook-driven indexing pipeline. Trigger Supermemory re-ingestion on Git commits, CMS updates, or scheduled crawls. Maintain versioned knowledge bases for rollback capability.
6. Violating Latency Budgets with Heavy Pre-processing
Explanation: Running complex audio normalization, custom STT models, or synchronous HTTP calls before routing to the Realtime API breaks the sub-second interaction loop.
Fix: Stream audio chunks directly to OpenAI's Realtime endpoint. Perform RAG queries asynchronously in parallel with speech generation. Use non-blocking I/O and connection pooling.
7. Avatar Desynchronization
Explanation: Lip-sync and gesture rendering require strict audio-to-video frame alignment. Buffer underruns or variable network jitter cause visible desync, breaking user trust.
Fix: Maintain a 200β300ms audio buffer before feeding to the avatar processor. Monitor jitter metrics and implement adaptive playout delay. Anam's processor handles most alignment, but verify frame rate consistency.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| Internal engineering team < 50 | Real-Time Voice RAG with Anam avatar | Reduces SME interruptions, scales onboarding, maintains flow state | Medium (API + WebRTC egress) |
| Public customer support portal | Async Text RAG with voice fallback | Lower latency tolerance, higher volume, cost-sensitive | Low (text-only RAG) |
| Compliance-heavy documentation | Hybrid Search + Strict Reranking + Audit Logging | Prevents hallucination on regulated content, enables traceability | High (reranker compute + storage) |
| Multi-language technical docs | OpenAI Realtime + Language-Specific Embeddings | Preserves technical accuracy across locales, reduces translation drift | Medium-High (multilingual model costs) |
Configuration Template
# .env.production
# Stream WebRTC Transport
STREAM_API_KEY="sk_live_..."
STREAM_API_SECRET="..."
# OpenAI Realtime Speech I/O
OPENAI_API_KEY="sk-proj-..."
OPENAI_REALTIME_MODEL="gpt-realtime-1.5"
OPENAI_VOICE_PROFILE="ash"
# Anam Avatar Rendering
ANAM_API_KEY="..."
ANAM_AVATAR_ID="..."
# Supermemory RAG Pipeline
SUPERMEMORY_API_KEY="..."
SUPERMEMORY_INDEX_ID="prod-docs-v2"
SUPERMEMORY_RERANK_ENABLED="true"
SUPERMEMORY_TOP_K="3"
# Agent Runtime
AGENT_SESSION_TIMEOUT="300"
AGENT_CONTEXT_WINDOW_TURNS="5"
LOG_LEVEL="INFO"
Quick Start Guide
- Initialize Project: Create a Python 3.10+ environment and install dependencies:
uv add "vision-agents[anam,getstream,openai]" python-dotenv supermemory
- Configure Credentials: Populate
.env with Stream, OpenAI, Anam, and Supermemory keys. Ensure your OpenAI org has Realtime API access enabled.
- Index Documentation: Upload your API references, runbooks, and internal wikis to Supermemory. Enable hybrid search and reranking in the dashboard.
- Launch Agent: Run
python agent.py run --session-type group --session-id dev-standup. The agent joins the WebRTC room, renders the Anam avatar, and begins listening for voice queries.
- Validate Pipeline: Ask a technical question. Verify that the agent retrieves chunks from Supermemory, reranks them, and responds with grounded audio within 800ms. Monitor latency metrics and adjust context window settings as needed.