-----------v-----------+
| ASTERISK PBX |
| |
| 1. Answer() |
| 2. AGI(setup.agi) |
| - Generate UUID |
| - Write metadata |
| 3. AudioSocket( |
| 127.0.0.1:9099) |
+-----------+------------+
|
TCP (AudioSocket Protocol)
8kHz 16-bit PCM, 20ms frames
|
+-----------v------------+
| PYTHON VOICE AGENT |
| (asyncio TCP server) |
| |
| +-------------------+ | +------------------+
Caller | | Audio Reader | | | Deepgram Nova-3 |
speaks | | - Read PCM frames |--------> Streaming STT |
| | - Barge-in VAD | | | (WebSocket) |
| +-------------------+ | +--------+---------+
| | |
| | Transcript
| | |
| +-------------------+ | +--------v---------+
| | Conversation | | | Groq Llama 3.3 |
| | Manager |<------| 70B specdec |
| | - State machine | | | Streaming LLM |
| | - Message history | | | (HTTP SSE) |
| | - Tool calls | | +--------+---------+
| +-------------------+ | |
| | Token stream
| | |
| +-------------------+ | +--------v---------+
Caller | | Audio Writer | | | Cartesia Sonic-3 |
hears <-----| - Queue playback |<------| Streaming TTS |
| | - 20ms pacing | | | (WebSocket) |
| +-------------------+ | +------------------+
| |
+-------------------------+
Latency Budget (target: <250ms mouth-to-ear):
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Deepgram STT final transcript: ~150-200ms β
β Groq LLM first token (TTFT): ~30-50ms β
β Cartesia TTS first audio (TTFB): ~50-80ms β
β AudioSocket frame transmission: ~20ms (1 frame) β
β β
β TOTAL: ~200-250ms β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
**Data Flow Summary:**
1. Caller audio arrives at Asterisk as RTP, converted to raw PCM via AudioSocket.
2. Python agent streams PCM frames to Deepgram Nova-3 WebSocket.
3. Deepgram returns fragments; `speech_final` triggers complete utterance processing.
4. Conversation history + utterance sent to Groq Llama 3.3 70B with streaming enabled.
5. **Concurrent Token-Streaming**: Each LLM token immediately routes to Cartesia Sonic-3 continuation API.
6. Cartesia returns PCM chunks, paced at 20ms intervals, written back through AudioSocket to Asterisk.
7. Asterisk converts PCM to RTP and delivers to caller.
### Component Selection Rationale
- **STT (Deepgram Nova-3)**: Purpose-built for real-time streaming. `endpointing` controls silence detection, `speech_final` marks utterance completion, and keyword boosting (`keywords=postcode:2`) improves domain accuracy. British English model (`language=en-GB`) handles regional accents reliably.
- **LLM (Groq Llama 3.3 70B specdec)**: Speculative decoding uses a draft model to predict tokens, verified in parallel by the 70B model. Delivers 1,665 tok/s with 30-50ms TTFT, outperforming standard variants by 6x while maintaining large-model reasoning for objection handling and workflow progression.
- **TTS (Cartesia Sonic-3)**: Native 8kHz PCM (`pcm_s16le`) eliminates resampling. Continuation API enables true token-streaming over a single WebSocket context, maintaining audio continuity without context-switching overhead.
### Implementation Prerequisites & Verification
```bash
pip install websockets aiohttp cartesia
asterisk -rx "module show like audiosocket"
Expected output:
Module Description Use Count Status
res_audiosocket.so AudioSocket support 0 Running
app_audiosocket.so AudioSocket application 0 Running
2 modules loaded
If modules are missing:
asterisk -rx "module load res_audiosocket.so"
asterisk -rx "module load app_audiosocket.so"
Persist in /etc/asterisk/modules.conf:
load = res_audiosocket.so
load = app_audiosocket.so
Pipeline & State Management
- Barge-In Handling: VAD monitors incoming PCM for energy spikes during TTS playback. On detection, the audio writer queue is flushed, Cartesia context is terminated, and Groq receives an interruption signal to reset generation.
- Conversation State Machine: 8-step workflow (greet β understand β quote β collect details β book β confirm β close). State transitions are triggered by STT confidence thresholds and tool call completions.
- Tool Calling Integration: Groq outputs structured JSON for external booking APIs. The agent parses tool calls, executes
aiohttp requests, and injects results back into the conversation context without breaking the streaming pipeline.
- DID-to-Company Context API: Inbound DID routing queries a lightweight context API to inject company-specific greetings, pricing tiers, and service areas into the system prompt dynamically.
Latency Optimization Deep Dive
- Frame Pacing: AudioSocket requires strict 20ms PCM frame transmission. Bursting causes Asterisk buffer underruns. The audio writer uses a token bucket algorithm to maintain exact pacing.
- Asyncio Event Loop: All I/O (WebSocket STT/TTS, HTTP LLM, TCP AudioSocket) runs non-blocking. CPU-bound tasks are offloaded to thread pools to prevent loop starvation.
- Endpointing Tuning: Deepgram
endpointing set to 400ms for conversational pacing. Aggressive values (<200ms) cause false cuts; lenient values (>600ms) add dead air.
Pitfall Guide
- Ignoring Barge-In/VAD Synchronization: Failing to flush the TTS queue and terminate the Cartesia context on interruption causes the agent to talk over the caller. Implement energy-based VAD with immediate context teardown.
- Synchronous Token Processing: Waiting for full LLM sentences before invoking TTS destroys the latency budget. Must stream tokens concurrently via Cartesia's continuation API.
- Audio Resampling Overhead: Using 24kHz/44.1kHz TTS requires CPU-intensive resampling to 8kHz for telephony, adding 50β100ms latency and breaking streaming continuity. Always use native 8kHz PCM outputs.
- Misconfigured STT Endpointing: Too aggressive (
endpointing < 200ms) cuts off speakers mid-sentence; too lenient (> 600ms) adds silence latency. Tune based on conversational domain and monitor speech_final triggers.
- Blocking the Asyncio Event Loop: Synchronous HTTP calls or heavy CPU tasks in the main loop will drop AudioSocket frames, causing Asterisk to hang up. Use
aiohttp, asyncio.gather, and offload CPU work to executors.
- Neglecting Network Jitter & Frame Pacing: AudioSocket requires strict 20ms frame pacing. Bursting frames causes buffer underruns/overruns. Implement a pacing queue with token bucket rate limiting.
- State Machine Deadlocks: Failing to handle partial transcripts or tool call failures leaves the conversation stuck. Implement timeout fallbacks, confidence thresholds, and explicit state reset handlers.
Deliverables
- π System Blueprint: Complete architecture diagram, data flow specification, latency budget breakdown, and component interaction matrix.
- β
Production Checklist: Pre-deployment verification steps including Asterisk module validation, API credential testing, barge-in stress testing, latency benchmarking (<250ms target), and fallback routing configuration.
- βοΈ Configuration Templates:
modules.conf & extensions.conf snippets for AudioSocket routing and AGI bootstrapping
systemd service unit for production daemonization with automatic restart and resource limits
requirements.txt and environment variable schema for API keys, endpoint URLs, and latency tuning parameters
- Python asyncio TCP server skeleton with AudioSocket frame reader/writer, Deepgram/Groq/Cartesia WebSocket clients, and state machine integration points