Back to KB
Difficulty
Intermediate
Read Time
6 min

Building a Real-Time AI Voice Agent for Asterisk

By Codcompass TeamΒ·Β·6 min read

Current Situation Analysis

Every missed phone call represents lost revenue, particularly for time-sensitive home services (plumbers, electricians, locksmiths). Human agents are costly, require shift coverage, and inevitably miss off-hours calls. Traditional IVR systems force callers through rigid menus, degrading user experience and increasing abandonment rates.

The core failure mode of legacy voice AI stacks is latency. Callers expect conversational turn-taking; a 1–2 second pause after every utterance breaks immersion and triggers hang-ups. Traditional cloud providers fail in real-time telephony contexts for three reasons:

  1. Synchronous Processing Pipelines: Most STT/LLM/TTS chains wait for complete sentences before processing, adding 300–500ms of accumulation delay.
  2. Inference Bottlenecks: General-purpose LLMs (e.g., GPT-4) exhibit 500ms–2s Time-To-First-Token (TTFT), making real-time dialogue impossible.
  3. Audio Format Mismatch: High-fidelity TTS engines output 24kHz/44.1kHz audio, requiring CPU-intensive resampling to 8kHz for telephony, adding 50–100ms of processing overhead and breaking streaming continuity.

Without a concurrent, token-streaming architecture that aligns STT endpointing, speculative LLM decoding, and native telephony PCM output, sub-250ms mouth-to-ear latency remains unachievable.

WOW Moment: Key Findings

ApproachMouth-to-Ear LatencyTTFT (LLM)TTFB (TTS)Audio Format HandlingConcurrency Model
Traditional Cloud Stack (Big 3 STT + GPT-4 + ElevenLabs)1,200–2,000ms500ms–2,000ms300–400msRequires 24kHzβ†’8kHz resamplingSynchronous/Blocking
Optimized Stack (Deepgram Nova-3 + Groq Specdec + Cartesia Sonic-3)200–250ms30–50ms50–80msNative 8kHz PCM (pcm_s16le)Concurrent Token-Streaming

Key Findings:

  • Token-Streaming Breakthrough: Piping LLM tokens directly into TTS as they arrive reduces perceived latency by ~85%. The caller hears the response while the AI is still generating it.
  • Speculative Decoding Advantage: Groq's specdec variant delivers 1,665 tokens/second (6x throughput over standard variants) with identical output quality, making 70B parameter models viable for real-time voice.
  • Native Telephony Audio: Cartesia's native 8kHz PCM output eliminates resampling overhead and enables true WebSocket continuation streaming, critical for maintaining AudioSocket frame pacing.
  • Latency Budget Alignment: Deepgram STT (~150-200ms) + Groq TTFT (~30-50ms) + Cartesia TTFB (~50-80ms) + AudioSocket transmission (~20ms) = ~200-250ms total, consistently beating human agent response times.

Core Solution

Architecture & Data Flow

                          TELEPHONE NETWORK
                                |
                           SIP Trunk
                                |
                    +

πŸŽ‰ Mid-Year Sale β€” Unlock Full Article

Base plan from just $4.99/mo or $49/yr

Sign in to read the full article and unlock all 635+ tutorials.

Sign In / Register β€” Start Free Trial

7-day free trial Β· Cancel anytime Β· 30-day money-back