In text chat, a 2-second delay is acceptable. In Voice AI, a 2-second delay is a disaster.
If your AI agent takes 2 seconds to respond on a phone call, the user assumes the line is dead, or they start talking over the bot. To feel “human,” your Voice Agent must achieve a Time to First Token (TTFT) of under 500ms (ideally <200ms).
Achieving this requires stripping away every millisecond of network and compute overhead.
At The AI Division, we specialize in High-Performance Voice Agent Development (Link to your Service Page). We have optimized pipelines for call centers that handle thousands of concurrent calls with near-zero latency.
Here is the engineering architecture we use to break the speed barrier.
The 4 Killers of Latency
Before fixing it, you must measure it. Latency comes from four sources:
- Transcription (STT): Converting user audio to text.
- Inference (LLM): The model thinking.
- Synthesis (TTS): Converting text back to audio.
- Network: The trip between servers.
We will tackle the two biggest bottlenecks: Inference and Network.
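Before optimizing, it helps to write the pipeline down as a latency budget and see where a ~500ms target actually goes. The stage numbers below are illustrative assumptions for the sketch, not measurements from any specific stack:

```python
# Illustrative latency budget for one conversational turn.
# All numbers are assumptions for this sketch, not benchmarks.
BUDGET_MS = {
    "stt_finalize": 150,     # STT emits the final transcript
    "llm_first_token": 200,  # LLM time to first token
    "tts_first_audio": 80,   # TTS time to first audio byte
    "network": 70,           # round trips between services
}

total = sum(BUDGET_MS.values())
print(f"Estimated time to first audio: {total}ms")
for stage, ms in sorted(BUDGET_MS.items(), key=lambda kv: -kv[1]):
    print(f"  {stage}: {ms}ms ({ms / total:.0%})")
```

Run this against your own measured numbers per stage; whichever line dominates the printout is where the next optimization effort should go.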
Step 1: Switch to WebSockets (No REST APIs)
If you are using standard HTTP POST requests to OpenAI, you are already losing 300ms+ in handshake overhead per turn. You must use WebSockets (or HTTP/2 streaming).
Here is a Python Asyncio skeleton for a persistent, bi-directional audio stream.
```python
import asyncio
import base64
import json

import websockets

async def voice_stream_handler():
    uri = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": "Bearer YOUR_API_KEY",
        "OpenAI-Beta": "realtime=v1",
    }
    # Note: websockets >= 14 renamed extra_headers to additional_headers
    async with websockets.connect(uri, extra_headers=headers) as websocket:
        print("Connected to Realtime API")

        # Send an audio chunk (base64-encoded PCM16 by default)
        await websocket.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": "BASE64_ENCODED_AUDIO_CHUNK",
        }))

        # Receive the response stream as it is generated
        async for message in websocket:
            response = json.loads(message)
            if response["type"] == "response.audio.delta":
                # Play audio bytes immediately (don't wait for the full sentence)
                stream_audio(base64.b64decode(response["delta"]))

def stream_audio(pcm_bytes: bytes) -> None:
    # Placeholder: write the decoded PCM to your audio output device here
    ...

# Run the async loop
asyncio.run(voice_stream_handler())
```
Pro Tip: Managing WebSocket stability at scale is difficult. If your calls are dropping, check out our Enterprise AI Infrastructure Services and connect with us today to fix it.
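One pattern that helps with dropped calls is automatic reconnection with capped exponential backoff and jitter. A minimal sketch, assuming a handler coroutine like `voice_stream_handler` above (the `backoff_delay` helper and its parameters are illustrative, not from any library):

```python
import asyncio
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 10.0) -> float:
    """Exponential backoff: 0.5s, 1s, 2s, ... capped at 10s."""
    return min(base * (2 ** attempt), cap)

async def run_with_reconnect(handler) -> None:
    """Restart a websocket coroutine when the connection drops."""
    attempt = 0
    while True:
        try:
            await handler()
            return  # handler finished cleanly
        except OSError as exc:
            delay = backoff_delay(attempt)
            print(f"Connection lost ({exc!r}); retrying in up to {delay:.1f}s")
            # Jitter keeps a fleet of reconnecting clients from stampeding
            await asyncio.sleep(random.uniform(0, delay))
            attempt += 1
```

The jitter matters at call-center scale: without it, a regional network blip makes thousands of clients reconnect in lockstep and hammer the API at the same instant.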
Step 2: The “Groq” Factor (LPU vs GPU)
For text generation, GPT-4o is smart, but it can be variable in speed.
If your agent doesn’t need “Einstein-level” reasoning (e.g., a simple appointment setter), switch the inference engine to Groq.
Groq uses LPUs (Language Processing Units) instead of GPUs.
- GPT-4o: ~40-60 tokens/sec.
- Groq (Llama 3-8b): ~800 tokens/sec.
This creates an instant response feel.
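The arithmetic behind that feel is simple. Assuming the throughput figures above and a typical one-sentence spoken reply of roughly 40 tokens:

```python
def generation_time_ms(tokens: int, tokens_per_sec: float) -> float:
    """Time to fully generate a reply at a given decode throughput."""
    return tokens / tokens_per_sec * 1000

reply_tokens = 40  # assumed length of a short spoken reply
print(f"GPT-4o @ 50 tok/s:  {generation_time_ms(reply_tokens, 50):.0f}ms")
print(f"Groq   @ 800 tok/s: {generation_time_ms(reply_tokens, 800):.0f}ms")
# 800ms vs 50ms -- the Groq reply is done before the slower decode gets going
```

With streaming TTS, full-generation time matters less than TTFT, but fast decode still shortens every pause mid-sentence.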
```python
from groq import Groq
import time

client = Groq(api_key="YOUR_GROQ_KEY")

start = time.time()
stream = client.chat.completions.create(
    messages=[{"role": "user", "content": "Book an appointment for Tuesday."}],
    model="llama3-8b-8192",
    stream=True,  # Always stream!
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        # First token usually arrives in <150ms
        print(chunk.choices[0].delta.content, end="")

print(f"\nTotal latency: {time.time() - start:.2f}s")
```
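Since TTFT, not total generation time, is what the caller perceives, it is worth timing the first chunk separately. A small generic sketch (the `measure_ttft` helper is ours, not part of any SDK) that works with any iterator of text chunks, including the Groq stream above:

```python
import time
from typing import Iterable, Tuple

def measure_ttft(stream: Iterable[str]) -> Tuple[float, str]:
    """Return (seconds until first token, full concatenated text)."""
    start = time.time()
    ttft = None
    parts = []
    for token in stream:
        if ttft is None:
            ttft = time.time() - start  # first token has arrived
        parts.append(token)
    return (ttft if ttft is not None else float("inf")), "".join(parts)

# Demo with a plain iterator standing in for an LLM token stream:
ttft, text = measure_ttft(iter(["Book", "ed ", "for ", "Tuesday."]))
print(f"TTFT: {ttft * 1000:.1f}ms, reply: {text!r}")
```

Log TTFT per call in production; averages hide the tail-latency turns that make callers say "Hello?".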
Step 3: Semantic Caching (The 0ms Response)
The fastest way to generate an answer is not to generate it at all.
Many voice queries are repetitive (“What are your hours?”, “Are you open?”). By implementing Semantic Caching, we store the vector of the question. If a user asks a similar question, we return the pre-cached audio instantly.
This reduces latency from ~500ms to ~20ms.
```python
from gptcache import cache
from gptcache.adapter import openai

cache.init()
cache.set_openai_key()

# If a semantically similar question was asked before, the cached text
# response is returned instantly -- no LLM inference happens. Cache the
# synthesized audio alongside it to skip the TTS step as well.
response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=[{'role': 'user', 'content': 'What is your refund policy?'}]
)
```
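The core idea fits in a few lines even without a caching library. A minimal sketch: the `embed` function below is a toy bag-of-words stand-in (in production you would use a real embedding model and a vector store), but the lookup logic is the same:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": word counts. Swap in a real embedding model here.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached_answer)

    def get(self, question: str):
        q = embed(question)
        for vec, answer in self.entries:
            if cosine(q, vec) >= self.threshold:
                return answer  # cache hit: no LLM call, no TTS call
        return None  # cache miss: fall through to the LLM

    def put(self, question: str, answer: str):
        self.entries.append((embed(question), answer))

sem_cache = SemanticCache()
sem_cache.put("What are your opening hours?", "We are open 9am to 5pm.")
print(sem_cache.get("What are your opening hours today?"))  # near-duplicate: hit
```

The threshold is the knob to tune: too low and the bot answers the wrong question from cache, too high and you pay for inference on near-duplicates.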
Real World Results
We recently implemented a similar voice agent for an insurance company (Read Case Study).
- Before: 2.5s delay. Callers kept asking "Hello? Are you there?"
- After: 0.3s delay. Callers spoke naturally, often unaware they were talking to an AI.
Conclusion
To get under 200ms, you must optimize the entire pipeline:
- Use WebSockets (not HTTP).
- Stream Tokens immediately.
- Use Groq for speed or GPT-4o Realtime for smarts.
- Cache common answers.
Speed is the most important feature of Voice AI. If it’s slow, it’s broken.
Don’t Want to Code This Yourself?
Building low-latency voice pipelines requires complex engineering. The AI Division provides done-for-you Voice Agent development. We handle the WebSockets, the LLM optimization, and the Telephony integration (Twilio/Vonage) so you can focus on your business.
Explore Our Voice AI Services
Learn more about our AI Voice Agent implementation service.