In the world of text-to-speech (TTS), latency is king. Whether you’re building a real-time voice assistant or a transcription service, having a low-latency TTS system can make or break your user experience. Let’s explore how to test TTS API latency, optimize it for faster response times, and get creative with solutions to cut down those milliseconds.
Latency, the delay between requesting a text-to-speech (TTS) response and receiving the audio output, is crucial for delivering smooth, real-time experiences in applications like voice assistants, transcription services, and speech synthesis. Low latency enhances the user experience, especially when interacting with systems driven by large language models (LLMs) such as ChatGPT, or with speech platforms like Microsoft’s Speech Service. Whether you’re using ElevenLabs, Google Speech, or Deepgram, cutting down latency ensures your application responds quickly to user input, whether users are holding a conversation or converting speech to text (STT).
For developers integrating TTS APIs into their applications, whether using SDKs from OpenAI, Microsoft, or ElevenLabs, low latency means faster transcription and quicker audio playback, which is especially important for real-time LLM interactions and live voice applications. The faster the response times, the more natural the application will feel to end-users, whether it’s a voice assistant, IVR system, or content creation tool.
Before you can optimize, you need to measure. Latency in PlayHT (or any TTS provider) typically falls into three main categories: network latency (how long the request and audio take to travel over the wire), processing latency (how long the TTS engine takes to synthesize speech), and playback latency (how long before audio actually starts playing on the client).
Using Python, we can easily measure the latency when interacting with PlayHT’s API. Here’s a simple script that measures the time it takes to request audio from PlayHT and receive a response.
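Here’s a minimal sketch of that script. The endpoint URL, headers, and request fields are assumptions modeled on PlayHT’s v2-style REST API, so verify them against the current API docs before use:

```python
import time

def measure_latency(request_fn, *args, **kwargs):
    """Time one request round-trip; returns (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = request_fn(*args, **kwargs)
    return result, time.perf_counter() - start

def playht_tts(text):
    """Request speech from PlayHT's REST API. The endpoint, headers, and
    body fields here are illustrative -- confirm them against the API docs."""
    import requests  # pip install requests
    response = requests.post(
        "https://api.play.ht/api/v2/tts",
        headers={
            "Authorization": "Bearer YOUR_API_KEY",  # placeholder credentials
            "X-USER-ID": "YOUR_USER_ID",
            "Content-Type": "application/json",
        },
        json={"text": text, "voice": "some-voice-id"},  # placeholder voice ID
    )
    response.raise_for_status()
    return response.json()

# Usage (requires valid credentials):
# data, response_time = measure_latency(playht_tts, "Hello from PlayHT!")
# print(f"Round-trip latency: {response_time:.3f}s")
```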
Here’s what’s happening: `response_time` measures the full round-trip time to get the audio file URL from the PlayHT API.

If your application requires real-time responses, streaming with WebSockets is a great option. PlayHT supports WebSocket-based streaming, delivering audio chunks as soon as they’re ready and improving perceived response times.
Here’s how you can stream audio chunks from PlayHT using async Python with `websockets`:
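A sketch using the `websockets` package. The streaming URL and message schema below are assumptions (PlayHT’s streaming docs define the real protocol), and `chunk_buffer` is a hypothetical stand-in for an audio player:

```python
import asyncio
import json

async def stream_tts(ws_url, text, on_chunk):
    """Open a WebSocket, send the text, and hand each binary audio frame to
    on_chunk as it arrives. URL and message schema are illustrative --
    PlayHT's streaming docs define the actual protocol."""
    import websockets  # pip install websockets
    async with websockets.connect(ws_url) as ws:
        await ws.send(json.dumps({"text": text, "voice": "some-voice-id"}))
        async for message in ws:
            if isinstance(message, bytes):
                on_chunk(message)  # start playback/buffering immediately
            elif json.loads(message).get("event") == "end":
                break  # server signals the stream is complete

def chunk_buffer():
    """Simple sink that collects chunks (stand-in for an audio player)."""
    buffer = bytearray()
    def on_chunk(chunk: bytes):
        buffer.extend(chunk)
    return buffer, on_chunk

# Usage (hypothetical endpoint):
# buffer, on_chunk = chunk_buffer()
# asyncio.run(stream_tts("wss://api.play.ht/v2/tts/stream", "Hello!", on_chunk))
```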
By using WebSockets, you start playback while the TTS engine is still working, dramatically cutting down waiting time. This is especially useful for real-time applications like chatbots or interactive voice response (IVR) systems.
Let’s now dive into practical strategies for reducing latency when using PlayHT’s TTS.
The format of your audio file matters for performance. Large, uncompressed formats like WAV offer high quality but introduce network latency due to their size. Opt for Opus or MP3 to reduce bandwidth requirements without sacrificing too much quality.
Here’s an example of requesting Opus audio from PlayHT:
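One way to do this, assuming an `output_format` field in the request body (the field name and voice ID are assumptions; confirm them against PlayHT’s API reference):

```python
def build_tts_request(text, output_format="opus", voice="some-voice-id"):
    """Build a TTS request body asking for a compressed output format.
    Field names are assumptions -- check PlayHT's API reference."""
    return {
        "text": text,
        "voice": voice,                  # placeholder voice ID
        "output_format": output_format,  # "opus" or "mp3" instead of "wav"
    }

# Sending it (requires valid credentials and an HTTP client such as requests):
# requests.post("https://api.play.ht/api/v2/tts",
#               headers=auth_headers, json=build_tts_request("Hello!", "opus"))
```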
Smaller files mean faster transmission and low-latency audio playback.
If you don’t need studio-quality audio, lowering the sample rate (e.g., from 48kHz to 16kHz) can speed up processing and reduce file sizes, further improving latency.
This can be a great option for use cases like voice prompts or casual conversations where high fidelity isn’t required.
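A quick back-of-the-envelope helper makes the savings concrete: uncompressed PCM size scales linearly with sample rate, so dropping from 48kHz to 16kHz cuts the raw audio to a third of the size.

```python
def pcm_size_bytes(sample_rate_hz, bit_depth=16, seconds=1.0, channels=1):
    """Uncompressed PCM size: rate * (bits / 8) * channels * duration."""
    return int(sample_rate_hz * (bit_depth // 8) * channels * seconds)

# 10 seconds of mono 16-bit audio:
# pcm_size_bytes(48000, seconds=10) -> 960,000 bytes
# pcm_size_bytes(16000, seconds=10) -> 320,000 bytes (3x less to move and decode)
```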
Rather than waiting for the entire audio file to finish generating, you can implement async requests to handle transcription and speech synthesis in parallel. Python’s `asyncio` module can handle this:
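A sketch of the pattern: `synthesize` is a placeholder coroutine standing in for a real async TTS request (e.g., via `aiohttp`, or a blocking client wrapped in `asyncio.to_thread`), and `asyncio.gather` fires several of them concurrently:

```python
import asyncio

async def synthesize(text):
    """Placeholder for an async TTS request -- real code would perform
    network I/O here instead of sleeping."""
    await asyncio.sleep(0)  # stands in for the network round-trip
    return f"audio for: {text}"

async def synthesize_all(lines):
    # gather() runs all requests concurrently instead of awaiting each in turn.
    return await asyncio.gather(*(synthesize(line) for line in lines))

# results = asyncio.run(synthesize_all(["Hello there!", "How are you?"]))
```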
With async code, you can trigger other operations (like playback or UI updates) while waiting for the TTS response. This is especially important in real-time applications like virtual assistants or transcription services.
Take advantage of PlayHT’s global infrastructure by ensuring your requests are routed to the nearest server region to reduce network latency. Additionally, consider multi-region deployments if your application serves users worldwide, especially in high-speed or mission-critical environments.
When making requests to PlayHT, send only the essential data. Avoid excessive or redundant metadata and unnecessary audio formats that could slow down processing and increase the size of the response.
For repeated text-to-speech conversions, caching the audio data or audio files can save valuable processing time. If the same text is requested multiple times, simply return the pre-cached audio file instead of hitting the API again.
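A minimal in-memory sketch of that cache, keyed on a hash of the voice and text; `synthesize_fn` is a hypothetical stand-in for whatever function actually calls the PlayHT API:

```python
import hashlib

_audio_cache = {}

def tts_cached(text, voice, synthesize_fn):
    """Return cached audio for a (voice, text) pair; hit the API only on a miss.
    synthesize_fn is a stand-in for the function that actually calls PlayHT."""
    key = hashlib.sha256(f"{voice}:{text}".encode()).hexdigest()
    if key not in _audio_cache:
        _audio_cache[key] = synthesize_fn(text, voice)  # cache miss: call the API
    return _audio_cache[key]
```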
For use cases where minimizing latency is absolutely critical (like in real-time gaming or VR), consider running lightweight TTS models locally on the client side. While PlayHT excels in cloud-based TTS, using open-source models like VITS or FastSpeech 2 alongside PlayHT as a fallback can provide true real-time synthesis.
Here’s an example of running a local TTS model in Python:
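A sketch using the open-source Coqui TTS package (`pip install TTS`) with one of its pretrained English VITS models; the model name is an assumption, and the first call downloads the model weights:

```python
def synthesize_locally(text, out_path="output.wav"):
    """Synthesize speech fully on-device with Coqui TTS (pip install TTS).
    The model name below is one of Coqui's pretrained English VITS voices;
    the first call downloads the weights, after which no network round-trip
    is needed at synthesis time."""
    from TTS.api import TTS
    tts = TTS(model_name="tts_models/en/ljspeech/vits")
    tts.tts_to_file(text=text, file_path=out_path)
    return out_path

# synthesize_locally("Instant playback, no API call required.")
```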
This approach can completely bypass network and processing latency, providing instant playback for specific use cases.
When it comes to optimizing latency in PlayHT or any TTS service, it’s essential to break down the different stages—network, processing, and playback—to identify bottlenecks. By testing thoroughly with Python, optimizing audio formats and sample rates, using async methods, and leveraging caching, you can significantly reduce response times.
For real-time applications, integrating WebSocket streaming or even client-side TTS inference might be necessary to achieve near-instant responses.
Be sure to check out PlayHT’s API docs and their GitHub samples to get started with low-latency TTS for your project!