Audio Format Migration Guide

Overview

Voice Mode now uses PCM audio format by default for TTS streaming. This change provides:

  • Zero encoding latency - No compression overhead for real-time streaming
  • Best streaming performance - Direct audio data without conversion
  • Maximum compatibility - Works with all audio systems
  • Instant playback - No decoding required

For STT uploads and audio saving, compressed formats like Opus are still available.

Important Note: Although Opus is designed for low latency and was originally intended for streaming here, in practice Voice Mode has to buffer an Opus response fully before playback can begin. PCM is the only format that supports truly progressive streaming for TTS.

Quick Start

For most users, no action is required. Voice Mode will automatically use PCM format for TTS streaming, providing the best real-time performance.

To Use Compressed Formats

If you prefer compressed formats (trading latency for smaller file sizes):

export VOICEMODE_TTS_AUDIO_FORMAT="opus"  # or mp3, aac, etc.

Or add to your MCP configuration:

{
  "mcpServers": {
    "voice-mode": {
      "command": "uvx",
      "args": ["voice-mode"],
      "env": {
        "OPENAI_API_KEY": "your-key",
        "VOICEMODE_TTS_AUDIO_FORMAT": "opus"
      }
    }
  }
}

Configuration Options

Basic Configuration

# Set default format for all operations
export VOICEMODE_AUDIO_FORMAT="pcm"  # Options: pcm, opus, mp3, wav, flac, aac

# PCM is default for TTS streaming (best performance)
export VOICEMODE_TTS_AUDIO_FORMAT="pcm"
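
A mistyped format name is easy to miss in a shell profile. The following is a minimal sketch of a pre-flight check, not part of Voice Mode itself: the function name is_documented_format is hypothetical, and the accepted values are simply the options listed in the comment above. Falling back to "pcm" when the variable is unset is an assumption made for illustration.

```shell
# Hypothetical helper: succeed only for the format names documented above.
is_documented_format() {
    case "$1" in
        pcm|opus|mp3|wav|flac|aac) return 0 ;;
        *) return 1 ;;
    esac
}

# Assumption: treat an unset variable as "pcm" for the purpose of this check.
fmt="${VOICEMODE_AUDIO_FORMAT:-pcm}"
if ! is_documented_format "$fmt"; then
    echo "Unrecognized VOICEMODE_AUDIO_FORMAT: $fmt" >&2
fi
```

Sourcing this before starting your MCP client prints a warning only when the value is not one of the documented options.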

Advanced Configuration

# Different formats for TTS and STT
export VOICEMODE_TTS_AUDIO_FORMAT="pcm"    # For text-to-speech (default)
export VOICEMODE_STT_AUDIO_FORMAT="opus"   # For speech-to-text upload

# Quality settings (for compressed formats)
export VOICEMODE_OPUS_BITRATE="32000"      # Opus bitrate (default: 32kbps)
export VOICEMODE_MP3_BITRATE="64k"         # MP3 bitrate (default: 64k)
export VOICEMODE_AAC_BITRATE="64k"         # AAC bitrate (default: 64k)

Provider Compatibility

Voice Mode automatically validates format compatibility with your providers:

| Provider | TTS Formats | STT Formats |
|---|---|---|
| OpenAI | opus, mp3, aac, flac, wav, pcm | mp3, opus, wav, flac, m4a, webm |
| Kokoro (local) | mp3, wav | N/A |
| Whisper.cpp (local) | N/A | wav, mp3, opus, flac, m4a |

If you select an unsupported format, Voice Mode will automatically fall back to a compatible format.
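
To make the fallback idea concrete, here is a small sketch that mirrors it using only the TTS column of the table above. The function names tts_formats and pick_tts_format are hypothetical, and the rule "use the provider's first listed format" is an assumption for illustration; Voice Mode's actual selection logic may differ.

```shell
# TTS formats per provider, copied from the compatibility table.
tts_formats() {
    case "$1" in
        openai) echo "opus mp3 aac flac wav pcm" ;;
        kokoro) echo "mp3 wav" ;;
        *)      return 1 ;;
    esac
}

# Print the requested format if the provider lists it; otherwise print the
# provider's first listed format (assumed fallback rule).
pick_tts_format() {
    provider="$1"
    requested="$2"
    supported=$(tts_formats "$provider") || return 1
    for fmt in $supported; do
        if [ "$fmt" = "$requested" ]; then
            echo "$requested"
            return 0
        fi
    done
    # Reuse the positional parameters to grab the first supported format.
    set -- $supported
    echo "$1"
}
```

For example, `pick_tts_format kokoro opus` prints `mp3`, while `pick_tts_format openai opus` prints `opus` unchanged.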

Migration from Existing Setup

Checking Your Current Setup

If you have existing audio files saved with VOICEMODE_SAVE_AUDIO=true, they are likely in MP3 or Opus format. You can check:

ls ~/voicemode_audio/
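
If the directory holds many files, a per-extension count is quicker to read than a raw listing. This is a minimal sketch; the function name count_by_extension is hypothetical, and it assumes saved files carry an extension that reflects their format.

```shell
# Count regular files per extension in a directory (default: ~/voicemode_audio).
count_by_extension() {
    dir="${1:-$HOME/voicemode_audio}"
    for f in "$dir"/*.*; do
        # Skip the unexpanded pattern when nothing matches, and skip directories.
        if [ -f "$f" ]; then
            printf '%s\n' "${f##*.}"
        fi
    done | sort | uniq -c
}
```

Running `count_by_extension` with no argument summarizes ~/voicemode_audio; pass another path to inspect a different directory.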

Gradual Migration

You can run multiple formats side-by-side:

  1. Keep existing compressed audio files
  2. TTS streaming uses PCM for best performance
  3. STT uploads can use compressed formats
  4. All formats work seamlessly together

Converting Existing Files

To convert existing MP3 files to Opus (optional):

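# Using ffmpeg; each .mp3 is re-encoded to Opus at 32 kbps next to the original
for file in ~/voicemode_audio/*.mp3; do
    # Skip the unexpanded pattern when no .mp3 files exist
    [ -e "$file" ] || continue
    ffmpeg -i "$file" -c:a libopus -b:a 32k "${file%.mp3}.opus"
done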

Troubleshooting

Issue: "Provider doesn't support format"

Voice Mode will automatically fall back to a supported format. You'll see a log message like:

Format 'opus' not supported by kokoro, using 'mp3' instead

Note: PCM is universally supported for streaming.

Issue: "Audio playback issues"

Some older systems might have issues with Opus playback. Try:

  1. Update your audio libraries:

    # Ubuntu/Debian
    sudo apt update && sudo apt install libopus0 libopusfile0
    
    # macOS
    brew install opus opus-tools
    

  2. Or switch to a different format:

    export VOICEMODE_TTS_AUDIO_FORMAT="mp3"
    

Issue: "Larger file sizes than expected"

Opus files might appear larger if saved in an OGG container. The actual audio data is still compressed efficiently.

Format Comparison

| Format | File Size* | Quality | Latency | Best For |
|---|---|---|---|---|
| PCM | N/A (streaming) | Uncompressed | Zero | TTS streaming (default) |
| Opus | Smallest (100KB) | Excellent for voice | High (buffering required) | STT uploads, saving |
| MP3 | Medium (500KB) | Good | Low | Wide compatibility |
| AAC | Medium (450KB) | Good | Low | Apple ecosystem |
| FLAC | Large (2MB) | Lossless | Low | Archival |
| WAV | Largest (5MB) | Uncompressed | Zero | Local processing |

*Approximate sizes for 1 minute of speech
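
The sizes above depend on bitrate and sample rate, so it can help to work them out for your own settings. The sketch below uses two standard formulas: a constant-bitrate stream takes bitrate × seconds / 8 bytes, and raw PCM takes sample rate × bytes per sample × channels × seconds. The function names are hypothetical, container overhead is ignored, and the 44.1 kHz, 16-bit, mono parameters are an assumed example, not a statement of what Voice Mode records.

```shell
# Size in kB of a constant-bitrate stream: bits per second * seconds / 8 / 1000.
compressed_kb() {
    echo $(( $1 * $2 / 8 / 1000 ))
}

# Size in kB of raw PCM: sample rate * bytes per sample * channels * seconds / 1000.
pcm_kb() {
    echo $(( $1 * $2 * $3 * $4 / 1000 ))
}

compressed_kb 32000 60    # 32 kbps for 60 s             -> 240
compressed_kb 64000 60    # 64 kbps for 60 s             -> 480
pcm_kb 44100 2 1 60       # 44.1 kHz, 16-bit, mono, 60 s -> 5292
```

At the default bitrates from the Advanced Configuration section, one minute comes to about 240 kB for Opus and 480 kB for MP3 or AAC. The 100KB Opus figure in the table corresponds to a lower bitrate, roughly 13 kbps, so treat the table as a rough guide.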

Benefits of PCM for Streaming

  1. Zero Latency: No encoding/decoding overhead
  2. Best Performance: Direct audio playback
  3. Universal Support: Works on all systems
  4. Streaming Optimized: No buffering for format conversion
  5. Real-time Ready: Perfect for live conversations

Benefits of Opus for Uploads

  1. Bandwidth Efficiency: Crucial for cloud API calls
  2. Small File Size: 50-80% smaller than MP3
  3. Voice Optimized: Designed for speech
  4. Wide Platform Support: Works on modern systems
  5. Future-proof: Active development

Changing Default Formats

To change from PCM streaming to compressed formats:

  1. Set environment variables:

    # For TTS streaming (consider latency impact)
    export VOICEMODE_TTS_AUDIO_FORMAT="opus"
    
    # For STT uploads (already uses compression by default)
    export VOICEMODE_STT_AUDIO_FORMAT="mp3"
    

  2. Or update your MCP configuration as shown above

  3. Restart your MCP client
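
Before restarting, it can be useful to confirm which Voice Mode variables the current shell will actually export. A minimal sketch, with a hypothetical function name; it only inspects the shell environment and cannot see values set in the "env" block of an MCP configuration file.

```shell
# List every exported VOICEMODE_* variable in the current shell, sorted by name.
show_voicemode_env() {
    env | grep '^VOICEMODE_' | sort
}
```

An empty result means no Voice Mode variables are exported in this shell, so any settings would have to come from the MCP configuration or the defaults.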

PCM provides the best streaming performance, but compressed formats are useful for:

  • Reducing bandwidth usage
  • Saving audio files
  • STT uploads to cloud services