Audio Format Migration Guide

Overview

Voice Mode now uses PCM audio format by default for TTS streaming. This change provides:

Zero encoding latency - No compression overhead for real-time streaming
Best streaming performance - Direct audio data without conversion
Maximum compatibility - Works with all audio systems
Instant playback - No decoding required

For STT uploads and audio saving, compressed formats like Opus are still available.

Important Note: While Opus was originally intended for streaming due to its low-latency design, in practice it requires full buffering before playback. PCM is the only format that truly supports progressive streaming for TTS.

Quick Start

For most users, no action is required. Voice Mode will automatically use PCM format for TTS streaming, providing the best real-time performance.

To Use Compressed Formats

If you prefer compressed formats (trading latency for smaller file sizes):

export VOICEMODE_TTS_AUDIO_FORMAT="opus"  # or mp3, aac, etc.

Or add to your MCP configuration:

{
  "mcpServers": {
    "voice-mode": {
      "command": "uvx",
      "args": ["voice-mode"],
      "env": {
        "OPENAI_API_KEY": "your-key",
        "VOICEMODE_TTS_AUDIO_FORMAT": "opus"
      }
    }
  }
}

Configuration Options

Basic Configuration

# Set default format for all operations
export VOICEMODE_AUDIO_FORMAT="pcm"  # Options: pcm, opus, mp3, wav, flac, aac

# PCM is default for TTS streaming (best performance)
export VOICEMODE_TTS_AUDIO_FORMAT="pcm"

Advanced Configuration

# Different formats for TTS and STT
export VOICEMODE_TTS_AUDIO_FORMAT="pcm"    # For text-to-speech (default)
export VOICEMODE_STT_AUDIO_FORMAT="opus"   # For speech-to-text upload

# Quality settings (for compressed formats)
export VOICEMODE_OPUS_BITRATE="32000"      # Opus bitrate (default: 32kbps)
export VOICEMODE_MP3_BITRATE="64k"         # MP3 bitrate (default: 64k)
export VOICEMODE_AAC_BITRATE="64k"         # AAC bitrate (default: 64k)
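
Before restarting your client, it can help to confirm that the value you exported is one of the formats listed above. The following is a minimal sketch, not part of Voice Mode: `is_valid_format` is a hypothetical helper, the format list is copied from the options comment in Basic Configuration, and the fallback to `pcm` when the variable is unset reflects the TTS default described in this guide.

```shell
# Formats listed in this guide as valid values (copied from Basic Configuration).
VALID_FORMATS="pcm opus mp3 wav flac aac"

# is_valid_format FORMAT
# Hypothetical helper: succeeds (exit status 0) if FORMAT is in the list above.
is_valid_format() {
    case " $VALID_FORMATS " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Report the TTS format visible in this shell, assuming PCM when unset.
tts_format="${VOICEMODE_TTS_AUDIO_FORMAT:-pcm}"
if is_valid_format "$tts_format"; then
    echo "TTS format: $tts_format"
else
    echo "Unrecognized TTS format: $tts_format" >&2
fi
```

This only checks the shell's own environment. Values set in the `env` section of an MCP configuration are not visible here.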

Provider Compatibility

Voice Mode automatically validates format compatibility with your providers:

Provider	TTS Formats	STT Formats
OpenAI	opus, mp3, aac, flac, wav, pcm	mp3, opus, wav, flac, m4a, webm
Kokoro (local)	mp3, wav	N/A
Whisper.cpp (local)	N/A	wav, mp3, opus, flac, m4a

If you select a format that your provider does not support, Voice Mode automatically falls back to a compatible one.
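
The fallback can be pictured as a lookup against the TTS column of the table. The sketch below is illustrative only: `tts_formats_for` and `pick_tts_format` are hypothetical names, and choosing the provider's first listed format as the fallback is an assumption, not Voice Mode's documented selection logic.

```shell
# Hypothetical lookup mirroring the TTS column of the provider table.
tts_formats_for() {
    case "$1" in
        openai) echo "opus mp3 aac flac wav pcm" ;;
        kokoro) echo "mp3 wav" ;;
        *)      echo "" ;;
    esac
}

# pick_tts_format PROVIDER REQUESTED
# Prints REQUESTED if the provider lists it. Otherwise prints the provider's
# first listed format (an assumed fallback order) and logs a note to stderr.
pick_tts_format() {
    local provider="$1" requested="$2" supported
    supported="$(tts_formats_for "$provider")"
    if [ -z "$supported" ]; then
        echo "No TTS formats known for provider '$provider'" >&2
        return 1
    fi
    case " $supported " in
        *" $requested "*) echo "$requested"; return 0 ;;
    esac
    set -- $supported   # word-split the list; $1 is now the first supported format
    echo "Format '$requested' not supported by $provider, using '$1' instead" >&2
    echo "$1"
}

pick_tts_format kokoro opus   # prints "mp3"; the note goes to stderr
```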

Migration from Existing Setup

Checking Your Current Setup

If you have existing audio files saved with VOICEMODE_SAVE_AUDIO=true, they are likely in MP3 or Opus format. You can check:

ls ~/voicemode_audio/
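
If the directory holds many files, a per-extension count is easier to read than a raw listing. This is a minimal sketch that assumes the saved files sit directly in `~/voicemode_audio`, the path used above. `count_by_extension` is a hypothetical helper name.

```shell
# count_by_extension DIR
# Prints one "<count> <extension>" line per file type found directly in DIR.
count_by_extension() {
    find "$1" -maxdepth 1 -type f -name '*.*' \
        | sed 's/.*\.//' \
        | sort \
        | uniq -c
}

# Only run against the save directory if it exists.
if [ -d ~/voicemode_audio ]; then
    count_by_extension ~/voicemode_audio
fi
```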

Gradual Migration

You can run multiple formats side-by-side:

Keep existing compressed audio files
TTS streaming uses PCM for best performance
STT uploads can use compressed formats
All formats work seamlessly together

Converting Existing Files

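To convert existing MP3 files to Opus (optional). Re-encoding one lossy format into another loses some quality, so keep the originals if that matters: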

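# Using ffmpeg (requires a build with libopus)
for file in ~/voicemode_audio/*.mp3; do
    [ -e "$file" ] || continue  # skip the literal pattern when no .mp3 files exist
    # Re-encode to Opus at 32 kbps; the original .mp3 is left in place
    ffmpeg -i "$file" -c:a libopus -b:a 32k "${file%.mp3}.opus"
done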

Troubleshooting

Issue: "Provider doesn't support format"

Voice Mode will automatically fall back to a supported format. You'll see a log message like:

Format 'opus' not supported by kokoro, using 'mp3' instead

Note: PCM is universally supported for streaming.

Issue: "Audio playback issues"

Some older systems might have issues with Opus playback. Try:

Update your audio libraries:

# Ubuntu/Debian
sudo apt update && sudo apt install libopus0 libopusfile0

# macOS
brew install opus opus-tools

Or switch to a different format, such as MP3:

export VOICEMODE_TTS_AUDIO_FORMAT="mp3"

Issue: "Larger file sizes than expected"

Opus files can be somewhat larger than the bitrate alone suggests when saved in an Ogg container, because the container adds its own header and page overhead. The audio data itself is still compressed efficiently.
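
To see which codec and container a saved file actually uses, you can inspect it with `ffprobe`, which ships with ffmpeg. This is a sketch that assumes ffmpeg is installed. `audio_info` is a hypothetical wrapper name, and the file path in the usage comment is a placeholder.

```shell
# audio_info FILE
# Prints the codec of the first audio stream, plus the container format
# and total file size in bytes, as reported by ffprobe.
audio_info() {
    ffprobe -v error -select_streams a:0 \
        -show_entries stream=codec_name:format=format_name,size \
        -of default=noprint_wrappers=1 "$1"
}

# Usage (placeholder path):
#   audio_info path/to/saved-file.opus
```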

Format Comparison

Format	File Size*	Quality	Latency	Best For
PCM	N/A (streaming)	Uncompressed	Zero	TTS streaming (default)
Opus	Smallest (100KB)	Excellent for voice	High (buffering required)	STT uploads, saving
MP3	Medium (500KB)	Good	Low	Wide compatibility
AAC	Medium (450KB)	Good	Low	Apple ecosystem
FLAC	Large (2MB)	Lossless	Low	Archival
WAV	Largest (5MB)	Uncompressed	Zero	Local processing

*Approximate sizes for 1 minute of speech
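
For a rough cross-check, the size of a constant-bitrate file is bitrate × duration ÷ 8, and raw PCM is sample rate × bytes per sample × channels × duration. The sketch below does that arithmetic. The helper names are hypothetical, and the 24 kHz, 16-bit, mono PCM parameters are an assumption for illustration. Real files vary with encoder settings, variable bitrate, and container overhead, so expect the table's approximate figures to differ from these estimates.

```shell
# compressed_size_kb BITRATE_KBPS SECONDS
# Approximate size in KB (1 KB = 1000 bytes) at a constant bitrate.
compressed_size_kb() {
    echo $(( $1 * $2 / 8 ))
}

# pcm_size_kb SAMPLE_RATE_HZ BYTES_PER_SAMPLE CHANNELS SECONDS
# Approximate size in KB of raw, uncompressed PCM audio.
pcm_size_kb() {
    echo $(( $1 * $2 * $3 * $4 / 1000 ))
}

compressed_size_kb 64 60     # 480  (one minute at 64 kbps)
compressed_size_kb 32 60     # 240  (one minute at 32 kbps)
pcm_size_kb 24000 2 1 60     # 2880 (one minute of 24 kHz 16-bit mono, assumed parameters)
```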

Benefits of PCM for Streaming

Zero Latency: No encoding/decoding overhead
Best Performance: Direct audio playback
Universal Support: Works on all systems
Streaming Optimized: No buffering for format conversion
Real-time Ready: Perfect for live conversations

Benefits of Opus for Uploads

Bandwidth Efficiency: Crucial for cloud API calls
Small File Size: 50-80% smaller than MP3
Voice Optimized: Designed for speech
Wide Platform Support: Works on modern systems
Future-proof: Active development

Changing Default Formats

To change from PCM streaming to compressed formats:

Set environment variables:

# For TTS streaming (consider latency impact)
export VOICEMODE_TTS_AUDIO_FORMAT="opus"

# For STT uploads (already uses compression by default)
export VOICEMODE_STT_AUDIO_FORMAT="mp3"

Or update your MCP configuration as shown above
Restart your MCP client

PCM provides the best streaming performance, but compressed formats are useful for:

Reducing bandwidth usage
Saving audio files
STT uploads to cloud services