VoiceMode Architecture¶
Understanding how VoiceMode components work together to enable voice conversations.
System Overview¶
VoiceMode is built as a Model Context Protocol (MCP) server that provides voice capabilities to AI assistants. It follows a modular architecture with clear separation between voice services, audio processing, and client interfaces.
┌─────────────────────────────────────────────┐
│ MCP Client (Claude) │
└─────────────────┬───────────────────────────┘
│ MCP Protocol
┌─────────────────┴───────────────────────────┐
│ VoiceMode MCP Server │
├──────────────────────────────────────────────┤
│ Core Components │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Tools │ │ Providers│ │ Config │ │
│ └──────────┘ └──────────┘ └──────────┘ │
├──────────────────────────────────────────────┤
│ Voice Services │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Whisper │ │ Kokoro │ │ LiveKit │ │
│ │ (STT) │ │ (TTS) │ │ (RTC) │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└──────────────────────────────────────────────┘
Core Components¶
MCP Server¶
The FastMCP-based server (server.py) is the entry point that: - Exposes tools, resources, and prompts via MCP protocol - Handles stdio transport for communication - Manages service lifecycle and health checks - Auto-imports all tools from the tools directory
Tools System¶
Tools are the primary interface for voice interactions:
converse: Main voice conversation tool - Handles audio recording and playback - Manages TTS/STT service selection - Implements silence detection and VAD - Supports multiple transport methods (local, LiveKit)
Service tools: Installation and management - whisper_install, kokoro_install, livekit_install - Service start/stop/status operations - Model and configuration management
Provider System¶
The provider system (providers.py) implements service discovery and failover:
- Discovery: Automatically finds running services
- Health Checks: Validates service availability
- Failover: Falls back to alternative services
- Load Balancing: Distributes requests across providers
Provider selection priority: 1. User-specified URL (environment variable) 2. Local services (auto-discovered) 3. Cloud services (OpenAI)
Configuration Layer¶
Multi-layered configuration system (config.py):
- Environment Variables: Highest priority
- Project Config:
.voicemode.envin working directory - User Config:
~/.voicemode/voicemode.env - Defaults: Built-in sensible defaults
Voice Services¶
Whisper (Speech-to-Text)¶
Local STT service using OpenAI's Whisper model: - Runs on port 2022 by default - Provides OpenAI-compatible API - Supports multiple model sizes - Hardware acceleration (Metal, CUDA)
Kokoro (Text-to-Speech)¶
Local TTS service with natural voices: - Runs on port 8880 by default - OpenAI-compatible API - Multiple languages and voices - Efficient caching system
LiveKit (Real-Time Communication)¶
WebRTC-based room communication: - Server on port 7880 - Frontend on port 3000 - Room-based architecture - Low-latency audio transport
Audio Pipeline¶
Recording Flow¶
- Audio Capture: PyAudio or LiveKit SDK
- VAD: WebRTC VAD filters non-speech
- Silence Detection: Determines recording end
- STT Processing: Converts audio to text
Playback Flow¶
- TTS Generation: Creates audio from text
- Streaming: Chunks for real-time playback
- Format Conversion: FFmpeg handles formats
- Playback: PyAudio or LiveKit output
Service Architecture¶
Service Lifecycle¶
- Installation: Download binaries, create configs
- Registration: systemd/launchd service files
- Startup: Health checks, port binding
- Discovery: Auto-detection by VoiceMode
- Monitoring: Status checks, log rotation
Service Communication¶
All services expose OpenAI-compatible APIs: - Unified interface for TTS/STT - Standard authentication (API keys) - Consistent error handling - Format negotiation
Transport Methods¶
Local Transport¶
Direct microphone/speaker access: - PyAudio for audio I/O - Low latency - No network overhead - Privacy-focused
LiveKit Transport¶
Room-based WebRTC communication: - Multi-participant support - Network resilient - Browser compatible - Scalable architecture
Frontend Architecture¶
Next.js Application¶
The web frontend (frontend/) provides: - Voice conversation UI - Room management - Real-time status - WebRTC integration
Build System¶
Frontend is bundled with Python package: 1. Built during package creation 2. Served by MCP server 3. Auto-installed dependencies 4. Hot reload in development
Security Model¶
API Key Management¶
- Never stored in code
- Environment variable priority
- Secure MCP transport
- Optional local-only mode
Audio Privacy¶
- Local processing option
- No cloud storage
- Encrypted transport (LiveKit)
- User-controlled recording
Performance Optimization¶
Caching Strategy¶
- Model caching (Whisper/Kokoro)
- Audio format caching
- Provider health caching
- Configuration caching
Resource Management¶
- Lazy service loading
- Connection pooling
- Memory limits (systemd)
- CPU throttling
Error Handling¶
Graceful Degradation¶
- Primary service fails
- Attempt fallback service
- Use cloud service if available
- Return informative error
Recovery Mechanisms¶
- Automatic service restart
- Connection retry logic
- Circuit breaker pattern
- Health check recovery
Extension Points¶
Adding New Tools¶
- Create tool in
tools/directory - Implement with FastMCP decorators
- Auto-imported by server
- Available via MCP
Custom Providers¶
- Implement provider interface
- Add discovery logic
- Register in provider system
- Configure endpoints
Service Integration¶
- Create service installer
- Add systemd/launchd templates
- Implement health checks
- Update CLI commands
Deployment Patterns¶
Development¶
- Local services
- Debug logging
- Hot reload
- Mock providers
Production¶
- Service supervision
- Log rotation
- Health monitoring
- Failover configuration
Containerized¶
- Docker compose setup
- Service orchestration
- Volume management
- Network isolation
Future Architecture¶
Planned Enhancements¶
- Plugin system for tools
- Webhook support
- Multi-language support
- GPU cluster support
Scalability Path¶
- Distributed services
- Queue-based processing
- Caching layers
- Load balancing