VoiceMode Architecture¶

Understanding how VoiceMode components work together to enable voice conversations.

System Overview¶

VoiceMode is built as a Model Context Protocol (MCP) server that provides voice capabilities to AI assistants. It follows a modular architecture with clear separation between voice services, audio processing, and client interfaces.

┌─────────────────────────────────────────────┐
│             MCP Client (Claude)             │
└─────────────────┬───────────────────────────┘
                  │ MCP Protocol
┌─────────────────┴───────────────────────────┐
│           VoiceMode MCP Server              │
├──────────────────────────────────────────────┤
│              Core Components                 │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│  │  Tools   │  │ Providers│  │  Config  │  │
│  └──────────┘  └──────────┘  └──────────┘  │
├──────────────────────────────────────────────┤
│            Voice Services                    │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│  │ Whisper  │  │  Kokoro  │  │ LiveKit  │  │
│  │  (STT)   │  │  (TTS)   │  │  (RTC)   │  │
│  └──────────┘  └──────────┘  └──────────┘  │
└──────────────────────────────────────────────┘

Core Components¶

MCP Server¶

The FastMCP-based server (server.py) is the entry point that:

- Exposes tools, resources, and prompts via the MCP protocol
- Handles stdio transport for communication
- Manages service lifecycle and health checks
- Auto-imports all tools from the tools directory

Tools System¶

Tools are the primary interface for voice interactions:

converse: Main voice conversation tool

- Handles audio recording and playback
- Manages TTS/STT service selection
- Implements silence detection and VAD
- Supports multiple transport methods (local, LiveKit)

Service tools: Installation and management

- whisper_install, kokoro_install, livekit_install
- Service start/stop/status operations
- Model and configuration management

Provider System¶

The provider system (providers.py) implements service discovery and failover:

Discovery: Automatically finds running services
Health Checks: Validates service availability
Failover: Falls back to alternative services
Load Balancing: Distributes requests across providers

Provider selection priority:

1. User-specified URL (environment variable)
2. Local services (auto-discovered)
3. Cloud services (OpenAI)
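The priority order above can be sketched as a small ordering function. This is an illustration only, not the contents of providers.py: the function name `order_endpoints` and the environment variable `EXAMPLE_TTS_BASE_URL` are hypothetical, and the local URLs are assumed to come from some earlier discovery step.

```python
import os
from typing import Iterable, List, Mapping, Optional

# Hypothetical names for illustration; the real variable names may differ.
USER_URL_ENV = "EXAMPLE_TTS_BASE_URL"
CLOUD_URL = "https://api.openai.com/v1"


def order_endpoints(
    discovered_local: Iterable[str],
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return candidate endpoints, highest priority first."""
    env = os.environ if env is None else env
    ordered: List[str] = []

    user_url = env.get(USER_URL_ENV)
    if user_url:
        ordered.append(user_url)       # 1. user-specified URL
    ordered.extend(discovered_local)   # 2. auto-discovered local services
    ordered.append(CLOUD_URL)          # 3. cloud service as the last resort

    # Drop duplicates while keeping the first (highest-priority) occurrence.
    return list(dict.fromkeys(ordered))
```

A caller would then walk the returned list in order and use the first endpoint that passes a health check.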

Configuration Layer¶

The configuration system (config.py) is layered. Sources are listed from highest to lowest priority:

Environment Variables: Highest priority
Project Config: .voicemode.env in working directory
User Config: ~/.voicemode/voicemode.env
Defaults: Built-in sensible defaults
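A minimal sketch of how such layering might be resolved, assuming the list above is in descending priority and that the config files hold simple KEY=VALUE lines. The helper names `parse_env_text` and `resolve_config` and the `EXAMPLE_*` keys are hypothetical and are not taken from config.py.

```python
from typing import Dict, Mapping


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Remove optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_config(
    defaults: Mapping[str, str],
    user_file: Mapping[str, str],
    project_file: Mapping[str, str],
    environment: Mapping[str, str],
) -> Dict[str, str]:
    """Merge layers from lowest to highest priority; later layers win."""
    merged: Dict[str, str] = {}
    for layer in (defaults, user_file, project_file, environment):
        merged.update(layer)
    return merged
```

Merging from lowest to highest priority keeps the precedence rule in one place: a key set in the environment always overrides the same key from either file.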

Voice Services¶

Whisper (Speech-to-Text)¶

Local STT service using OpenAI's Whisper model:

- Runs on port 2022 by default
- Provides OpenAI-compatible API
- Supports multiple model sizes
- Hardware acceleration (Metal, CUDA)

Kokoro (Text-to-Speech)¶

Local TTS service with natural voices:

- Runs on port 8880 by default
- OpenAI-compatible API
- Multiple languages and voices
- Efficient caching system

LiveKit (Real-Time Communication)¶

WebRTC-based room communication:

- Server on port 7880
- Frontend on port 3000
- Room-based architecture
- Low-latency audio transport

Audio Pipeline¶

Recording Flow¶

Microphone → Audio Capture → VAD → Silence Detection → STT Service → Text

Audio Capture: PyAudio or LiveKit SDK
VAD: WebRTC VAD filters non-speech
Silence Detection: Determines recording end
STT Processing: Converts audio to text
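The end-of-recording decision can be illustrated with a simple energy threshold. Note that this sketch swaps in plain RMS loudness in place of WebRTC VAD, so it shows only the silence-detection logic, not real speech classification. The function names, the threshold, and the frame counts are assumptions chosen for the example.

```python
import math
from typing import Iterable, List, Sequence


def rms(frame: Sequence[int]) -> float:
    """Root-mean-square amplitude of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def record_until_silence(
    frames: Iterable[Sequence[int]],
    speech_threshold: float = 500.0,  # assumed RMS level for 16-bit audio
    max_silent_frames: int = 3,       # assumed: stop after this many quiet frames
) -> List[Sequence[int]]:
    """Collect frames until speech has been heard and then goes quiet."""
    kept: List[Sequence[int]] = []
    heard_speech = False
    silent_run = 0

    for frame in frames:
        if rms(frame) >= speech_threshold:
            heard_speech = True
            silent_run = 0
            kept.append(frame)
        elif heard_speech:
            silent_run += 1
            kept.append(frame)        # keep short pauses inside the utterance
            if silent_run >= max_silent_frames:
                break
        # Quiet frames before the first speech are dropped.
    return kept
```

The frames returned here are what would be handed to the STT service. Leading silence is discarded, and a run of quiet frames after speech ends the recording.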

Playback Flow¶

Text → TTS Service → Audio Stream → Format Conversion → Speaker

TTS Generation: Creates audio from text
Streaming: Chunks for real-time playback
Format Conversion: FFmpeg handles formats
Playback: PyAudio or LiveKit output

Service Architecture¶

Service Lifecycle¶

Installation: Download binaries, create configs
Registration: systemd/launchd service files
Startup: Health checks, port binding
Discovery: Auto-detection by VoiceMode
Monitoring: Status checks, log rotation
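Discovery by port can be sketched with a plain TCP connection test. This is a simplified assumption about how auto-detection might work: a real health check would more likely call an HTTP endpoint on the service. The port numbers are the defaults listed under Voice Services, and the function names are hypothetical.

```python
import socket
from typing import Dict, Mapping, Optional

# Default ports as listed in this document.
DEFAULT_PORTS = {"whisper": 2022, "kokoro": 8880, "livekit": 7880}


def is_port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def discover_local_services(
    host: str = "127.0.0.1",
    ports: Optional[Mapping[str, int]] = None,
) -> Dict[str, bool]:
    """Map each service name to whether its port accepts connections."""
    ports = DEFAULT_PORTS if ports is None else ports
    return {name: is_port_open(host, port) for name, port in ports.items()}
```

An open port only shows that something is listening, so a caller would still want to confirm that the process behind it is the expected service.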

Service Communication¶

The speech services (Whisper and Kokoro) expose OpenAI-compatible APIs:

- Unified interface for TTS/STT
- Standard authentication (API keys)
- Consistent error handling
- Format negotiation
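To illustrate what "OpenAI-compatible" means in practice, the sketch below builds (but does not send) a text-to-speech request in the shape of OpenAI's `/audio/speech` route. The `/v1` base path on the local service, the voice name, and the placeholder API key are assumptions; check the service's own documentation for the values it accepts.

```python
import json
import urllib.request


def build_speech_request(
    text: str,
    base_url: str = "http://127.0.0.1:8880/v1",  # Kokoro's default port; /v1 path assumed
    voice: str = "example_voice",                # hypothetical voice name
    model: str = "tts-1",
    api_key: str = "not-needed-locally",         # placeholder value, assumed
) -> urllib.request.Request:
    """Build a POST request for an OpenAI-style /audio/speech route."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": "mp3",
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

Because local and cloud services share this request shape, switching providers amounts to changing `base_url` and the credentials. Passing the request to `urllib.request.urlopen` would send it.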

Transport Methods¶

Local Transport¶

Direct microphone/speaker access:

- PyAudio for audio I/O
- Low latency
- No network overhead
- Privacy-focused

LiveKit Transport¶

Room-based WebRTC communication:

- Multi-participant support
- Network resilient
- Browser compatible
- Scalable architecture

Frontend Architecture¶

Next.js Application¶

The web frontend (frontend/) provides:

- Voice conversation UI
- Room management
- Real-time status
- WebRTC integration

Build System¶

The frontend is bundled with the Python package:

1. Built during package creation
2. Served by the MCP server
3. Dependencies installed automatically
4. Hot reload in development

Security Model¶

API Key Management¶

Never stored in code
Environment variable priority
Secure MCP transport
Optional local-only mode

Audio Privacy¶

Local processing option
No cloud storage
Encrypted transport (LiveKit)
User-controlled recording

Performance Optimization¶

Caching Strategy¶

Model caching (Whisper/Kokoro)
Audio format caching
Provider health caching
Configuration caching
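As one example of these caches, provider health results could be kept for a short time so that not every request triggers a new probe. The class name `HealthCache`, the 30-second lifetime, and the injectable clock are assumptions made for this sketch.

```python
import time
from typing import Callable, Dict, Tuple


class HealthCache:
    """Remember health-check results briefly to avoid re-probing."""

    def __init__(
        self,
        probe: Callable[[str], bool],
        ttl_seconds: float = 30.0,  # assumed lifetime of a cached result
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._probe = probe
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, bool]] = {}

    def is_healthy(self, url: str) -> bool:
        now = self._clock()
        cached = self._entries.get(url)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]          # fresh enough: reuse the stored result
        result = self._probe(url)     # stale or missing: probe again
        self._entries[url] = (now, result)
        return result
```

The clock is passed in so the expiry logic can be exercised without waiting. A short lifetime limits how long a service that has just gone down is still reported as healthy.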

Resource Management¶

Lazy service loading
Connection pooling
Memory limits (systemd)
CPU throttling

Error Handling¶

Graceful Degradation¶

Primary service fails
Attempt fallback service
Use cloud service if available
Return informative error
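The sequence above amounts to trying providers in order and reporting every failure if none succeeds. A minimal sketch follows, with hypothetical names (`call_with_fallback`, `AllProvidersFailed`) and each provider modeled as a plain callable from text to audio bytes.

```python
from typing import Callable, List, Sequence, Tuple


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has failed."""


def call_with_fallback(
    providers: Sequence[Tuple[str, Callable[[str], bytes]]],
    text: str,
) -> Tuple[str, bytes]:
    """Try each (name, synthesize) pair in order; return the first success."""
    errors: List[str] = []
    for name, synthesize in providers:
        try:
            return name, synthesize(text)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    # Nothing worked: report what was tried and why each attempt failed.
    raise AllProvidersFailed("all providers failed -> " + "; ".join(errors))
```

Returning the provider name alongside the result lets the caller log which service actually handled the request.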

Recovery Mechanisms¶

Automatic service restart
Connection retry logic
Circuit breaker pattern
Health check recovery
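The circuit breaker pattern named above can be sketched as follows. This is a generic illustration of the pattern rather than VoiceMode's own code. The class names, failure threshold, and cool-down period are assumptions.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without reaching the service."""


class CircuitBreaker:
    def __init__(
        self,
        max_failures: int = 3,      # assumed threshold
        reset_after: float = 30.0,  # assumed cool-down in seconds
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; skipping call")
            self._opened_at = None  # cool-down over: allow one trial call
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()  # too many failures: open
            raise
        self._failures = 0          # success closes the circuit again
        return result
```

While the circuit is open, callers fail fast instead of waiting on a service that is known to be down. After the cool-down, a single trial call decides whether the circuit closes or reopens.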

Extension Points¶

Adding New Tools¶

Create tool in tools/ directory
Implement with FastMCP decorators
Auto-imported by server
Available via MCP
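The auto-import step might look something like the sketch below, which imports every public module in a package so that any decorators inside those modules run at import time. It uses only the standard library. The function name `import_all_modules` and the rule of skipping underscore-prefixed files are assumptions, not the server's actual behavior.

```python
import importlib
import pkgutil
from types import ModuleType
from typing import List


def import_all_modules(package_name: str) -> List[ModuleType]:
    """Import every non-private module directly inside the named package."""
    package = importlib.import_module(package_name)
    loaded: List[ModuleType] = []
    for info in pkgutil.iter_modules(package.__path__):
        if info.name.startswith("_"):
            continue  # skip private helpers such as _utils.py
        loaded.append(importlib.import_module(f"{package_name}.{info.name}"))
    return loaded
```

With this approach, adding a tool requires no change to the server itself: dropping a new module into the package is enough for it to be picked up on the next start.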

Custom Providers¶

Implement provider interface
Add discovery logic
Register in provider system
Configure endpoints

Service Integration¶

Create service installer
Add systemd/launchd templates
Implement health checks
Update CLI commands

Deployment Patterns¶

Development¶

Local services
Debug logging
Hot reload
Mock providers

Production¶

Service supervision
Log rotation
Health monitoring
Failover configuration

Containerized¶

Docker compose setup
Service orchestration
Volume management
Network isolation

Future Architecture¶

Planned Enhancements¶

Plugin system for tools
Webhook support
Multi-language support
GPU cluster support

Scalability Path¶

Distributed services
Queue-based processing
Caching layers
Load balancing