Modular Python voice I/O for AI applications. Text-to-speech, speech-to-text, and voice cloning with local-first defaults and remote provider support. One interface across Piper, Supertonic, OpenAI, OmniVoice, and more.
from abstractvoice import VoiceManager
vm = VoiceManager(language="en")
# Speak aloud
vm.speak("Hello from AbstractVoice.")
# Get audio bytes
wav = vm.speak_to_bytes("Headless TTS.", format="wav")
AbstractVoice is a modular voice I/O library providing text-to-speech (TTS), speech-to-text (STT), and optional voice cloning. It integrates with AbstractCore as a capability plugin, enabling any AI application in the ecosystem to speak and listen.
The base install uses OpenAI-compatible TTS and STT by default. Point at any compatible endpoint via OPENAI_BASE_URL or pass remote_base_url directly. No GPU or local model required to get started.
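As a rough illustration of how an endpoint could be chosen, the sketch below resolves a base URL from an explicit argument, then the OPENAI_BASE_URL environment variable, then a fallback. The helper name, the precedence order, and the fallback URL are assumptions for illustration, not AbstractVoice's documented behavior.

```python
import os

# Assumed fallback; the default AbstractVoice actually uses is not stated here.
DEFAULT_BASE_URL = "https://api.openai.com/v1"


def resolve_base_url(remote_base_url=None, env=None):
    """Hypothetical helper: explicit argument first, then OPENAI_BASE_URL, then the fallback."""
    env = os.environ if env is None else env
    candidate = remote_base_url or env.get("OPENAI_BASE_URL") or DEFAULT_BASE_URL
    # Strip a trailing slash so paths such as /audio/speech can be appended safely.
    return candidate.rstrip("/")


print(resolve_base_url(env={"OPENAI_BASE_URL": "http://localhost:8000/v1/"}))
```

Passing `env` explicitly keeps the helper easy to test without touching the real process environment.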
Install abstractvoice[apple] or abstractvoice[gpu] for the full local experience: Supertonic ONNX TTS, Piper TTS, faster-whisper STT, VAD, microphone capture, and local voice cloning engines, all running in-process.
The AbstractCore capability plugin is discovered automatically via entry points when AbstractVoice is installed alongside AbstractCore. It exposes provider/model/voice discovery, TTS/STT execution, and voice cloning through the unified capability contract.
From text-to-speech synthesis to real-time speech recognition and voice cloning, AbstractVoice covers the full voice I/O stack.
Multiple TTS engines: Supertonic 3 ONNX (local, 10 voices), Piper (local, multilingual), OpenAI-compatible HTTP (remote), AudioDiT, and OmniVoice. Buffered or streamed delivery with audio chunk smoothing.
Local STT via faster-whisper with model size selection (tiny to large). Remote STT via OpenAI-compatible endpoints. Voice Activity Detection (VAD) with webrtcvad for accurate speech boundary detection.
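To make speech boundary detection concrete, here is a minimal sketch that splits 16-bit mono PCM into 30 ms frames and groups voiced frames into time spans. It uses a simple energy threshold as a stand-in for webrtcvad so it runs without extra dependencies. The function names and the threshold are assumptions, not part of AbstractVoice.

```python
from array import array

SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000  # 480 samples per frame


def frames(pcm: bytes):
    """Yield fixed-size frames of 16-bit mono PCM, dropping a trailing partial frame."""
    step = FRAME_SAMPLES * 2  # two bytes per sample
    for start in range(0, len(pcm) - step + 1, step):
        yield pcm[start:start + step]


def is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    """Energy stand-in for a real VAD: mean absolute amplitude above a threshold."""
    samples = array("h")
    samples.frombytes(frame)
    return sum(abs(s) for s in samples) / len(samples) > threshold


def speech_segments(pcm: bytes):
    """Return (start_ms, end_ms) spans of consecutive frames flagged as speech."""
    spans, start, count = [], None, 0
    for index, frame in enumerate(frames(pcm)):
        count = index + 1
        voiced = is_speech(frame)
        if voiced and start is None:
            start = index * FRAME_MS
        elif not voiced and start is not None:
            spans.append((start, index * FRAME_MS))
            start = None
    if start is not None:  # audio ended while still inside speech
        spans.append((start, count * FRAME_MS))
    return spans


silence = array("h", [0] * FRAME_SAMPLES).tobytes()
loud = array("h", [2000] * FRAME_SAMPLES).tobytes()
print(speech_segments(silence + loud + loud + silence))  # [(30, 90)]
```

A model-based detector such as webrtcvad would replace `is_speech`, while the framing and span-grouping logic stays the same.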
Clone voices from reference audio using F5-TTS, Chroma-4B, AudioDiT, or OmniVoice engines. Clones are stored locally with bundles for portability. Reference text auto-fallback via ADR 0003.
Cross-engine voice profile abstraction with shipped presets per engine. Select voices by language, gender, or profile ID. Switching the TTS engine at runtime automatically resets the voice to the defaults for the new provider and language.
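Selection by language, gender, or profile ID can be pictured as a filter over a small catalog. The dataclass, the catalog entries, and `select_profile` below are hypothetical and only illustrate the idea. They are not the presets or the API that AbstractVoice ships.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VoiceProfile:
    profile_id: str
    engine: str
    language: str
    gender: str


# Illustrative catalog only; these entries are not the presets AbstractVoice ships.
PROFILES = [
    VoiceProfile("M1", "supertonic", "en", "male"),
    VoiceProfile("F1", "supertonic", "en", "female"),
    VoiceProfile("demo-fr", "piper", "fr", "female"),
]


def select_profile(engine, language=None, gender=None, profile_id=None) -> Optional[VoiceProfile]:
    """Return the first profile for `engine` matching every criterion that was given."""
    wanted = {"language": language, "gender": gender, "profile_id": profile_id}
    for profile in PROFILES:
        if profile.engine != engine:
            continue
        if all(value is None or getattr(profile, key) == value for key, value in wanted.items()):
            return profile
    return None


print(select_profile("supertonic", gender="female"))
```

Returning `None` when nothing matches leaves the caller free to fall back to an engine default.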
LLM-to-TTS streaming bridge via TextToSpeechStream. Incremental text chunking with sentence and soft-boundary segmentation. Audio chunk fading to eliminate clicks at boundaries.
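Fading the edges of each audio chunk is one common way to avoid clicks where chunks meet. The sketch below applies a linear fade-in and fade-out to 16-bit mono PCM. It is a generic illustration with an assumed function name and fade length, not AbstractVoice's own smoothing code.

```python
from array import array


def fade_edges(pcm: bytes, fade_samples: int = 64) -> bytes:
    """Apply a linear fade-in and fade-out to 16-bit mono PCM in native byte order."""
    samples = array("h")
    samples.frombytes(pcm)
    # Never let the two fades overlap on very short chunks.
    n = min(fade_samples, len(samples) // 2)
    for i in range(n):
        gain = i / n  # ramps from 0.0 up towards 1.0
        samples[i] = int(samples[i] * gain)
        samples[-1 - i] = int(samples[-1 - i] * gain)
    return samples.tobytes()


chunk = array("h", [1000] * 8).tobytes()
faded = array("h")
faded.frombytes(fade_edges(chunk, fade_samples=2))
print(faded.tolist())  # [0, 500, 1000, 1000, 1000, 1000, 500, 0]
```

Because every chunk starts and ends near zero amplitude, adjacent chunks join without an abrupt jump in the waveform.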
Language-aware voice selection across engines. Piper supports 20+ languages with dedicated voice packs. OmniVoice provides omnilingual TTS with zero-shot cloning across languages.
The REPL runs with allow_downloads=False, so it does not fetch models implicitly. Downloads are explicit via abstractvoice-prefetch. Once models are cached, everything works without network access.
Optional acoustic echo cancellation (AEC) for true barge-in support. Stop-phrase detection for voice-controlled interruption. Configurable voice mode callbacks for speaking behavior.
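Stop-phrase detection can be approximated by normalizing the transcript and comparing it with a short phrase list. The phrases and function names below are hypothetical. Matching the whole utterance, rather than a substring, is a design choice made here to avoid interrupting on sentences that merely contain the word "stop".

```python
import re

# Hypothetical phrase list, for illustration only.
STOP_PHRASES = ("stop", "ok stop", "be quiet")


def normalize(text: str) -> list:
    """Lowercase, drop punctuation, and split into words."""
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split()


def is_stop_phrase(transcript: str, phrases=STOP_PHRASES) -> bool:
    """True when the whole transcript is exactly one of the stop phrases."""
    words = normalize(transcript)
    return any(words == phrase.split() for phrase in phrases)


print(is_stop_phrase("OK, stop."))                       # True
print(is_stop_phrase("please do not stop the music"))    # False
```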
Get up and running with AbstractVoice in minutes. Choose between remote-first (default) and full local inference.
# Base install (remote OpenAI-compatible TTS/STT)
pip install abstractvoice
# Full local stack for macOS Apple Silicon
pip install "abstractvoice[apple]"
# Full local stack for GPU (Linux/Windows)
pip install "abstractvoice[gpu]"
# Granular extras
pip install "abstractvoice[supertonic,stt,audio-io]"
# Download local TTS models
abstractvoice-prefetch --supertonic
abstractvoice-prefetch --piper en
# Download local STT model
abstractvoice-prefetch --stt small
# Optional cloning backends
abstractvoice-prefetch --omnivoice
abstractvoice-prefetch --openf5
from abstractvoice import VoiceManager
# Remote TTS (reads OPENAI_API_KEY from env)
vm = VoiceManager(language="en")
vm.speak("Hello from AbstractVoice.")
# Get WAV bytes for headless usage
wav = vm.speak_to_bytes("Headless TTS output.", format="wav")
# Fully local TTS + STT
vm_local = VoiceManager(
    language="en",
    tts_engine="supertonic",
    stt_engine="faster_whisper",
)
vm_local.speak("Running entirely offline.")
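When working headless, it helps to inspect the returned bytes before writing or streaming them. The sketch below assumes the bytes form a complete PCM WAV container, which is what format="wav" suggests, and uses a generated second of silence as a stand-in so it runs without AbstractVoice installed. `describe_wav` is a hypothetical helper.

```python
import io
import wave


def describe_wav(data: bytes) -> dict:
    """Read basic properties from in-memory WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav_file:
        rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_width_bytes": wav_file.getsampwidth(),
            "sample_rate": rate,
            "duration_s": wav_file.getnframes() / rate,
        }


# Stand-in for the bytes returned by speak_to_bytes: one second of 16 kHz mono silence.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 16000)

print(describe_wav(buffer.getvalue()))
```

With real output, pass the `wav` value from `vm.speak_to_bytes(...)` to `describe_wav` instead of the generated buffer.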
# Interactive REPL (remote by default)
abstractvoice --verbose
# Fully local REPL
abstractvoice --tts-engine supertonic --stt-engine faster_whisper --verbose
# Local web UI (requires abstractvoice[web])
abstractvoice web --port 5000
The public API surface is centered on VoiceManager for direct usage and the AbstractCore capability plugin for ecosystem integration.
from abstractvoice import VoiceManager
vm = VoiceManager(
    language="en",
    tts_engine="openai",  # or "supertonic", "piper", "omnivoice"
    stt_engine="openai",  # or "faster_whisper"
    remote_base_url="...",  # optional OpenAI-compatible endpoint
)
# TTS
vm.speak("Hello world")
wav = vm.speak_to_bytes("Hello", format="wav")
# Engine preload/unload (local engines only)
vm.preload_tts_engine(engine="supertonic")
vm.preload_stt_engine(engine="faster_whisper")
vm.unload_tts_engine()
vm.unload_stt_engine()
# Runtime TTS switching
vm.set_tts_engine("piper")
from abstractcore import create_llm
llm = create_llm("openai")
# TTS via capability plugin
wav = llm.voice.tts(
    "Hello from AbstractCore",
    provider="openai",
    model="tts-1",
    voice="alloy",
    format="wav",
)
# STT via capability plugin
text = llm.voice.stt(audio_bytes, provider="faster-whisper", model="small")
# Provider and voice discovery
providers = llm.voice.available_providers()
voices = llm.voice.list_tts_voices(provider="supertonic")
models = llm.voice.list_tts_models(provider="openai")
# Voice cloning
llm.voice.clone_voice(reference_audio="ref.wav", name="my-clone")
# Resident model management (local engines)
llm.voice.load_resident_model(provider="supertonic")
llm.voice.list_resident_models()
llm.voice.unload_resident_model(provider="supertonic")
from abstractvoice.tts.text_to_speech_stream import TextToSpeechStream
# Bridge incremental LLM text to TTS audio
stream = TextToSpeechStream(tts_engine=vm.tts_engine)
# Feed tokens as they arrive from the LLM
for token in llm_stream:  # llm_stream: any iterable of text chunks
    stream.feed(token)
stream.finish()
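The incremental chunking such a bridge performs can be sketched in plain Python. The class below is not the TextToSpeechStream implementation. It is a minimal illustration, with assumed names and an assumed soft limit, of releasing text at sentence boundaries and falling back to a soft boundary (comma, semicolon, colon) when the buffer grows long.

```python
import re


class SentenceChunker:
    """Buffer streamed text and release speakable chunks at sentence or soft boundaries."""

    SENTENCE_END = re.compile(r"[.!?]\s")  # punctuation followed by whitespace
    SOFT_BREAK = re.compile(r"[,;:]\s")

    def __init__(self, soft_limit: int = 80):
        self.soft_limit = soft_limit
        self.buffer = ""

    def feed(self, text: str) -> list:
        """Add text and return any chunks that are now complete."""
        self.buffer += text
        chunks = []
        while True:
            match = self.SENTENCE_END.search(self.buffer)
            if match is None and len(self.buffer) > self.soft_limit:
                # No sentence end yet: cut at the last soft boundary, if there is one.
                soft = list(self.SOFT_BREAK.finditer(self.buffer))
                match = soft[-1] if soft else None
            if match is None:
                return chunks
            chunks.append(self.buffer[:match.end()].strip())
            self.buffer = self.buffer[match.end():]

    def finish(self) -> list:
        """Flush whatever text remains at the end of the stream."""
        tail, self.buffer = self.buffer.strip(), ""
        return [tail] if tail else []


chunker = SentenceChunker()
spoken = []
for token in ["Hello", " world", ".", " How", " are", " you", "?"]:
    spoken.extend(chunker.feed(token))
spoken.extend(chunker.finish())
print(spoken)  # ['Hello world.', 'How are you?']
```

Requiring whitespace after the punctuation mark delays the split until the next token arrives, which avoids cutting inside numbers such as 3.14.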
OpenAI-compatible: Remote HTTP TTS via /audio/speech. Default engine. Supports any OpenAI-compatible endpoint, including local servers.
Supertonic: Local ONNX TTS with 10 fixed profiles (M1–M5, F1–F5). Fast, lightweight, no external SDK dependency. Recommended local base TTS.
Piper: Local multilingual TTS with downloadable voice packs. Supports 20+ languages. Voice selection by language, quality tier, and speaker ID.
OmniVoice: Omnilingual local TTS with zero-shot voice cloning. Recommended cloning backend. Heavy model, so prefetch is required.
AudioDiT: LongCat-AudioDiT TTS with prompt-audio cloning. MIT-licensed vendored model code. Optional heavy backend.
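For the remote engine, the request sent to an OpenAI-compatible /audio/speech endpoint has a simple shape. The sketch below builds such a request with the standard library, following the public OpenAI speech API fields (model, input, voice, response_format). The helper name, the localhost URL, and the placeholder key are assumptions, and this is not how AbstractVoice itself issues the call.

```python
import json
import urllib.request


def build_speech_request(base_url, api_key, text, model="tts-1", voice="alloy", response_format="wav"):
    """Build (without sending) a POST request for an OpenAI-compatible /audio/speech endpoint."""
    payload = {"model": model, "input": text, "voice": voice, "response_format": response_format}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )


# Placeholder endpoint and key, for illustration only.
request = build_speech_request("http://localhost:8000/v1", "sk-placeholder", "Hello")
print(request.get_method(), request.full_url)  # POST http://localhost:8000/v1/audio/speech
```

Sending it is a single call, `urllib.request.urlopen(request).read()`, which returns the audio bytes when the server accepts the request.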