AbstractVoice — Documentation

Overview

Voice I/O for the Abstract Ecosystem

AbstractVoice is a modular voice I/O library providing text-to-speech (TTS), speech-to-text (STT), and optional voice cloning. It integrates with AbstractCore as a capability plugin, enabling any AI application in the ecosystem to speak and listen.

Remote-First Default

The base install uses OpenAI-compatible TTS and STT by default. Point it at any compatible endpoint by setting the OPENAI_BASE_URL environment variable or by passing remote_base_url directly. No GPU or local model is required to get started.
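
As a rough illustration of how the two configuration routes could interact, the sketch below resolves an endpoint from an explicit argument, then the environment, then a default. The helper name resolve_base_url and the precedence order (explicit argument first) are assumptions made for this example, not documented AbstractVoice behavior.

```python
import os
from typing import Mapping, Optional

# OpenAI's public API base URL, used here as the last-resort default
DEFAULT_BASE_URL = "https://api.openai.com/v1"


def resolve_base_url(
    remote_base_url: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Hypothetical helper: explicit argument, then OPENAI_BASE_URL, then the default."""
    candidate = remote_base_url or env.get("OPENAI_BASE_URL") or DEFAULT_BASE_URL
    # Strip trailing slashes so paths such as /audio/speech can be appended safely
    return candidate.rstrip("/")
```

Passing env as a parameter keeps the helper easy to test without mutating the real process environment.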

Local Inference Stack

Install abstractvoice[apple] or abstractvoice[gpu] for the full local experience: Supertonic ONNX TTS, Piper TTS, faster-whisper STT, VAD, microphone capture, and local voice cloning engines, all running in-process.

AbstractCore Plugin

The plugin is discovered automatically via entry points when installed alongside AbstractCore. It exposes provider/model/voice discovery, TTS/STT execution, and voice cloning through the unified capability contract.

Features

Complete Voice Pipeline

From text-to-speech synthesis to real-time speech recognition and voice cloning, AbstractVoice covers the full voice I/O stack.

Text-to-Speech

Multiple TTS engines: Supertonic 3 ONNX (local, 10 voices), Piper (local, multilingual), OpenAI-compatible HTTP (remote), AudioDiT, and OmniVoice. Buffered or streamed delivery with audio chunk smoothing.

Speech-to-Text

Local STT via faster-whisper with model size selection (tiny to large). Remote STT via OpenAI-compatible endpoints. Voice Activity Detection (VAD) with webrtcvad for accurate speech boundary detection.
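
To make the idea of speech boundary detection concrete, here is a minimal sketch that labels fixed-length frames as speech or silence and merges consecutive speech frames into spans. It deliberately swaps in a simple energy threshold as a stand-in for webrtcvad so that it runs with the standard library only. The function names and the threshold value are assumptions for illustration.

```python
import math
import struct
from typing import List, Tuple


def frame_rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit little-endian mono PCM frame."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def speech_segments(
    pcm: bytes,
    sample_rate: int = 16000,
    frame_ms: int = 30,
    threshold: float = 500.0,  # assumed value; tune for the microphone in use
) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) spans whose frames reach the energy threshold."""
    frame_bytes = sample_rate * frame_ms // 1000 * 2  # 2 bytes per sample
    n_frames = len(pcm) // frame_bytes
    segments: List[Tuple[int, int]] = []
    start = None
    for i in range(n_frames):
        frame = pcm[i * frame_bytes:(i + 1) * frame_bytes]
        is_speech = frame_rms(frame) >= threshold
        if is_speech and start is None:
            start = i * frame_ms  # speech begins at this frame
        elif not is_speech and start is not None:
            segments.append((start, i * frame_ms))  # speech ended before this frame
            start = None
    if start is not None:
        segments.append((start, n_frames * frame_ms))
    return segments
```

A real VAD such as webrtcvad also works on short fixed-length frames of 16-bit mono PCM, but classifies them with a trained model rather than raw energy, which makes it far more robust to background noise.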

Voice Cloning

Clone voices from reference audio using the F5-TTS, Chroma-4B, AudioDiT, or OmniVoice engines. Clones are stored locally, with bundles for portability. Reference text falls back automatically, as specified in ADR 0003.

Voice Profiles

Cross-engine voice profile abstraction with shipped presets per engine. Select voices by language, gender, or profile ID. Switching the TTS engine at runtime automatically resets the voice to the provider and language defaults.
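
The selection criteria above can be pictured as filters over a profile catalog. The sketch below is only an illustration: the VoiceProfile dataclass, its field names, the select_profiles function, and the catalog entries (including their language tags) are all hypothetical and do not reflect the library's real data model.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class VoiceProfile:
    profile_id: str
    engine: str
    language: str
    gender: str


# Illustrative catalog; IDs and language tags are placeholders
PROFILES = [
    VoiceProfile("M1", "supertonic", "en", "male"),
    VoiceProfile("F1", "supertonic", "en", "female"),
    VoiceProfile("fr-demo", "piper", "fr", "female"),
]


def select_profiles(
    profiles: Sequence[VoiceProfile],
    *,
    profile_id: Optional[str] = None,
    language: Optional[str] = None,
    gender: Optional[str] = None,
) -> List[VoiceProfile]:
    """Keep the profiles that match every filter that was supplied."""
    return [
        p for p in profiles
        if (profile_id is None or p.profile_id == profile_id)
        and (language is None or p.language == language)
        and (gender is None or p.gender == gender)
    ]
```

Treating an omitted filter as "match anything" lets the same function serve lookups by ID, by language, or by any combination of criteria.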

Streaming Pipeline

LLM-to-TTS streaming bridge via TextToSpeechStream. Incremental text chunking with sentence and soft-boundary segmentation. Audio chunk fading to eliminate clicks at boundaries.
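
Clicks at chunk boundaries typically come from the waveform jumping abruptly to or from a non-zero sample. One common remedy is a short fade at each edge. The sketch below applies a linear fade-in and fade-out to 16-bit mono PCM. The function name fade_chunk, the linear ramp, and the default ramp length are assumptions for illustration and may differ from what AbstractVoice does internally.

```python
import struct


def fade_chunk(pcm: bytes, fade_samples: int = 64) -> bytes:
    """Apply a linear fade-in and fade-out to 16-bit little-endian mono PCM."""
    count = len(pcm) // 2
    samples = list(struct.unpack(f"<{count}h", pcm[: count * 2]))
    # Cap the ramp at half the chunk so the two fades never overlap
    n = min(fade_samples, count // 2)
    for i in range(n):
        gain = i / n  # 0.0 at the chunk edge, rising toward 1.0 inward
        samples[i] = int(samples[i] * gain)
        samples[count - 1 - i] = int(samples[count - 1 - i] * gain)
    return struct.pack(f"<{count}h", *samples)
```

A ramp of a few milliseconds is usually short enough to be inaudible as a volume change while still removing the discontinuity.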

Multilingual

Language-aware voice selection across engines. Piper supports 20+ languages with dedicated voice packs. OmniVoice provides omnilingual TTS with zero-shot cloning across languages.

Offline-First

The REPL runs with allow_downloads=False, so it never fetches models implicitly. Downloads are explicit via abstractvoice-prefetch. Once models are cached, everything works without network access.
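
The offline-first behavior amounts to failing fast when a model is missing instead of silently downloading it. The sketch below shows that pattern in isolation. The function resolve_model_path, the exception class, and the cache layout are hypothetical; only the allow_downloads flag and the abstractvoice-prefetch command come from this documentation.

```python
from pathlib import Path


class ModelNotCachedError(RuntimeError):
    """Raised when a model is missing and downloads are not allowed."""


def resolve_model_path(
    cache_dir: Path, model_name: str, allow_downloads: bool = False
) -> Path:
    """Return the cached model path, refusing to download unless permitted."""
    path = cache_dir / model_name
    if path.exists():
        return path
    if not allow_downloads:
        raise ModelNotCachedError(
            f"{model_name} is not cached; run abstractvoice-prefetch first"
        )
    # The download step is intentionally left out of this sketch
    raise NotImplementedError("downloading is outside the scope of this example")
```

Raising a dedicated exception lets a caller show an actionable message rather than a generic network error.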

Echo Cancellation

Optional acoustic echo cancellation (AEC) for true barge-in support. Stop-phrase detection for voice-controlled interruption. Configurable voice mode callbacks for speaking behavior.
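
Stop-phrase detection can be approximated by normalizing a transcript and checking for a whole-word phrase match. The sketch below is a minimal illustration under that assumption. The phrase list, the function names, and the matching strategy are hypothetical and are not taken from AbstractVoice.

```python
import re
from typing import List, Sequence

STOP_PHRASES = ("stop", "be quiet")  # placeholder phrases for illustration


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation, and split into words."""
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split()


def contains_stop_phrase(
    transcript: str, phrases: Sequence[str] = STOP_PHRASES
) -> bool:
    """True if any phrase appears in the transcript as a run of whole words."""
    words = normalize(transcript)
    for phrase in phrases:
        target = normalize(phrase)
        n = len(target)
        # Skip phrases that normalize to nothing, then slide a window over the words
        if target and any(
            words[i:i + n] == target for i in range(len(words) - n + 1)
        ):
            return True
    return False
```

Matching on whole words avoids false triggers from words that merely contain a stop phrase, such as "unstoppable".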

Getting Started

Install & First Words

Get up and running with AbstractVoice in minutes. Choose between remote-first (default) or full local inference.

Installation

# Base install (remote OpenAI-compatible TTS/STT)
pip install abstractvoice

# Full local stack for macOS Apple Silicon
pip install "abstractvoice[apple]"

# Full local stack for GPU (Linux/Windows)
pip install "abstractvoice[gpu]"

# Granular extras
pip install "abstractvoice[supertonic,stt,audio-io]"

Prefetch Models (Offline-Friendly)

# Download local TTS models
abstractvoice-prefetch --supertonic
abstractvoice-prefetch --piper en

# Download local STT model
abstractvoice-prefetch --stt small

# Optional cloning backends
abstractvoice-prefetch --omnivoice
abstractvoice-prefetch --openf5

Quick Start (Python)

from abstractvoice import VoiceManager

# Remote TTS (reads OPENAI_API_KEY from env)
vm = VoiceManager(language="en")
vm.speak("Hello from AbstractVoice.")

# Get WAV bytes for headless usage
wav = vm.speak_to_bytes("Headless TTS output.", format="wav")

# Fully local TTS + STT
vm_local = VoiceManager(
    language="en",
    tts_engine="supertonic",
    stt_engine="faster_whisper",
)
vm_local.speak("Running entirely offline.")
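
When working headless, it can be useful to sanity-check the returned bytes before writing or streaming them. The sketch below reads the header of in-memory WAV data with the standard library. It assumes the bytes form a standard PCM WAV container, which the format="wav" argument suggests but this page does not spell out. The helper name describe_wav is hypothetical.

```python
import io
import wave


def describe_wav(data: bytes) -> dict:
    """Read basic header fields from in-memory WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_rate": rate,
            "sample_width_bytes": wav_file.getsampwidth(),
            "duration_s": frames / rate,
        }
```

For instance, describe_wav(wav) on the bytes from the snippet above would report the sample rate and duration without touching the filesystem.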

CLI & REPL

# Interactive REPL (remote by default)
abstractvoice --verbose

# Fully local REPL
abstractvoice --tts-engine supertonic --stt-engine faster_whisper --verbose

# Local web UI (requires abstractvoice[web])
abstractvoice web --port 5000

API Reference

Key Classes & Methods

The public API surface is centered on VoiceManager for direct usage and the AbstractCore capability plugin for ecosystem integration.

VoiceManager — Core Façade

from abstractvoice import VoiceManager

vm = VoiceManager(
    language="en",
    tts_engine="openai",       # or "supertonic", "piper", "omnivoice"
    stt_engine="openai",       # or "faster_whisper"
    remote_base_url="...",     # optional OpenAI-compatible endpoint
)

# TTS
vm.speak("Hello world")
wav = vm.speak_to_bytes("Hello", format="wav")

# Engine preload/unload (local engines only)
vm.preload_tts_engine(engine="supertonic")
vm.preload_stt_engine(engine="faster_whisper")
vm.unload_tts_engine()
vm.unload_stt_engine()

# Runtime TTS switching
vm.set_tts_engine("piper")

AbstractCore Plugin Integration

from abstractcore import create_llm

llm = create_llm("openai")

# TTS via capability plugin
wav = llm.voice.tts(
    "Hello from AbstractCore",
    provider="openai",
    model="tts-1",
    voice="alloy",
    format="wav",
)

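# STT via capability plugin
# audio_bytes holds the audio to transcribe; "input.wav" is a placeholder path
with open("input.wav", "rb") as f:
    audio_bytes = f.read()
text = llm.voice.stt(audio_bytes, provider="faster-whisper", model="small")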

# Provider and voice discovery
providers = llm.voice.available_providers()
voices = llm.voice.list_tts_voices(provider="supertonic")
models = llm.voice.list_tts_models(provider="openai")

# Voice cloning
llm.voice.clone_voice(reference_audio="ref.wav", name="my-clone")

# Resident model management (local engines)
llm.voice.load_resident_model(provider="supertonic")
llm.voice.list_resident_models()
llm.voice.unload_resident_model(provider="supertonic")

Streaming TTS (LLM → Voice)

from abstractvoice.tts.text_to_speech_stream import TextToSpeechStream

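# Bridge incremental LLM text to TTS audio
# (vm is a VoiceManager instance, created as shown above)
stream = TextToSpeechStream(tts_engine=vm.tts_engine)

# Feed tokens as they arrive from the LLM
# (llm_stream is a placeholder for any iterable of text chunks from your LLM client)
for token in llm_stream:
    stream.feed(token)
# Signal that no more text will arrive
stream.finish()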
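
Before streamed tokens can be synthesized, they have to be grouped into speakable chunks. The sketch below illustrates sentence segmentation with a soft-boundary fallback on commas, semicolons, and colons. The IncrementalChunker class, its regular expressions, and the soft_limit default are assumptions made for this example; they mirror the feed/finish shape used above but are not the internals of TextToSpeechStream.

```python
import re
from typing import List

SENTENCE_END = re.compile(r"[.!?]+\s+")  # sentence punctuation followed by whitespace
SOFT_BREAK = re.compile(r"[,;:]\s+")     # weaker boundaries used for long buffers


class IncrementalChunker:
    """Hypothetical helper: buffers streamed text and emits speakable chunks."""

    def __init__(self, soft_limit: int = 80) -> None:
        self.soft_limit = soft_limit
        self.buffer = ""

    def feed(self, text: str) -> List[str]:
        """Add text and return any chunks that are now complete."""
        self.buffer += text
        chunks: List[str] = []
        while True:
            match = SENTENCE_END.search(self.buffer)
            if match is None and len(self.buffer) > self.soft_limit:
                # No sentence end yet and the buffer is long: cut at the last soft boundary
                soft = list(SOFT_BREAK.finditer(self.buffer))
                match = soft[-1] if soft else None
            if match is None:
                return chunks
            chunks.append(self.buffer[:match.end()].strip())
            self.buffer = self.buffer[match.end():]

    def finish(self) -> List[str]:
        """Flush whatever text remains at the end of the stream."""
        rest, self.buffer = self.buffer.strip(), ""
        return [rest] if rest else []
```

Requiring whitespace after the punctuation avoids splitting inside tokens such as "3.14", at the cost of holding the final sentence until finish() is called.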

Available TTS Engines

OpenAI / Compatible

Remote HTTP TTS via /audio/speech. Default engine. Supports any OpenAI-compatible endpoint including local servers.

Supertonic 3

Local ONNX TTS with 10 fixed profiles (M1–M5, F1–F5). Fast, lightweight, no external SDK dependency. Recommended local base TTS.

Piper

Local multilingual TTS with downloadable voice packs. Supports 20+ languages. Voice selection by language, quality tier, and speaker ID.

OmniVoice

Omnilingual local TTS with zero-shot voice cloning. Recommended cloning backend. Heavy model — prefetch required.

AudioDiT

LongCat-AudioDiT TTS with prompt-audio cloning. MIT-licensed vendored model code. Optional heavy backend.