@valora-ai/voice
Framework-agnostic, fully-local voice-agent runtime — VAD → STT → LLM → TTS, one state machine, no server.
A small library for building live voice assistants that run entirely in the browser. Hosted voice-agent SDKs inspired the shape — one state enum, one reactive hook, dumb components, pluggable engines — but the body runs on-device: no rooms, no WebRTC, no server.
npm install @valora-ai/voice
Architecture
VAD ──► STT ──► [TurnDetector] ──► LLM ──► Speaker(TTS+Player)
 │                                            │
onSpeechStart                         echo-suppress VAD
(barge-in)                                    │
 └──────────────── turn token ────────────────┘
                       │
                reactive store ──► subscribe()/getSnapshot()
                    │                     │
               vanilla UI        React useVoiceAgent

- createVoiceAgent(opts): the state machine. States: loading | idle | listening | thinking | speaking. A single monotonic turn token is the only abandonment primitive: barge-in bumps it, and stale work checks myTurn !== turn.
- Reactive store: the sole notification surface, implementing the useSyncExternalStore contract (subscribe + getSnapshot). Snapshot = { state, level, segments, metrics }, where metrics.firstAudioMs is the number users actually feel, usable as an SLA gate.
- Sentence splitter: abbreviation/decimal/ellipsis-aware boundary detection, so streaming TTS doesn't break on "Dr. Smith" or "3.14".
- Speaker: owns sentence-splitting and echo-suppression (raises VAD sensitivity while speaking) and streams playback, so the agent speaks sentence 1 while the LLM is still generating sentence 2.
- TurnDetector: pluggable end-of-utterance detection. Default is pure VAD silence; heuristicTurnDetector keeps listening on trailing connectives. A maxTurnWaitMs cap commits regardless, so the user is never stranded.
- Engines: model capabilities use TranscribeEngine, InferenceEngine, and SpeechEngine from @valora-ai/provider, plus voice-owned VADEngine and PlayerEngine. The core never touches AudioContext directly.
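To make the sentence-splitter rules concrete, here is a minimal sketch of abbreviation/decimal/ellipsis-aware boundary detection over a streaming buffer. It is not the library's implementation: `splitSentences`, `SplitResult`, and the `ABBREVIATIONS` list are hypothetical names chosen for illustration, and this version simply treats an ellipsis as a non-boundary.

```typescript
// Hypothetical abbreviation list; a real splitter would need a longer one.
const ABBREVIATIONS = new Set(['dr', 'mr', 'mrs', 'ms', 'prof', 'vs', 'etc']);

interface SplitResult {
  sentences: string[]; // complete sentences, ready to hand to TTS
  rest: string; // trailing partial text, kept until more tokens arrive
}

function splitSentences(buffer: string): SplitResult {
  const sentences: string[] = [];
  let start = 0;
  for (let i = 0; i < buffer.length; i++) {
    const ch = buffer.charAt(i);
    if (ch !== '.' && ch !== '!' && ch !== '?') continue;
    // Commit only once whitespace follows the terminator. This skips decimals
    // ("3.14") and also waits when the buffer ends in "3." mid-stream.
    if (!/\s/.test(buffer.charAt(i + 1))) continue;
    if (ch === '.') {
      // Ellipsis: a dot preceded by another dot is not treated as a boundary.
      if (buffer.charAt(i - 1) === '.') continue;
      // Abbreviation: inspect the word immediately before the dot.
      const word = buffer.slice(start, i).split(/\s+/).pop() ?? '';
      if (ABBREVIATIONS.has(word.toLowerCase())) continue;
    }
    sentences.push(buffer.slice(start, i + 1).trim());
    start = i + 1;
  }
  return { sentences, rest: buffer.slice(start).replace(/^\s+/, '') };
}

const demo = splitSentences('Dr. Smith measured 3.14 meters. Wait... what? Yes');
// demo.sentences: ['Dr. Smith measured 3.14 meters.', 'Wait... what?']
// demo.rest: 'Yes'
```

In a streaming loop, the caller would prepend `rest` to the next LLM chunk and call `splitSentences` again, speaking each returned sentence as soon as it appears.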
Quick start
import { createVoiceAgent, heuristicTurnDetector } from '@valora-ai/voice';
import { BrowserPlayer } from '@valora-ai/voice/ui/vanilla';
const agent = createVoiceAgent({
vad,
stt,
llm,
tts, // pluggable engines — Valora's provider packages ship real implementations
player: new BrowserPlayer(),
turnDetector: heuristicTurnDetector,
onError: console.error,
});
agent.subscribe(() => render(agent.getSnapshot()));
await agent.start();
agent.unlock(); // call from a user gesture — resumes a suspended AudioContext
agent.interrupt(); // barge-in
agent.mute(true);
await agent.stop(); // restart-safe: start() again works
Session-shaped local realtime
Use createLocalRealtimeSession(agent) when you want realtime-session
ergonomics without hosted transport:
import { createLocalRealtimeSession } from '@valora-ai/voice';
const session = createLocalRealtimeSession(agent);
session.subscribe(() => render(session.getSnapshot()));
await session.connect();
session.sendText('turn on the lights');
session.interrupt();
await session.disconnect();
The wrapper adds status and messages to the existing agent snapshot. It does
not add WebRTC, WebSocket, SIP, tokens, or remote fallback.
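As a rough illustration of what "adds status and messages to the existing agent snapshot" can look like, here is a minimal sketch of a wrapper layered over any subscribe/getSnapshot store. Everything in it is an assumption for illustration (`Store`, `createStore`, `wrapSession`, the `Status` values, the `Message` shape, and the synchronous methods); the real wrapper may differ. The one firm constraint comes from React: `useSyncExternalStore` expects `getSnapshot` to return the same reference until something changes, so the merged snapshot is cached.

```typescript
type Listener = () => void;

interface Store<T> {
  subscribe(listener: Listener): () => void; // returns an unsubscribe function
  getSnapshot(): T;
}

// Hypothetical shapes; the real status values and message format may differ.
type Status = 'disconnected' | 'connected';
interface Message { role: 'user' | 'assistant'; text: string }
type SessionSnapshot<T> = T & { status: Status; messages: Message[] };

// Tiny stand-in for the agent's reactive store.
function createStore<T>(initial: T): Store<T> & { set(next: T): void } {
  let snapshot = initial;
  const listeners = new Set<Listener>();
  return {
    subscribe(listener) {
      listeners.add(listener);
      return () => { listeners.delete(listener); };
    },
    getSnapshot: () => snapshot,
    set(next) {
      snapshot = next;
      listeners.forEach((l) => l());
    },
  };
}

function wrapSession<T extends object>(inner: Store<T>) {
  let status: Status = 'disconnected';
  let messages: Message[] = [];
  const listeners = new Set<Listener>();
  const build = (): SessionSnapshot<T> => ({ ...inner.getSnapshot(), status, messages });
  let cached = build(); // stable reference between changes
  const refresh = () => {
    cached = build();
    listeners.forEach((l) => l());
  };
  inner.subscribe(refresh); // inner changes flow through to session subscribers
  return {
    subscribe(listener: Listener) {
      listeners.add(listener);
      return () => { listeners.delete(listener); };
    },
    getSnapshot: () => cached,
    connect() { status = 'connected'; refresh(); },
    sendText(text: string) { messages = [...messages, { role: 'user', text }]; refresh(); },
    disconnect() { status = 'disconnected'; refresh(); },
  };
}
```

Nothing here opens a connection: "connect" in this sketch only flips a local status flag, which matches the point that the session shape is ergonomic rather than a transport.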
Local actions and lifecycle events
onEvent observes state, speech, turn, segment, first-audio, interrupt,
barge-in, action, and error events. actions run after transcription + turn
detection and before LLM generation:
const agent = createVoiceAgent({
vad,
stt,
llm,
tts,
player,
onEvent: (event) => console.log(event.type),
actions: [
{
id: 'lights',
match: (text) => text.includes('lights'),
execute: ({ text }) => ({ handled: true, reply: `Done: ${text}` }),
},
],
});
If an action returns { handled: true }, the LLM is skipped and the optional
reply still uses the normal transcript, cancellation, TTS, and error path.
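The routing rule above can be sketched in a few lines. This is a simplified, hypothetical model rather than the library's code: `routeTurn` and `ActionResult` are made-up names, `generate` stands in for the LLM call, and falling through to the next action when one matches but does not report `handled` is an assumption.

```typescript
interface ActionResult {
  handled: boolean;
  reply?: string;
}

interface Action {
  id: string;
  match(text: string): boolean;
  execute(ctx: { text: string }): ActionResult | Promise<ActionResult>;
}

// Returns the text to speak (if any) for one committed user turn.
async function routeTurn(
  text: string,
  actions: Action[],
  generate: (text: string) => Promise<string>, // stand-in for the LLM call
): Promise<string | undefined> {
  for (const action of actions) {
    if (!action.match(text)) continue;
    const result = await action.execute({ text });
    // Handled: skip the LLM entirely. The reply is optional.
    if (result.handled) return result.reply;
  }
  // No action handled the turn, so fall back to generation.
  return generate(text);
}
```

Either way the caller receives a plain string (or nothing) to pass on, which is why a handled reply can share the same transcript, cancellation, and TTS path as an LLM reply.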
React
import { useVoiceAgent } from '@valora-ai/voice/react';
function Voice({ agent }) {
const { state, level, segments, metrics } = useVoiceAgent(agent); // useSyncExternalStore
}
Real in-browser engines (WebGPU)
import { BrowserPlayer, createVoiceAgent } from '@valora-ai/voice';
import { SileroVAD } from '@valora-ai/silero-vad/provider';
import { WhisperSTT } from '@valora-ai/whisper/provider';
import { KokoroTTSEngine } from '@valora-ai/kokoro/provider';
import { createLfm2Provider } from '@valora-ai/lfm2/provider';
const lfm2 = createLfm2Provider({
models: {
'team-chat': {
source: { type: 'huggingface', repo: 'LiquidAI/LFM2.5-350M-GGUF' },
},
},
});
const agent = createVoiceAgent({
vad: new SileroVAD(),
stt: await WhisperSTT.create(),
tts: await KokoroTTSEngine.create(),
player: new BrowserPlayer(),
llm: lfm2.languageModel('team-chat'),
});
Provider registry: optional picker infrastructure
createVoiceAgent does not need an AI SDK model and does not need a registry.
It consumes native Valora engines. Use the registry only when the app wants a
catalogue/picker UI: a ModelCard[] per modality, behind one loader.
import { createGemmaProvider } from '@valora-ai/gemma/provider';
import { kokoroProvider } from '@valora-ai/kokoro/provider';
import { createLfm2Provider } from '@valora-ai/lfm2/provider';
import { moonshineProvider } from '@valora-ai/moonshine-stt/provider';
import { createRegistry } from '@valora-ai/provider';
import { sileroProvider } from '@valora-ai/silero-vad/provider';
import { whisperProvider } from '@valora-ai/whisper/provider';
const lfm2 = createLfm2Provider({
auth: { type: 'bearer', token: hfToken },
});
const gemma = createGemmaProvider({
auth: { type: 'bearer', token: hfToken },
});
const models = createRegistry([
lfm2.asModelProvider(),
gemma.asModelProvider(),
moonshineProvider,
whisperProvider,
kokoroProvider,
sileroProvider,
]);
models.catalog('llm'); // ModelCard[] — drives a picker UI
await models.loadLLM('lfm2.5-230m', { onProgress }); // → InferenceEngine
You decide which providers go into your registry. Extending a provider with a
new model card (a small, reviewable PR), or adding a whole model family as a new
provider, needs no changes anywhere else. Because the concrete provider
packages are imported directly by the app, npm and bun only install the
providers you actually use. Static exports such as lfm2Provider still exist
for the built-in catalogue; configured providers use .asModelProvider() so the
registry loads the card-owned source instead of caller-supplied data.
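The "catalogue per modality, behind one loader" idea can be illustrated with a small sketch. The types and the `makeRegistry` function below are hypothetical simplifications, not the shapes exported by @valora-ai/provider: real model cards carry more fields, and the real registry exposes typed loaders such as loadLLM.

```typescript
type Modality = 'llm' | 'stt' | 'tts' | 'vad';

// Hypothetical minimal card; a real card would also describe its source.
interface ModelCard {
  id: string;
  modality: Modality;
}

interface ModelProvider {
  cards: ModelCard[];
  load(id: string): Promise<unknown>; // resolves to an engine for that card
}

function makeRegistry(providers: ModelProvider[]) {
  return {
    // All cards for one modality, across every registered provider.
    catalog(modality: Modality): ModelCard[] {
      const cards: ModelCard[] = [];
      for (const provider of providers) {
        for (const card of provider.cards) {
          if (card.modality === modality) cards.push(card);
        }
      }
      return cards;
    },
    // Find the provider that owns the card and delegate loading to it.
    async load(id: string): Promise<unknown> {
      for (const provider of providers) {
        if (provider.cards.some((card) => card.id === id)) return provider.load(id);
      }
      throw new Error(`Unknown model id: ${id}`);
    },
  };
}
```

Because the registry only sees the providers passed to it, adding a card to a provider or adding a new provider changes nothing in the registry itself.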
Two things you must get right
- Audio unlock. AudioContext starts suspended (browser autoplay policy). Call agent.unlock() from the first user gesture; BrowserPlayer also has a timeout fallback so it can't hang.
- Barge-in everywhere. Interruption must be checked between every stage. The turn token does this for free, since any stale stage just no-ops.
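The turn-token check is easy to picture with a small sketch. The names below (`turn`, `interrupt`, `runTurn`, `spoken`) are hypothetical and the pipeline is reduced to a list of async stages; the real state machine does more, but the stale-work test is the same `myTurn !== turn` comparison.

```typescript
let turn = 0; // monotonic: bumped by every new turn and every barge-in

const spoken: string[] = []; // stand-in for audio that actually got played

function interrupt(): void {
  turn++; // everything started under an older token is now stale
}

async function runTurn(
  text: string,
  stages: Array<(input: string) => Promise<string>>,
): Promise<void> {
  const myTurn = ++turn; // claim a fresh token for this turn
  let value = text;
  for (const stage of stages) {
    value = await stage(value);
    // Checked after every stage: if a barge-in or a newer turn happened
    // while this stage was running, drop the result and stop.
    if (myTurn !== turn) return;
  }
  spoken.push(value);
}
```

No stage needs its own cancellation logic in this model. A stale turn simply stops contributing once its token no longer matches.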
See the @valora-ai/voice API reference for the full surface.