@valora-ai/voice

Framework-agnostic, fully local voice-agent runtime: VAD → STT → LLM → TTS, one state machine, no server.

A small library for building live voice assistants that run entirely in the browser. Hosted voice-agent SDKs inspired the shape — one state enum, one reactive hook, dumb components, pluggable engines — but the body runs on-device: no rooms, no WebRTC, no server.

npm install @valora-ai/voice

Architecture

     VAD ──► STT ──► [TurnDetector] ──► LLM ──► Speaker(TTS+Player)
      │                                            │
onSpeechStart                                 echo-suppress VAD
(barge-in)                                         │
      └──────────────── turn token ────────────────┘
                           │
                 reactive store  ──►  subscribe()/getSnapshot()
                           │                    │
                      vanilla UI          React useVoiceAgent

createVoiceAgent(opts) — the state machine. States: loading | idle | listening | thinking | speaking. A single monotonic turn token is the only abandonment primitive — barge-in bumps it, stale work checks myTurn !== turn.
Reactive store — the sole notification surface, implementing the useSyncExternalStore contract (subscribe + getSnapshot). Snapshot = { state, level, segments, metrics }, where metrics.firstAudioMs is the number users actually feel — usable as an SLA gate.
Sentence splitter — abbreviation/decimal/ellipsis-aware boundary detection, so streaming TTS doesn't break on "Dr. Smith" or "3.14".
Speaker — owns sentence splitting and echo suppression (raises the VAD threshold while speaking, so the agent's own voice does not trigger barge-in) and streams playback: the agent speaks sentence 1 while the LLM is still generating sentence 2.
TurnDetector — pluggable end-of-utterance detection. Default is pure VAD silence; heuristicTurnDetector keeps listening on trailing connectives. A maxTurnWaitMs cap commits regardless, so the user is never stranded.
Engines — model capabilities use TranscribeEngine, InferenceEngine, and SpeechEngine from @valora-ai/provider, plus voice-owned VADEngine and PlayerEngine. The core never touches AudioContext directly.
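
To make the sentence splitter's job concrete, here is a minimal sketch of abbreviation-, decimal-, and ellipsis-aware boundary detection. The function name splitSentences and the abbreviation list are hypothetical and are not the library's API; the real splitter works on a stream and is presumably more thorough.

```typescript
// Hypothetical sketch, not the library's implementation.
const ABBREVIATIONS = new Set(['dr', 'mr', 'mrs', 'ms', 'prof', 'st', 'vs', 'etc']);

function splitSentences(text: string): string[] {
  const sentences: string[] = [];
  let start = 0;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (ch !== '.' && ch !== '!' && ch !== '?') continue;
    const next = text[i + 1];
    // Ellipsis or repeated punctuation: wait for the last mark in the run.
    if (next === '.' || next === '!' || next === '?') continue;
    // A boundary needs whitespace or end of text after it; this skips decimals like "3.14".
    if (next !== undefined && !/\s/.test(next)) continue;
    if (ch === '.') {
      // Skip abbreviations such as "Dr." by checking the word just before the period.
      const word = text.slice(start, i).split(/\s+/).pop() ?? '';
      if (ABBREVIATIONS.has(word.toLowerCase())) continue;
    }
    const sentence = text.slice(start, i + 1).trim();
    if (sentence) sentences.push(sentence);
    start = i + 1;
  }
  // Whatever is left has no terminator yet; a streaming splitter would hold it back.
  const rest = text.slice(start).trim();
  if (rest) sentences.push(rest);
  return sentences;
}
```

In a streaming setting each completed sentence would be handed to TTS as soon as its boundary is confirmed, which is what lets playback start before the LLM has finished.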

Quick start

import { createVoiceAgent, heuristicTurnDetector } from '@valora-ai/voice';
import { BrowserPlayer } from '@valora-ai/voice/ui/vanilla';

const agent = createVoiceAgent({
  vad,
  stt,
  llm,
  tts, // pluggable engines — Valora's provider packages ship real implementations
  player: new BrowserPlayer(),
  turnDetector: heuristicTurnDetector,
  onError: console.error,
});

agent.subscribe(() => render(agent.getSnapshot()));
await agent.start();
agent.unlock(); // call from a user gesture — resumes a suspended AudioContext
agent.interrupt(); // barge-in
agent.mute(true);
await agent.stop(); // restart-safe — start() again works
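
The heuristicTurnDetector passed above keeps listening when the transcript ends on a connective, subject to the maxTurnWaitMs cap. The following sketch shows one way such a rule could work. The TurnContext shape, the 'commit' | 'wait' return values, and the word list are assumptions for illustration; the library's actual TurnDetector interface may differ.

```typescript
// Hypothetical shapes for illustration only.
interface TurnContext {
  transcript: string; // text transcribed so far for this utterance
  waitedMs: number; // time elapsed since the VAD first reported silence
}

type TurnDecision = 'commit' | 'wait';

const TRAILING_CONNECTIVES = new Set(['and', 'but', 'or', 'so', 'because', 'then']);

function makeHeuristicDetector(maxTurnWaitMs: number) {
  return (ctx: TurnContext): TurnDecision => {
    // The cap always wins, so the user is never left waiting indefinitely.
    if (ctx.waitedMs >= maxTurnWaitMs) return 'commit';
    const words = ctx.transcript.trim().toLowerCase().split(/\s+/);
    // Strip trailing punctuation so "and," still counts as a connective.
    const last = (words[words.length - 1] ?? '').replace(/[^a-z']/g, '');
    // A trailing connective suggests the speaker is mid-thought.
    return TRAILING_CONNECTIVES.has(last) ? 'wait' : 'commit';
  };
}
```

With a cap of 1500 ms, "turn on the lights and" keeps the turn open until the cap elapses, while "turn on the lights" commits on the first silence.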

Session-shaped local realtime

Use createLocalRealtimeSession(agent) when you want realtime-session ergonomics without hosted transport:

import { createLocalRealtimeSession } from '@valora-ai/voice';

const session = createLocalRealtimeSession(agent);
session.subscribe(() => render(session.getSnapshot()));

await session.connect();
session.sendText('turn on the lights');
session.interrupt();
await session.disconnect();

The wrapper adds status and messages to the existing agent snapshot. It does not add WebRTC, WebSocket, SIP, tokens, or remote fallback.

Local actions and lifecycle events

onEvent observes state, speech, turn, segment, first-audio, interrupt, barge-in, action, and error events. actions run after transcription + turn detection and before LLM generation:

const agent = createVoiceAgent({
  vad,
  stt,
  llm,
  tts,
  player,
  onEvent: (event) => console.log(event.type),
  actions: [
    {
      id: 'lights',
      match: (text) => text.includes('lights'),
      execute: ({ text }) => ({ handled: true, reply: `Done: ${text}` }),
    },
  ],
});

If an action returns { handled: true }, the LLM is skipped for that turn. The optional reply still goes through the normal transcript, cancellation, TTS, and error paths.
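
The ordering described here (actions first, LLM only if nothing handled the turn) can be sketched as follows. The Action and ActionResult types mirror the example above but are assumptions, and respond and generate are hypothetical names, not part of the library.

```typescript
// Hypothetical types mirroring the example above; the library's real types may differ.
interface ActionResult {
  handled: boolean;
  reply?: string;
}

interface Action {
  id: string;
  match: (text: string) => boolean;
  execute: (ctx: { text: string }) => ActionResult | Promise<ActionResult>;
}

// Returns the text to speak, or undefined when a handled action supplied no reply.
async function respond(
  text: string,
  actions: Action[],
  generate: (text: string) => Promise<string>, // stand-in for the LLM call
): Promise<string | undefined> {
  for (const action of actions) {
    if (!action.match(text)) continue;
    const result = await action.execute({ text });
    // A handled action short-circuits: the LLM is never called for this turn.
    if (result.handled) return result.reply;
  }
  // No action claimed the turn, so fall through to generation.
  return generate(text);
}
```

An action that matches but returns { handled: false } does not block the LLM in this sketch; whether the library behaves the same way is not stated above.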

React

import { useVoiceAgent } from '@valora-ai/voice/react';

function Voice({ agent }) {
  const { state, level, segments, metrics } = useVoiceAgent(agent); // useSyncExternalStore
  return <p>{state}</p>; // render from the snapshot; level, segments, and metrics are available too
}
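
React's useSyncExternalStore needs only two things from a store: a subscribe function that returns an unsubscribe callback, and a getSnapshot that returns the same reference until something changes. A minimal sketch of a store meeting that contract is below. createStore and set are hypothetical names, and the snapshot is trimmed to two fields for brevity.

```typescript
// Hypothetical minimal store; the real snapshot also carries segments and metrics.
interface Snapshot {
  state: string;
  level: number;
}

function createStore(initial: Snapshot) {
  let snapshot = initial;
  const listeners = new Set<() => void>();
  return {
    // Returns the same object until set() replaces it; React compares by reference.
    getSnapshot: (): Snapshot => snapshot,
    subscribe(listener: () => void): () => void {
      listeners.add(listener);
      return () => {
        listeners.delete(listener);
      };
    },
    set(patch: Partial<Snapshot>): void {
      snapshot = { ...snapshot, ...patch }; // a new reference signals a change
      listeners.forEach((listener) => listener());
    },
  };
}
```

The same two functions serve a vanilla UI directly, which is why a single store can back both rendering paths.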

Real in-browser engines (WebGPU)

import { createVoiceAgent } from '@valora-ai/voice';
import { BrowserPlayer } from '@valora-ai/voice/ui/vanilla';
import { SileroVAD } from '@valora-ai/silero-vad/provider';
import { WhisperSTT } from '@valora-ai/whisper/provider';
import { KokoroTTSEngine } from '@valora-ai/kokoro/provider';
import { createLfm2Provider } from '@valora-ai/lfm2/provider';

const lfm2 = createLfm2Provider({
  models: {
    'team-chat': {
      source: { type: 'huggingface', repo: 'LiquidAI/LFM2.5-350M-GGUF' },
    },
  },
});

const agent = createVoiceAgent({
  vad: new SileroVAD(),
  stt: await WhisperSTT.create(),
  tts: await KokoroTTSEngine.create(),
  player: new BrowserPlayer(),
  llm: lfm2.languageModel('team-chat'),
});

Provider registry — optional picker infrastructure

createVoiceAgent does not need an AI SDK model and does not need a registry. It consumes native Valora engines. Use the registry only when the app wants a catalogue/picker UI: a ModelCard[] per modality, behind one loader.

import { createGemmaProvider } from '@valora-ai/gemma/provider';
import { kokoroProvider } from '@valora-ai/kokoro/provider';
import { createLfm2Provider } from '@valora-ai/lfm2/provider';
import { moonshineProvider } from '@valora-ai/moonshine-stt/provider';
import { createRegistry } from '@valora-ai/provider';
import { sileroProvider } from '@valora-ai/silero-vad/provider';
import { whisperProvider } from '@valora-ai/whisper/provider';

const lfm2 = createLfm2Provider({
  auth: { type: 'bearer', token: hfToken },
});
const gemma = createGemmaProvider({
  auth: { type: 'bearer', token: hfToken },
});

const models = createRegistry([
  lfm2.asModelProvider(),
  gemma.asModelProvider(),
  moonshineProvider,
  whisperProvider,
  kokoroProvider,
  sileroProvider,
]);

models.catalog('llm'); // ModelCard[] — drives a picker UI
await models.loadLLM('lfm2.5-350m', { onProgress }); // → InferenceEngine

You decide which providers go into your registry. Extending a provider with a new model card (a small, reviewable PR), or adding a whole model family as a new provider, needs no changes anywhere else. Because the concrete provider packages are imported directly by the app, npm and bun only install the providers you actually use. Static exports such as lfm2Provider still exist for the built-in catalogue; configured providers use .asModelProvider() so the registry loads the card-owned source instead of caller-supplied data.
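
As a rough illustration of why adding a card or a provider needs no changes elsewhere, here is a sketch of a registry that only aggregates what its providers declare. The ModelCard and ModelProvider shapes, the modality names, and createMiniRegistry are all hypothetical; the real types live in @valora-ai/provider and are likely richer.

```typescript
// Hypothetical shapes for illustration only.
type Modality = 'llm' | 'stt' | 'tts' | 'vad';

interface ModelCard {
  id: string;
  modality: Modality;
}

interface ModelProvider {
  cards: ModelCard[];
  load: (id: string) => Promise<unknown>;
}

function createMiniRegistry(providers: ModelProvider[]) {
  return {
    // Collect the cards of one modality across every registered provider.
    catalog(modality: Modality): ModelCard[] {
      const cards: ModelCard[] = [];
      for (const provider of providers) {
        for (const card of provider.cards) {
          if (card.modality === modality) cards.push(card);
        }
      }
      return cards;
    },
    // Route a load to whichever provider owns the card.
    load(id: string): Promise<unknown> {
      const owner = providers.find((p) => p.cards.some((c) => c.id === id));
      if (!owner) throw new Error(`No provider registered for model "${id}"`);
      return owner.load(id);
    },
  };
}
```

Because the registry holds no model knowledge of its own, a new card shows up in catalog() as soon as its provider is passed in.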

Two things you must get right

Audio unlock. AudioContext starts suspended (browser autoplay policy). Call agent.unlock() from the first user gesture; BrowserPlayer also has a timeout fallback so it can't hang.
Barge-in everywhere. Interruption must be checked between every stage — the turn token does this for free, since any stale stage just no-ops.
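
The turn-token pattern behind the second point can be sketched in a few lines. The names turn and myTurn follow the description in the Architecture section; runTurn, interrupt, and the stage list are hypothetical and only illustrate the idea.

```typescript
// Hypothetical sketch of the turn-token pattern; not the library's code.
let turn = 0;

// Barge-in (or any interrupt) invalidates all in-flight work with one increment.
function interrupt(): void {
  turn++;
}

async function runTurn(
  stages: Array<() => Promise<void>>,
  onDone: () => void,
): Promise<void> {
  const myTurn = ++turn; // claim a fresh token for this turn
  for (const stage of stages) {
    await stage();
    // Checked between every stage: stale work stops without extra cancellation plumbing.
    if (myTurn !== turn) return;
  }
  onDone();
}
```

If interrupt() fires while the second of three stages is running, the third stage never starts and onDone is never called, because the token captured at the start no longer matches.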

See the @valora-ai/voice API reference for the full surface.
