unit-of-trustobserve-ui
← back to the lab

local tutor.

Fully local voice tutor for Mandarin and French — speak, it listens, replies, and talks back. Nothing leaves the machine.

local tutor
Claude Code/FastAPI/React/MLX Whisper/Kokoro/LM Studio

What it is

A hands-free voice tutor for practicing Mandarin and French, running entirely on my Mac. I talk; it transcribes with Whisper (Metal-native via MLX), a local LLM through LM Studio replies in the target language, and Kokoro speaks the reply back — streamed sentence by sentence so the first audio lands fast. A difficulty slider pitches the tutor from "most common words, tiny sentences" to natural native phrasing, and a speed slider slows the voice down to learner pace. No accounts, no cloud, no telemetry; conversations live in a local SQLite file.

Why I made it

I wanted real conversation practice without shipping my fumbling beginner Mandarin to a server somewhere — and without paying per minute for it. The first version had a fatal flaw every language learner will recognize: it treated my thinking pauses as the end of my turn and cut me off mid-sentence. That problem turned out to be the most interesting part of the build.

How I built it

Claude Code orchestrated the whole rewrite with subagent waves — an implementer plus spec and quality reviewers per task. The browser owns the mic and runs Silero VAD locally (ONNX in the page, no CDN); a FastAPI server pipelines STT → LLM → sentence-chunked TTS over a WebSocket, with barge-in support so I can interrupt the tutor.

The pause problem got a proper fix: instead of a flat silence window, an 8MB end-of-turn classifier (smart-turn v3, ~8ms on CPU) listens after each short pause and decides whether I sound finished. Finished sentence → commits in a quarter second; trailing off mid-thought → it waits, speculatively transcribing in the background so the eventual reply is faster.

What I learned

Mixed-language speech breaks TTS in sneaky ways. The Chinese G2P silently dropped English words without an explicit English phonemizer wired in, and espeak's French mode wrapped embedded English in (en)...(fr) switch flags that the voice model happily pronounced as stray sounds around every word. Both were one-line-of-insight fixes that took real listening to find — local audio pipelines fail audibly, not loudly.