Voice keyboard is a demo application showcasing Deepgram's new turn-taking speech-to-text API: Flux.
A voice-controlled Linux virtual keyboard that converts speech to text and types it into any application.
Because the virtual keyboard is registered with the Linux kernel through uinput, the typed text reaches every Linux application that accepts keyboard input.
- Voice-to-Text: Real-time speech recognition using Deepgram's Flux API service (turn-taking STT)
- Virtual Keyboard: Creates a virtual input device that works with all applications
- Incremental Typing: Smart transcript updates with minimal backspacing for real-time corrections
The application solves a common Linux privilege problem:
- Virtual keyboard creation requires root access to /dev/uinput
- Audio input requires user-space access to PipeWire/PulseAudio
Solution: The application starts with root privileges, creates the virtual keyboard, then drops privileges to access the user's audio session.
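The privilege drop described above can be sketched roughly as follows. This is an illustrative sketch, not the code in main.rs: it assumes the original user is recovered from the SUDO_UID and SUDO_GID variables that sudo sets, and the names parse_sudo_ids and drop_privileges are hypothetical. Resetting supplementary groups with setgroups is also an assumption about what a clean drop involves.

```rust
use std::io;

// Minimal FFI declarations for the libc calls used below (Rust 1.82+ syntax).
// On Linux, uid_t and gid_t are 32-bit unsigned integers, and the standard
// library already links against libc, so no extra crate is needed here.
unsafe extern "C" {
    fn setgroups(size: usize, list: *const u32) -> i32;
    fn setgid(gid: u32) -> i32;
    fn setuid(uid: u32) -> i32;
}

/// Parse the invoking user's IDs from the values of SUDO_UID and SUDO_GID.
/// Returns None if either value is missing or is not a number.
pub fn parse_sudo_ids(uid: Option<&str>, gid: Option<&str>) -> Option<(u32, u32)> {
    let uid: u32 = uid?.trim().parse().ok()?;
    let gid: u32 = gid?.trim().parse().ok()?;
    Some((uid, gid))
}

/// Drop from root to the given user (hypothetical helper).
/// Order matters: the group changes must happen while the process is still
/// root, so the user ID is changed last.
pub fn drop_privileges(uid: u32, gid: u32) -> io::Result<()> {
    unsafe {
        if setgroups(1, &gid) != 0 {
            return Err(io::Error::last_os_error());
        }
        if setgid(gid) != 0 {
            return Err(io::Error::last_os_error());
        }
        if setuid(uid) != 0 {
            return Err(io::Error::last_os_error());
        }
    }
    Ok(())
}
```

A caller would create the virtual keyboard first, then pass the values of std::env::var("SUDO_UID") and std::env::var("SUDO_GID") to parse_sudo_ids and call drop_privileges before opening any audio device.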
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://rustup.rs | sh
# Install required system packages (Fedora/RHEL)
sudo dnf install alsa-lib-devel
# Install required system packages (Ubuntu/Debian)
sudo apt install libasound2-dev
git clone <repository-url>
cd voice-keyboard
cargo build
You’ll need a Deepgram API key to authenticate with Flux.
- Create or manage keys in the Deepgram console (see Deepgram's "Create additional API keys" guide).
- Export the key so the app can pick it up (recommended):
export DEEPGRAM_API_KEY="dg_your_api_key_here"
- The client sends the header Authorization: Token <DEEPGRAM_API_KEY>.
- For CI or systemd services, set DEEPGRAM_API_KEY in the environment for the service user.
- Security tip: treat API keys like passwords. Prefer env vars over committing keys to files.
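As a rough illustration of how the key could travel from the environment into the header, here is a small sketch. The function names are hypothetical and are not taken from stt_client.rs; only the variable name DEEPGRAM_API_KEY and the Token scheme come from this README.

```rust
use std::env;

/// Build the Authorization header value from a raw key (hypothetical helper).
/// Returns None for an empty or whitespace-only key so that a missing key is
/// caught before a connection is attempted.
pub fn authorization_value(api_key: &str) -> Option<String> {
    let key = api_key.trim();
    if key.is_empty() {
        None
    } else {
        Some(format!("Token {}", key))
    }
}

/// Read DEEPGRAM_API_KEY from the environment and build the header value.
pub fn authorization_from_env() -> Option<String> {
    let key = env::var("DEEPGRAM_API_KEY").ok()?;
    authorization_value(&key)
}
```

Because the key is read from the process environment, it only reaches a program started under sudo if the environment is preserved.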
Use the provided runner script:
./run.sh
# Build and run with proper privilege handling
cargo build
sudo -E ./target/debug/voice-keyboard --test-stt
Important: Always use sudo -E to preserve environment variables needed for audio access.
This application uses Deepgram Flux, the company's new turn-taking STT API. The default WebSocket URL is wss://api.preview.deepgram.com/v2/listen.
voice-keyboard [OPTIONS]
OPTIONS:
        --test-audio       Test audio input and show levels
        --test-stt         Test speech-to-text functionality (default if no other mode specified)
        --debug-stt        Debug speech-to-text (print transcripts without typing)
        --stt-url <URL>    Custom STT service URL (default: wss://api.preview.deepgram.com/v2/listen)
    -h, --help             Print help information
    -V, --version          Print version information
Note: If no mode is specified, the application defaults to --test-stt behavior.
- Initialization: Application starts with root privileges
- Virtual Keyboard: Creates /dev/uinput device as root
- Privilege Drop: Drops to original user privileges
- Audio Access: Accesses PipeWire/PulseAudio in user space
- Speech Recognition: Streams audio to Deepgram Flux STT service
- Incremental Typing: Updates text in real-time with smart backspacing
- Turn Finalization: Clears tracking on "EndOfTurn" events (user presses Enter manually)
The application provides sophisticated real-time transcript updates:
- Incremental Updates: As speech is recognized, the application updates the typed text by finding the common prefix between the current and new transcript, backspacing only the changed portion, and typing the new ending
- Smart Backspacing: Minimizes cursor movement by only removing characters that actually changed
- Turn Management: On "EndOfTurn" events, the application clears its internal tracking but doesn't automatically press Enter, allowing users to review before submitting
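The common-prefix update and the EndOfTurn reset can be sketched as below. This is a minimal illustration of the technique described in the list, not the code in virtual_keyboard.rs; TranscriptTracker and TypingUpdate are hypothetical names. The comparison works on characters rather than bytes so that multi-byte UTF-8 text is counted correctly.

```rust
/// The edit needed to turn the text already typed into the new transcript.
#[derive(Debug, PartialEq)]
pub struct TypingUpdate {
    /// Number of backspace key presses to send first.
    pub backspaces: usize,
    /// Text to type after backspacing.
    pub text: String,
}

/// Tracks what has been typed so far in the current turn (hypothetical type).
#[derive(Default)]
pub struct TranscriptTracker {
    typed: String,
}

impl TranscriptTracker {
    /// Compare the new transcript with what is on screen and return the
    /// smallest backspace-then-type edit that reconciles the two.
    pub fn update(&mut self, new_transcript: &str) -> TypingUpdate {
        // Count the leading characters the two strings share.
        let common = self
            .typed
            .chars()
            .zip(new_transcript.chars())
            .take_while(|(a, b)| a == b)
            .count();
        // Everything typed after the shared prefix has to be removed.
        let backspaces = self.typed.chars().count() - common;
        // Everything in the new transcript after the prefix has to be typed.
        let text: String = new_transcript.chars().skip(common).collect();
        self.typed = new_transcript.to_string();
        TypingUpdate { backspaces, text }
    }

    /// Called on an "EndOfTurn" event: forget the tracked text without
    /// pressing Enter, so the next turn starts from an empty baseline.
    pub fn end_of_turn(&mut self) {
        self.typed.clear();
    }
}
```

For example, if "hello word" is on screen and the next transcript is "hello world", the shared prefix is "hello wor", so the update is one backspace followed by typing "ld".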
- Endpoint: wss://api.preview.deepgram.com/v2/listen
- What it is: Flux is Deepgram's turn‑taking, low‑latency STT API designed for conversational experiences.
- Authentication: Send an Authorization header. Common forms:
  - Token <DEEPGRAM_API_KEY> (what this app uses)
  - token <DEEPGRAM_API_KEY> or Bearer <JWT> are also accepted by the platform
- Message types (each server message includes a JSON type field):
  - Connected: initial connection confirmation
  - TurnInfo: streaming transcription updates with fields: event (Update, StartOfTurn, Preflight, SpeechResumed, EndOfTurn), turn_index, audio_window_start, audio_window_end, transcript, words[] { word, confidence }, end_of_turn_confidence
  - Error: fatal error with fields: code, description (may also include a close code)
  - Configuration: echoes/acknowledges configuration (e.g., thresholds) when provided
- Client close protocol: After sending your final audio, send a control message:
{ "type": "CloseStream" }
The server will flush any remaining responses and then close the WebSocket.
- Update cadence: Flux produces updates about every 240 ms with a typical worst‑case latency of ~500 ms.
- Common query parameters (as supported by the preview spec): model, encoding, sample_rate, preflight_threshold, eot_threshold, eot_timeout_ms, keyterm, mip_opt_out, tag
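To make the endpoint, query parameters, and close protocol concrete, here is a small sketch that assembles a connection URL and holds the CloseStream control message. The helper names are hypothetical and the parameter values used in the example (linear16, 16000) are illustrative assumptions, not settings confirmed by this README.

```rust
/// Default Flux endpoint, as given in this README.
pub const DEFAULT_STT_URL: &str = "wss://api.preview.deepgram.com/v2/listen";

/// Control message to send as a text frame after the final audio chunk.
pub const CLOSE_STREAM: &str = r#"{ "type": "CloseStream" }"#;

/// Percent-encode a query value, leaving RFC 3986 unreserved characters as-is.
fn encode(value: &str) -> String {
    let mut out = String::new();
    for b in value.bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(b as char)
            }
            _ => out.push_str(&format!("%{:02X}", b)),
        }
    }
    out
}

/// Append query parameters to a base URL (hypothetical helper).
/// Parameter names are assumed to be plain identifiers and are not encoded.
pub fn build_listen_url(base: &str, params: &[(&str, &str)]) -> String {
    let query: Vec<String> = params
        .iter()
        .map(|(k, v)| format!("{}={}", k, encode(v)))
        .collect();
    if query.is_empty() {
        base.to_string()
    } else {
        format!("{}?{}", base, query.join("&"))
    }
}
```

Calling build_listen_url(DEFAULT_STT_URL, &[("encoding", "linear16"), ("sample_rate", "16000")]) appends ?encoding=linear16&sample_rate=16000 to the endpoint. Which values the service accepts for each parameter is defined by the Flux preview spec, not by this sketch.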
- Minimal Root Time: Only root during virtual keyboard creation
- Environment Preservation: Maintains user's audio session access
- Clean Privilege Drop: Properly drops both user and group privileges
- No System Changes: No permanent system configuration required
If you get "Host is down" or "I/O error" when testing audio:
- Use sudo -E: Always preserve environment variables
- Check PipeWire: Ensure PipeWire is running: systemctl --user status pipewire
- Test without sudo: Try ./target/debug/voice-keyboard --test-audio (will fail on keyboard creation but audio should work)
If you get "Permission denied" for /dev/uinput:
- Check uinput module: sudo modprobe uinput
- Verify device exists: ls -la /dev/uinput
- Use sudo: The application is designed to run with sudo -E
src/
├── main.rs # Main application and privilege dropping
├── virtual_keyboard.rs # Virtual keyboard device management
├── audio_input.rs # Audio capture and processing
├── stt_client.rs # WebSocket STT client
└── input_event.rs # Linux input event constants
- OriginalUser: Captures and restores user context
- VirtualKeyboard: Manages uinput device lifecycle with smart transcript updates
- AudioInput: Cross-platform audio capture
- SttClient: WebSocket-based speech-to-text client
- AudioBuffer: Manages audio chunking for STT streaming
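As an illustration of what audio chunking for streaming can look like, the sketch below accumulates captured samples and emits fixed-size chunks ready to send as binary WebSocket frames. It is not the project's AudioBuffer: the name AudioChunker, the f32 input format, the conversion to 16-bit little-endian PCM, and the chunk size are all assumptions made for the example.

```rust
/// Accumulates samples and emits fixed-size PCM chunks (hypothetical type).
pub struct AudioChunker {
    samples_per_chunk: usize,
    pending: Vec<i16>,
}

impl AudioChunker {
    /// `samples_per_chunk` must be greater than zero.
    pub fn new(samples_per_chunk: usize) -> Self {
        assert!(samples_per_chunk > 0, "chunk size must be positive");
        AudioChunker {
            samples_per_chunk,
            pending: Vec::new(),
        }
    }

    /// Add captured samples in the range [-1.0, 1.0] and return every chunk
    /// that is now complete, encoded as 16-bit little-endian PCM bytes.
    /// Leftover samples stay buffered until the next call.
    pub fn push(&mut self, samples: &[f32]) -> Vec<Vec<u8>> {
        for &s in samples {
            // Clamp first so out-of-range input cannot overflow the i16 range.
            let clamped = s.clamp(-1.0, 1.0);
            self.pending.push((clamped * i16::MAX as f32) as i16);
        }
        let mut chunks = Vec::new();
        while self.pending.len() >= self.samples_per_chunk {
            let chunk: Vec<u8> = self
                .pending
                .drain(..self.samples_per_chunk)
                .flat_map(|s| s.to_le_bytes())
                .collect();
            chunks.push(chunk);
        }
        chunks
    }
}
```

With a 16 kHz stream, for instance, a chunk size of 1280 samples would correspond to 80 ms of audio per frame; the value actually used by the application may differ.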
ISC License. See LICENSE.txt