Skip to content

deepgram/voice-keyboard-linux

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

22 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Voice Keyboard

Voice keyboard is a demo application showcasing Deepgram's new turn-taking speech-to-text API: Flux.

A voice-controlled Linux virtual keyboard that converts speech to text and types it into any application.

As a result of directly targeting Linux as a driver, this works with all Linux applications.

Features

  • Voice-to-Text: Real-time speech recognition using Deepgram's Flux API service (turn-taking STT)
  • Virtual Keyboard: Creates a virtual input device that works with all applications
  • Incremental Typing: Smart transcript updates with minimal backspacing for real-time corrections

Architecture

The application solves a common Linux privilege problem:

  • Virtual keyboard creation requires root access to /dev/uinput
  • Audio input requires user-space access to PipeWire/PulseAudio

Solution: The application starts with root privileges, creates the virtual keyboard, then drops privileges to access the user's audio session.

Installation

Prerequisites

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://rustup.rs | sh

# Install required system packages (Fedora/RHEL)
sudo dnf install alsa-lib-devel

# Install required system packages (Ubuntu/Debian)
sudo apt install libasound2-dev

Build

git clone <repository-url>
cd voice-keyboard
cargo build

Acquire a Deepgram API key

You’ll need a Deepgram API key to authenticate with Flux.

  • Create or manage keys in the Deepgram console: Create additional API keys
  • Export the key so the app can pick it up (recommended):
    export DEEPGRAM_API_KEY="dg_your_api_key_here"
  • The client sends the header Authorization: Token <DEEPGRAM_API_KEY>.
  • For CI or systemd services, set DEEPGRAM_API_KEY in the environment for the service user.
  • Security tip: treat API keys like passwords. Prefer env vars over committing keys to files.

Usage

Easy Method (Recommended)

Use the provided runner script:

./run.sh

Manual Method

# Build and run with proper privilege handling
cargo build
sudo -E ./target/debug/voice-keyboard --test-stt

Important: Always use sudo -E to preserve environment variables needed for audio access.

Speech-to-Text Service

This application uses Deepgram Flux, the company's new turn‑taking STT API. The default WebSocket URL is wss://api.preview.deepgram.com/v2/listen.

Command Line Options

voice-keyboard [OPTIONS]

OPTIONS:
    --test-audio        Test audio input and show levels
    --test-stt          Test speech-to-text functionality (default if no other mode specified)
    --debug-stt         Debug speech-to-text (print transcripts without typing)
    --stt-url <URL>     Custom STT service URL (default: wss://api.preview.deepgram.com/v2/listen)
    -h, --help          Print help information
    -V, --version       Print version information

Note: If no mode is specified, the application defaults to --test-stt behavior.

How It Works

  1. Initialization: Application starts with root privileges
  2. Virtual Keyboard: Creates /dev/uinput device as root
  3. Privilege Drop: Drops to original user privileges
  4. Audio Access: Accesses PipeWire/PulseAudio in user space
  5. Speech Recognition: Streams audio to Deepgram Flux STT service
  6. Incremental Typing: Updates text in real-time with smart backspacing
  7. Turn Finalization: Clears tracking on "EndOfTurn" events (user presses Enter manually)

Transcript Handling

The application provides sophisticated real-time transcript updates:

  • Incremental Updates: As speech is recognized, the application updates the typed text by finding the common prefix between the current and new transcript, backspacing only the changed portion, and typing the new ending
  • Smart Backspacing: Minimizes cursor movement by only removing characters that actually changed
  • Turn Management: On "EndOfTurn" events, the application clears its internal tracking but doesn't automatically press Enter, allowing users to review before submitting

About Deepgram Flux (Early Access)

  • Endpoint: wss://api.preview.deepgram.com/v2/listen
  • What it is: Flux is Deepgram's turn‑taking, low‑latency STT API designed for conversational experiences.
  • Authentication: Send an Authorization header. Common forms:
    • Token <DEEPGRAM_API_KEY> (what this app uses)
    • token <DEEPGRAM_API_KEY> or Bearer <JWT> are also accepted by the platform
  • Message types (each server message includes a JSON type field):
    • Connected — initial connection confirmation
    • TurnInfo — streaming transcription updates with fields: event (Update, StartOfTurn, Preflight, SpeechResumed, EndOfTurn), turn_index, audio_window_start, audio_window_end, transcript, words[] { word, confidence }, end_of_turn_confidence
    • Error — fatal error with fields: code, description (may also include a close code)
    • Configuration — echoes/acknowledges configuration (e.g., thresholds) when provided
  • Client close protocol: After sending your final audio, send a control message:
    • { "type": "CloseStream" } The server will flush any remaining responses and then close the WebSocket.
  • Update cadence: Flux produces updates about every 240 ms with a typical worst‑case latency of ~500 ms.
  • Common query parameters (as supported by the preview spec):
    • model, encoding, sample_rate, preflight_threshold, eot_threshold, eot_timeout_ms, keyterm, mip_opt_out, tag

Security

  • Minimal Root Time: Only root during virtual keyboard creation
  • Environment Preservation: Maintains user's audio session access
  • Clean Privilege Drop: Properly drops both user and group privileges
  • No System Changes: No permanent system configuration required

Troubleshooting

Audio Issues

If you get "Host is down" or "I/O error" when testing audio:

  1. Use sudo -E: Always preserve environment variables
  2. Check PipeWire: Ensure PipeWire is running: systemctl --user status pipewire
  3. Test without sudo: Try ./target/debug/voice-keyboard --test-audio (will fail on keyboard creation but audio should work)

Permission Issues

If you get "Permission denied" for /dev/uinput:

  1. Check uinput module: sudo modprobe uinput
  2. Verify device exists: ls -la /dev/uinput
  3. Use sudo: The application is designed to run with sudo -E

Development

Project Structure

src/
├── main.rs              # Main application and privilege dropping
├── virtual_keyboard.rs  # Virtual keyboard device management
├── audio_input.rs       # Audio capture and processing
├── stt_client.rs        # WebSocket STT client
└── input_event.rs       # Linux input event constants

Key Components

  • OriginalUser: Captures and restores user context
  • VirtualKeyboard: Manages uinput device lifecycle with smart transcript updates
  • AudioInput: Cross-platform audio capture
  • SttClient: WebSocket-based speech-to-text client
  • AudioBuffer: Manages audio chunking for STT streaming

License

ISC License. See LICENSE.txt

About

Linux virtual keyboard driver which types what you say using Deepgram Flux STT API

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •