High-performance C/C++ implementation of Wav2Vec 2.0 for phoneme recognition, using the GGML tensor library.
Wav2Vec 2.0 is a self-supervised speech representation learning framework from Facebook AI Research that achieves state-of-the-art results with minimal labeled data.
Note: This project was vibe coded with an AI assistant and draws heavily from whisper.cpp.
- Plain C/C++ implementation without dependencies
- First-class Apple Silicon support (via Metal)
- Mixed F16/F32 precision
- Quantization support (Q4, Q5, Q6, Q8)
- Phoneme recognition with timing information
- CTC decoding with configurable options (see the sketch below)
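
To illustrate the CTC decoding step mentioned above: CTC produces one token per audio frame, and decoding collapses consecutive repeats and drops blank tokens. The sketch below is a minimal greedy decoder under assumed names (`logits`, `blank_id`); it is not the library's actual decoder, which exposes its configurable options through the C API.

```cpp
// Minimal greedy CTC decode sketch (illustrative only, not the library's API).
// Assumes `logits` holds per-frame scores of shape [n_frames][n_tokens] and
// that the CTC blank token has id 0.
#include <cstddef>
#include <vector>

std::vector<int> ctc_greedy_decode(const std::vector<std::vector<float>> & logits,
                                   int blank_id = 0) {
    std::vector<int> out;
    int prev = blank_id;
    for (const auto & frame : logits) {
        // argmax over the token dimension for this frame
        int best = 0;
        for (size_t t = 1; t < frame.size(); ++t) {
            if (frame[t] > frame[best]) best = (int) t;
        }
        // CTC rule: drop blanks and collapse consecutive repeats
        if (best != blank_id && best != prev) {
            out.push_back(best);
        }
        prev = best;
    }
    return out;
}
```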
mkdir build && cd build
cmake ..
make -j
# With Metal support (macOS/iOS)
cmake -DGGML_METAL=ON ..
make -j

# Install dependencies
pip install torch transformers
# Convert HuggingFace model to GGML format
python models/convert-wav2vec2-to-ggml.py \
    facebook/wav2vec2-lv-60-espeak-cv-ft \
    models/wav2vec2-phoneme

# Basic phoneme recognition
./bin/wav2vec2-cli -m models/wav2vec2-phoneme/ggml-model-f16.bin -f samples/audio.wav
# With timing information
./bin/wav2vec2-cli -m models/wav2vec2-phoneme/ggml-model-f16.bin -f samples/audio.wav --print-timestamps

# Quantize to Q6_K (recommended: ~2.2x smaller than F16 with minimal accuracy loss)
./bin/quantize-wav2vec2 models/wav2vec2-phoneme/ggml-model-f16.bin models/wav2vec2-phoneme/ggml-model-q6_k.bin q6_k

wav2vec2.cpp/
├── src/                            # Core library
│   ├── wav2vec2.cpp                # Main implementation
│   ├── wav2vec2-arch.h             # Architecture definitions
│   └── CMakeLists.txt
├── include/
│   └── wav2vec2.h                  # Public C API
├── examples/
│   ├── wav2vec2/                   # CLI tools
│   │   ├── wav2vec2-cli.cpp
│   │   └── quantize-wav2vec2.cpp
│   ├── common.cpp/h                # Shared utilities
│   └── common-ggml.cpp/h           # GGML utilities
├── models/
│   └── convert-wav2vec2-to-ggml.py
├── ggml/                           # GGML tensor library
└── cmake/
#include "wav2vec2.h"
// Initialize
struct wav2vec2_context_params cparams = wav2vec2_context_default_params();
cparams.use_gpu = true;
struct wav2vec2_context * ctx = wav2vec2_init_from_file("model.bin", cparams);
// Run inference on raw audio samples (16 kHz mono PCM)
struct wav2vec2_full_params params = wav2vec2_full_default_params();
wav2vec2_full(ctx, params, samples, n_samples);
// Get results
int n_phonemes = wav2vec2_full_n_phonemes(ctx);
for (int i = 0; i < n_phonemes; i++) {
    const char * phoneme = wav2vec2_full_get_phoneme_text(ctx, i);
    int64_t t0 = wav2vec2_full_get_phoneme_t0(ctx, i);
    int64_t t1 = wav2vec2_full_get_phoneme_t1(ctx, i);
    printf("[%lld - %lld] %s\n", (long long) t0, (long long) t1, phoneme);
}
// Cleanup
wav2vec2_free(ctx);
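
The `samples` buffer above is raw mono audio at 16 kHz, the rate wav2vec 2.0 models expect. The repository's `examples/common.cpp` provides the shared utilities used by the CLI tools; if you integrate the C API directly, a naive loader along the following lines can fill that buffer. This is only an illustrative sketch, assuming a canonical 44-byte WAV header and 16-bit mono 16 kHz input, with no validation or resampling.

```cpp
// Illustrative sketch only: load 16-bit mono PCM WAV data into a float buffer.
// Assumes a canonical 44-byte header and 16 kHz mono input; not the project's
// actual audio loader.
#include <cstdint>
#include <cstdio>
#include <vector>

static bool load_wav_mono16(const char * path, std::vector<float> & samples) {
    FILE * f = fopen(path, "rb");
    if (!f) return false;

    // Skip the (assumed) 44-byte RIFF/WAVE header
    uint8_t header[44];
    if (fread(header, 1, sizeof(header), f) != sizeof(header)) { fclose(f); return false; }

    // Read the remainder as 16-bit little-endian PCM
    std::vector<int16_t> pcm;
    int16_t buf[4096];
    size_t n;
    while ((n = fread(buf, sizeof(int16_t), 4096, f)) > 0) {
        pcm.insert(pcm.end(), buf, buf + n);
    }
    fclose(f);

    // Convert to float in [-1, 1]
    samples.resize(pcm.size());
    for (size_t i = 0; i < pcm.size(); ++i) {
        samples[i] = pcm[i] / 32768.0f;
    }
    return true;
}
```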
Tested on L2-ARCTIC accented English speech samples, comparing the C++ output against the HuggingFace Python reference implementation.

| Model | PER vs Python | Notes |
|---|---|---|
| F16 | 1.0% | Near-exact parity with reference |
| Q6_K | 1.4% | +0.4% degradation, 2.2x smaller |
| Q4_K | 1.7% | +0.7% degradation, 3x smaller |
PER = Phoneme Error Rate (edit distance / reference length)
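
For reference, the sketch below shows the standard way such a rate is computed: Levenshtein edit distance between the hypothesis and reference phoneme sequences, divided by the reference length. It illustrates the formula above and is not the evaluation script used to produce these numbers.

```cpp
// Illustrative sketch: Phoneme Error Rate = edit distance / reference length.
#include <algorithm>
#include <string>
#include <vector>

static double phoneme_error_rate(const std::vector<std::string> & ref,
                                 const std::vector<std::string> & hyp) {
    const size_t n = ref.size(), m = hyp.size();
    // dp[i][j] = edit distance between ref[0..i) and hyp[0..j)
    std::vector<std::vector<size_t>> dp(n + 1, std::vector<size_t>(m + 1, 0));
    for (size_t i = 0; i <= n; ++i) dp[i][0] = i;
    for (size_t j = 0; j <= m; ++j) dp[0][j] = j;
    for (size_t i = 1; i <= n; ++i) {
        for (size_t j = 1; j <= m; ++j) {
            const size_t sub = dp[i-1][j-1] + (ref[i-1] == hyp[j-1] ? 0 : 1);
            dp[i][j] = std::min({ sub, dp[i-1][j] + 1, dp[i][j-1] + 1 });
        }
    }
    return n == 0 ? 0.0 : (double) dp[n][m] / (double) n;
}
```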
| Quantization | Size | Compression |
|---|---|---|
| F16 | ~600 MB | 1x |
| Q6_K | ~270 MB | 2.2x |
| Q4_K | ~200 MB | 3x |
Q4_K is recommended for mobile deployment - significant size reduction with minimal accuracy loss.
@article{baevski2020wav2vec,
title={wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations},
author={Baevski, Alexei and Zhou, Henry and Mohamed, Abdelrahman and Auli, Michael},
journal={arXiv preprint arXiv:2006.11477},
year={2020}
}

This project draws heavily from whisper.cpp by Georgi Gerganov and contributors. The architecture, build system, and many implementation patterns are adapted from that excellent project.
MIT