
wav2vec2.cpp

High-performance C/C++ implementation of Wav2Vec 2.0 for phoneme recognition, using the GGML tensor library.

Wav2Vec 2.0 is a self-supervised speech representation learning framework from Facebook AI Research that achieves state-of-the-art results with minimal labeled data.

Note: This project was vibe coded with an AI assistant and draws heavily from whisper.cpp.

Features

  • Plain C/C++ implementation without dependencies
  • Apple Silicon first-class support (via Metal)
  • Mixed F16/F32 precision
  • Quantization support (Q4, Q5, Q6, Q8)
  • Phoneme recognition with timing information
  • CTC decoding with configurable options (a minimal greedy-decode sketch follows this list)
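
At its simplest, CTC decoding takes the per-frame argmax over the model's logits, merges consecutive repeats, and drops blank tokens. Below is a minimal sketch of that greedy collapse in C; ctc_greedy_collapse is an illustrative helper (not part of the library's API), and the blank token id is an assumption. The library's actual decoder and its options may differ.

// Greedy CTC decoding: given per-frame argmax token ids, merge
// consecutive repeats, then drop the blank token.
int ctc_greedy_collapse(const int * ids, int n_frames, int blank_id,
                        int * out, int max_out) {
    int n = 0;
    int prev = -1; // assumes valid token ids are nonnegative
    for (int i = 0; i < n_frames; i++) {
        if (ids[i] != prev && ids[i] != blank_id && n < max_out) {
            out[n++] = ids[i];
        }
        prev = ids[i];
    }
    return n; // number of collapsed token ids written to out
}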

Quick Start

Build

mkdir build && cd build
cmake ..
make -j

# Alternatively, configure with Metal support (macOS/iOS)
cmake -DGGML_METAL=ON ..
make -j

Convert Model

# Install dependencies
pip install torch transformers

# Convert HuggingFace model to GGML format
python models/convert-wav2vec2-to-ggml.py \
    facebook/wav2vec2-lv-60-espeak-cv-ft \
    models/wav2vec2-phoneme

Run

# Basic phoneme recognition
./bin/wav2vec2-cli -m models/wav2vec2-phoneme/ggml-model-f16.bin -f samples/audio.wav

# With timing information
./bin/wav2vec2-cli -m models/wav2vec2-phoneme/ggml-model-f16.bin -f samples/audio.wav --print-timestamps

Quantize

# Quantize to Q6_K (recommended: ~2.2x smaller than F16 with minimal accuracy loss; see Evaluation)
./bin/quantize-wav2vec2 models/wav2vec2-phoneme/ggml-model-f16.bin models/wav2vec2-phoneme/ggml-model-q6_k.bin q6_k

Project Structure

wav2vec2.cpp/
├── src/                    # Core library
│   ├── wav2vec2.cpp       # Main implementation
│   ├── wav2vec2-arch.h    # Architecture definitions
│   └── CMakeLists.txt
├── include/
│   └── wav2vec2.h         # Public C API
├── examples/
│   ├── wav2vec2/          # CLI tools
│   │   ├── wav2vec2-cli.cpp
│   │   └── quantize-wav2vec2.cpp
│   ├── common.cpp/h       # Shared utilities
│   └── common-ggml.cpp/h  # GGML utilities
├── models/
│   └── convert-wav2vec2-to-ggml.py
├── ggml/                   # GGML tensor library
└── cmake/

API Usage

#include <stdio.h>
#include <inttypes.h>

#include "wav2vec2.h"

// Initialize
struct wav2vec2_context_params cparams = wav2vec2_context_default_params();
cparams.use_gpu = true;

struct wav2vec2_context * ctx = wav2vec2_init_from_file("model.bin", cparams);
if (ctx == NULL) {
    fprintf(stderr, "failed to load model\n");
    return 1;
}

// Run inference
struct wav2vec2_full_params params = wav2vec2_full_default_params();
wav2vec2_full(ctx, params, samples, n_samples);

// Get results
int n_phonemes = wav2vec2_full_n_phonemes(ctx);
for (int i = 0; i < n_phonemes; i++) {
    const char * phoneme = wav2vec2_full_get_phoneme_text(ctx, i);
    int64_t t0 = wav2vec2_full_get_phoneme_t0(ctx, i);
    int64_t t1 = wav2vec2_full_get_phoneme_t1(ctx, i);
    printf("[%" PRId64 " - %" PRId64 "] %s\n", t0, t1, phoneme);
}

// Cleanup
wav2vec2_free(ctx);
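
The samples buffer above is raw mono float audio; Wav2Vec 2.0 models operate on 16 kHz input. If your source is 16-bit PCM, a minimal conversion sketch follows (the CLI presumably handles WAV loading via examples/common.cpp; pcm16_to_float is an illustrative helper, not a library function):

#include <stdint.h>

// Convert 16-bit signed PCM samples to normalized floats in [-1, 1].
static void pcm16_to_float(const int16_t * pcm, float * out, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = (float) pcm[i] / 32768.0f;
    }
}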

Evaluation

Tested on L2-ARCTIC accented English speech samples, comparing C++ output against the HuggingFace Python reference implementation.

Accuracy vs Reference

Model   PER vs Python   Notes
F16     1.0%            Near-exact parity with reference
Q6_K    1.4%            +0.4% degradation, 2.2x smaller
Q4_K    1.7%            +0.7% degradation, 3x smaller

PER = Phoneme Error Rate (edit distance / reference length)
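
Concretely, PER is the Levenshtein (edit) distance between the hypothesis and reference phoneme sequences divided by the reference length. A minimal sketch in C, working over phoneme id sequences (illustrative only; this is not the project's evaluation script):

#include <stdlib.h>

// Levenshtein distance between reference and hypothesis id sequences,
// using a single rolling row of the DP table.
static int edit_distance(const int * ref, int n_ref,
                         const int * hyp, int n_hyp) {
    int * row = (int *) malloc((n_hyp + 1) * sizeof(int));
    for (int j = 0; j <= n_hyp; j++) row[j] = j;
    for (int i = 1; i <= n_ref; i++) {
        int diag = row[0]; // D[i-1][j-1]
        row[0] = i;
        for (int j = 1; j <= n_hyp; j++) {
            int up  = row[j]; // D[i-1][j]
            int sub = diag + (ref[i-1] == hyp[j-1] ? 0 : 1);
            int ins = row[j-1] + 1;
            int del = up + 1;
            int best = sub < ins ? sub : ins;
            row[j] = best < del ? best : del;
            diag = up;
        }
    }
    int d = row[n_hyp];
    free(row);
    return d;
}

// PER = edit distance / reference length
double phoneme_error_rate(const int * ref, int n_ref,
                          const int * hyp, int n_hyp) {
    return n_ref > 0
        ? (double) edit_distance(ref, n_ref, hyp, n_hyp) / n_ref
        : 0.0;
}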

Model Size

Quantization   Size      Compression
F16            ~600 MB   1x
Q6_K           ~270 MB   2.2x
Q4_K           ~200 MB   3x

Q4_K is recommended for mobile deployment - significant size reduction with minimal accuracy loss.

References

@article{baevski2020wav2vec,
  title={wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations},
  author={Baevski, Alexei and Zhou, Henry and Mohamed, Abdelrahman and Auli, Michael},
  journal={arXiv preprint arXiv:2006.11477},
  year={2020}
}

Acknowledgments

This project draws heavily from whisper.cpp by Georgi Gerganov and contributors. The architecture, build system, and many implementation patterns are adapted from that excellent project.

License

MIT
