feat: Add MLX Whisper Support for Apple Silicon ASR Pipeline #2366

kensteele · 2025-10-02T13:02:59Z

Description

This PR adds MLX Whisper support to the docling ASR pipeline, providing significant performance improvements on Apple Silicon devices through automatic hardware-aware model selection. The integration is transparent to users: they use the regular Whisper model specs and get MLX optimization automatically when both MPS and mlx-whisper are available.

Issue resolved by this Pull Request:
Resolves #2364

Key Features

Automatic Hardware Detection: Detects MPS (Apple Silicon) and MLX Whisper availability
Transparent Integration: Users use regular WHISPER_TURBO, WHISPER_BASE, etc.
Smart Fallback: Falls back to native Whisper when MPS or mlx-whisper is unavailable (for example, on non-Apple Silicon systems)
Complete Model Coverage: All Whisper model sizes support automatic MLX selection
CLI Enhancement: Automatic pipeline detection for audio files
Type Safety: Proper type annotations and MyPy compliance

Performance Results

Measured performance comparison on Apple Silicon (M1/M2/M3) using a 10-second audio sample:

| Model | Native Whisper (CPU) | MLX Whisper (MPS) | Speedup |
|---|---|---|---|
| whisper_tiny | 1.24 sec | 0.81 sec | 1.5x faster |
| whisper_base | 8.55 sec | 0.45 sec | 19.0x faster |
| whisper_turbo | 9.50 sec | 1.26 sec | 7.6x faster |
| Average | 6.43 sec | 0.84 sec | 7.7x faster |

Key insights:

MLX Whisper is faster for every model tested, from 1.5x (tiny) to 19.0x (base)
Larger models (base, turbo) show the most dramatic improvements
The 10-second audio sample completes in under 1 second with MLX Whisper tiny and base, and in 1.26 seconds with turbo
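
For readers who want to check the table, here is a small sketch that recomputes the speedups from the rounded timings. It suggests that the "Average" row is the ratio of the mean times rather than the mean of the per-model ratios. Recomputing from the rounded values gives about 7.5x for turbo instead of the reported 7.6x; presumably the table was derived from unrounded timings, but that is an assumption.

```python
# Timings in seconds (native CPU, MLX), copied from the table above.
timings = {
    "whisper_tiny": (1.24, 0.81),
    "whisper_base": (8.55, 0.45),
    "whisper_turbo": (9.50, 1.26),
}


def speedup(native_s: float, mlx_s: float) -> float:
    """Speedup factor of MLX over native, as a plain ratio of wall-clock times."""
    return native_s / mlx_s


per_model = {name: speedup(n, m) for name, (n, m) in timings.items()}

# The "Average" row: average each column first, then take the ratio.
avg_native = sum(n for n, _ in timings.values()) / len(timings)
avg_mlx = sum(m for _, m in timings.values()) / len(timings)
overall = speedup(avg_native, avg_mlx)

for name, ratio in per_model.items():
    print(f"{name}: {ratio:.1f}x")
print(f"average: {avg_native:.2f}s vs {avg_mlx:.2f}s -> {overall:.1f}x")
```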

Technical Implementation

1. MLX Framework Integration

class InferenceAsrFramework(str, Enum):
    MLX = "mlx"  # Now enabled
    WHISPER = "whisper"
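
As a standalone illustration of why the `str` mixin is convenient here (this is a sketch that redefines only the two members shown above, not docling's full enum): members can be looked up from a plain string, compare equal to that string, and serialize to JSON without a custom encoder.

```python
import json
from enum import Enum


class InferenceAsrFramework(str, Enum):
    # Only the two members shown in the PR description; the real enum may have more.
    MLX = "mlx"
    WHISPER = "whisper"


# Lookup by value, e.g. when parsing a CLI flag or a config file.
framework = InferenceAsrFramework("mlx")

# Thanks to the str mixin, the member is also a str and JSON-encodes as its value.
payload = json.dumps({"inference_framework": framework})
print(framework.value, payload)
```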

2. Automatic Model Selection

def _get_whisper_turbo_model():
    """Get the best Whisper Turbo model for the current hardware."""
    # Check if MPS is available (Apple Silicon)
    try:
        import torch
        has_mps = torch.backends.mps.is_built() and torch.backends.mps.is_available()
    except ImportError:
        has_mps = False
    
    # Check if mlx-whisper is available
    try:
        import mlx_whisper
        has_mlx_whisper = True
    except ImportError:
        has_mlx_whisper = False
    
    # Use MLX Whisper if both MPS and mlx-whisper are available
    if has_mps and has_mlx_whisper:
        return InlineAsrMlxWhisperOptions(
            repo_id="mlx-community/whisper-turbo",
            inference_framework=InferenceAsrFramework.MLX,
            # ... MLX-specific options
        )
    else:
        return InlineAsrNativeWhisperOptions(
            repo_id="turbo",
            inference_framework=InferenceAsrFramework.WHISPER,
            # ... Native Whisper options
        )
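
The selection rule above boils down to a two-input decision. The sketch below isolates it so it can be tested without Apple hardware. The helper names (`module_available`, `choose_framework`, `detect_mps`) are hypothetical and not part of docling, and `importlib.util.find_spec` is used here as one possible way to probe for a package without importing it.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


def choose_framework(has_mps: bool, has_mlx_whisper: bool) -> str:
    """Mirror the rule above: use MLX only when both conditions hold."""
    return "mlx" if (has_mps and has_mlx_whisper) else "whisper"


def detect_mps() -> bool:
    """Check for MPS via torch, treating a missing torch install as 'no MPS'."""
    if not module_available("torch"):
        return False
    import torch

    return torch.backends.mps.is_built() and torch.backends.mps.is_available()


# Truth table of the decision, independent of the machine this runs on.
for mps in (False, True):
    for mlx in (False, True):
        print(f"has_mps={mps} has_mlx_whisper={mlx} -> {choose_framework(mps, mlx)}")
```

On a real system the inputs would come from `detect_mps()` and `module_available("mlx_whisper")`.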

3. CLI Auto-Detection

# Auto-detect pipeline based on input file formats
if pipeline == ProcessingPipeline.STANDARD:
    # Check if any input files are audio files by extension
    audio_extensions = {'.mp3', '.wav', '.m4a', '.aac', '.ogg', '.flac', '.mp4', '.avi', '.mov'}
    for path in input_doc_paths:
        if path.suffix.lower() in audio_extensions:
            pipeline = ProcessingPipeline.ASR
            _log.info(f"Auto-detected ASR pipeline for audio file: {path}")
            break
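
The detection rule in the snippet above can be exercised in isolation. This is a minimal sketch that uses plain strings in place of docling's `ProcessingPipeline` enum (an assumption made to keep it self-contained); `detect_pipeline` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Iterable

# Same extension set as in the CLI snippet above.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".aac", ".ogg", ".flac", ".mp4", ".avi", ".mov"}


def detect_pipeline(paths: Iterable[Path], requested: str = "standard") -> str:
    """Switch to "asr" if any input looks like audio, unless a pipeline was chosen explicitly."""
    if requested != "standard":
        return requested  # an explicit choice is never overridden
    if any(p.suffix.lower() in AUDIO_EXTENSIONS for p in paths):
        return "asr"
    return "standard"


print(detect_pipeline([Path("Recording1.MP3")]))
print(detect_pipeline([Path("report.pdf")]))
```

Matching on the lower-cased suffix means `Recording1.MP3` and `Recording1.mp3` are treated the same.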

Documentation and Examples

Updated Examples

docs/examples/minimal_asr_pipeline.py: Updated to show automatic model selection
docs/examples/mlx_whisper_example.py: New example demonstrating MLX Whisper usage
docs/examples/asr_pipeline_performance_comparison.py: New performance comparison script with --audio parameter

Usage Examples

Python API:

from docling.datamodel import asr_model_specs
from docling.datamodel.pipeline_options import AsrPipelineOptions

# Automatically uses MLX Whisper on Apple Silicon!
pipeline_options = AsrPipelineOptions()
pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO

CLI:

# Automatically detects audio files and uses ASR pipeline with MLX Whisper!
docling ~/Recording1.mp3 --asr-model whisper_turbo
docling ~/Recording1.mp3 --asr-model whisper_medium
docling ~/Recording1.mp3 --asr-model whisper_base

Performance Comparison:

# Use default test audio file
python docs/examples/asr_pipeline_performance_comparison.py

# Use your own audio file
python docs/examples/asr_pipeline_performance_comparison.py --audio ~/Recording1.mp3

# Show help
python docs/examples/asr_pipeline_performance_comparison.py --help
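
For orientation, a comparison script of this kind usually wraps each transcription call in a timer. The sketch below is not the script's actual code; `time_call` and `fake_transcribe` are hypothetical stand-ins that show the measurement pattern with `time.perf_counter`.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def time_call(fn: Callable[[], T], repeats: int = 3) -> Tuple[T, float]:
    """Run fn several times; return its last result and the best wall-clock time in seconds."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        best = min(best, time.perf_counter() - start)
    return result, best


def fake_transcribe() -> str:
    """Stand-in for a real ASR call so the sketch runs anywhere."""
    time.sleep(0.01)
    return "hello world"


text, seconds = time_call(fake_transcribe)
print(f"{text!r} in {seconds:.3f}s")
```

Taking the best of several runs reduces the influence of one-off costs such as model loading or cache warm-up.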

Testing

Comprehensive Test Coverage

MLX Whisper model initialization
Automatic model selection logic
Import error handling
Transcription functionality
Pipeline integration
CLI model selection

Test Results

$ python -m pytest tests/test_asr_mlx_whisper.py tests/test_asr_pipeline.py -v
============================= test session starts ==============================
tests/test_asr_mlx_whisper.py::TestMlxWhisperIntegration::test_mlx_whisper_options_creation PASSED
tests/test_asr_mlx_whisper.py::TestMlxWhisperIntegration::test_whisper_models_auto_select_mlx PASSED
tests/test_asr_mlx_whisper.py::TestMlxWhisperIntegration::test_mlx_whisper_model_initialization PASSED
tests/test_asr_mlx_whisper.py::TestMlxWhisperIntegration::test_mlx_whisper_model_import_error PASSED
tests/test_asr_mlx_whisper.py::TestMlxWhisperIntegration::test_mlx_whisper_transcribe PASSED
tests/test_asr_mlx_whisper.py::TestMlxWhisperIntegration::test_asr_pipeline_with_mlx_whisper PASSED
tests/test_asr_pipeline.py::test_asr_pipeline_conversion PASSED
============================== 7 passed in 3.49s ===============================

Dependencies

Added MLX Whisper Dependency

# pyproject.toml
asr = [
    'mlx-whisper>=0.4.3 ; python_version >= "3.10" and sys_platform == "darwin" and platform_machine == "arm64"',
    "openai-whisper>=20250625",
]

Platform Specific: Only installed on Apple Silicon (arm64) macOS systems
Python Version: Requires Python 3.10+ (MLX requirement)
Optional: Part of the asr extra, doesn't affect core functionality
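
To make the environment marker concrete, the following sketch evaluates the same three conditions with the standard library. `mlx_whisper_marker_applies` is a hypothetical helper; installers evaluate the marker themselves, so this is only an illustration of its meaning.

```python
import platform
import sys
from typing import Tuple


def mlx_whisper_marker_applies(
    python_version: Tuple[int, int] = sys.version_info[:2],
    sys_platform: str = sys.platform,
    machine: str = platform.machine(),
) -> bool:
    """True when python_version >= 3.10, sys_platform == darwin and platform_machine == arm64."""
    return python_version >= (3, 10) and sys_platform == "darwin" and machine == "arm64"


# With no arguments, the check describes the interpreter running this code.
print(mlx_whisper_marker_applies())
```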

User Experience

ASR Pipeline

# Users just use regular Whisper models - MLX is automatic!
pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO  # Works everywhere

docling CLI

# Audio files automatically trigger ASR pipeline with MLX optimization
docling ~/Recording1.mp3 --asr-model whisper_turbo

Benefits

Performance: Up to 19x faster ASR on Apple Silicon devices
Transparency: No user configuration required
Compatibility: Works on all platforms with appropriate fallbacks
Scalability: Supports all Whisper model sizes
Reliability: Comprehensive error handling and testing
Documentation: Clear examples and usage patterns

Files Changed

docling/datamodel/pipeline_options_asr_model.py: Added MLX framework and options
docling/datamodel/asr_model_specs.py: Implemented automatic model selection
docling/pipeline/asr_pipeline.py: Added MLX Whisper model implementation
docling/cli/main.py: Added automatic pipeline detection and device configuration
docs/examples/minimal_asr_pipeline.py: Updated documentation
docs/examples/mlx_whisper_example.py: New MLX Whisper example
docs/examples/asr_pipeline_performance_comparison.py: New performance comparison script
tests/test_asr_mlx_whisper.py: Comprehensive test suite
pyproject.toml: Added MLX Whisper dependency
uv.lock: Updated dependency lock file

Checklist

Examples have been added
Tests have been added
Pre-commit checks pass (Ruff formatter, Ruff linter, MyPy, uv-lock)
All tests pass (7/7 tests successful)
Type safety ensured with proper annotations
Backward compatibility maintained
Platform-specific dependencies properly configured

Conclusion

This PR delivers a complete, transparent MLX Whisper integration that provides significant performance improvements on Apple Silicon while maintaining full backward compatibility. Users get the benefits of MLX optimization without any configuration changes, making it a true "just works" enhancement to the docling ASR pipeline.

github-actions · 2025-10-02T13:03:10Z

✅ DCO Check Passed

Thanks @kensteele, all your commits are properly signed off. 🎉

dosubot · 2025-10-02T13:03:23Z

Related Documentation

Checked 3 published document(s). No updates required.


mergify · 2025-10-02T13:03:34Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

#approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
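
The title rule can be checked locally before pushing. This sketch applies the same regular expression with Python's `re` module; `is_conventional` is a hypothetical helper, and treating `~=` as an unanchored regex search is an assumption about how the rule is evaluated (the pattern anchors itself with `^`).

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit type prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


print(is_conventional("feat: Add MLX Whisper Support for Apple Silicon ASR Pipeline"))
print(is_conventional("Add MLX Whisper support"))
```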

PeterStaar-IBM · 2025-10-03T05:17:47Z

@kensteele this is an awesome PR, thanks a ton!!

codecov · 2025-10-03T05:40:02Z

Codecov Report

❌ Patch coverage is 71.02804% with 62 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| docling/cli/main.py | 30.55% | 25 Missing ⚠️ |
| docling/datamodel/asr_model_specs.py | 78.94% | 24 Missing ⚠️ |
| docling/pipeline/asr_pipeline.py | 75.92% | 13 Missing ⚠️ |
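
As a rough cross-check of the report, the number of changed lines per file can be estimated from the missing count and the patch percentage. The totals below are inferred, not reported by Codecov; `total_lines` is a hypothetical helper. Under these estimates the three listed files account for 204 of about 214 changed lines, which would mean the remaining changed lines are fully covered.

```python
def total_lines(missing: int, patch_pct: float) -> int:
    """Estimate changed lines from the missing count and the covered percentage."""
    return round(missing / (1 - patch_pct / 100))


# (missing lines, patch coverage %) per file, copied from the report above.
report = {
    "docling/cli/main.py": (25, 30.55),
    "docling/datamodel/asr_model_specs.py": (24, 78.94),
    "docling/pipeline/asr_pipeline.py": (13, 75.92),
}

totals = {path: total_lines(m, pct) for path, (m, pct) in report.items()}
missing_sum = sum(m for m, _ in report.values())
patch_total = total_lines(62, 71.02804)

for path, total in totals.items():
    print(f"{path}: ~{total} changed lines")
print(f"missing: {missing_sum}, estimated patch size: ~{patch_total} lines")
```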


PeterStaar-IBM · 2025-10-03T06:52:27Z

@kensteele I think you might need to add some restriction on the tests docs/examples/asr_pipeline_performance_comparison.py for the unit-tests to go through.

kensteele · 2025-10-04T10:51:11Z

> @kensteele I think you might need to add some restriction on the tests docs/examples/asr_pipeline_performance_comparison.py for the unit-tests to go through.

@PeterStaar-IBM Hopefully that should clear the tests! 🤞

kensteele · 2025-10-05T21:52:24Z

> @kensteele this is an awesome PR, thanks a ton!!

@PeterStaar-IBM You bet - happy to contribute more!

Looks like all the checks have passed - possible to get a review and merge? @dolfim-ibm @cau-git


kensteele · 2025-10-07T22:38:23Z

@PeterStaar-IBM Please review the latest commit f114d45 which addresses your comments in the previous review as well as:

Adds comprehensive support for the following additional audio input formats:

m4a
aac
ogg
flac
mp4
avi
mov

Adds support for the following additional MIME types:

audio/mp4
audio/m4a
audio/aac
audio/ogg
audio/flac
audio/x-flac
video/mp4
video/avi
video/x-msvideo
video/quicktime

Adds additional sample audio files:

audio and video files added to tests/data/audio/ for ASR testing:

sample_10s_audio-aac.aac
sample_10s_audio-flac.flac
sample_10s_audio-m4a.m4a
sample_10s_audio-mp3.mp3
sample_10s_audio-mp4.m4a
sample_10s_audio-mpeg.mp3
sample_10s_audio-ogg.ogg
sample_10s_audio-wav.wav
sample_10s_audio-x-flac.flac
sample_10s_audio-x-wav.wav

sample_10s_video-avi.avi
sample_10s_video-mp4.mp4
sample_10s_video-quicktime.mov
sample_10s_video-x-msvideo.avi
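
The sample file names appear to follow a `sample_10s_<type>-<subtype>.<ext>` pattern, so each name encodes the MIME type it is meant to exercise. That naming convention is inferred from the listing and is not stated in the PR. Under that assumption, the sketch below derives an extension-to-MIME-type table from the names; it is not docling's `FormatToMimeType` mapping.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Set, Tuple

SAMPLES = [
    "sample_10s_audio-aac.aac", "sample_10s_audio-flac.flac",
    "sample_10s_audio-m4a.m4a", "sample_10s_audio-mp3.mp3",
    "sample_10s_audio-mp4.m4a", "sample_10s_audio-mpeg.mp3",
    "sample_10s_audio-ogg.ogg", "sample_10s_audio-wav.wav",
    "sample_10s_audio-x-flac.flac", "sample_10s_audio-x-wav.wav",
    "sample_10s_video-avi.avi", "sample_10s_video-mp4.mp4",
    "sample_10s_video-quicktime.mov", "sample_10s_video-x-msvideo.avi",
]


def mime_from_sample(name: str) -> Tuple[str, str]:
    """Split a sample file name into (extension, MIME type) using the assumed pattern."""
    path = PurePosixPath(name)
    kind, subtype = path.stem.removeprefix("sample_10s_").split("-", 1)
    return path.suffix, f"{kind}/{subtype}"


ext_to_mimes: Dict[str, Set[str]] = defaultdict(set)
for sample in SAMPLES:
    ext, mime = mime_from_sample(sample)
    ext_to_mimes[ext].add(mime)

for ext in sorted(ext_to_mimes):
    print(ext, sorted(ext_to_mimes[ext]))
```

Several extensions map to more than one MIME type (for example `.flac` to both `audio/flac` and `audio/x-flac`), which is why the test data contains two files for some formats.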

kensteele · 2025-10-07T22:42:07Z

@PeterStaar-IBM @cau-git @dolfim-ibm Looks like I need two reviewers per the Mergify requirements: When test data is updated, we require two reviewers

PeterStaar-IBM

🎖️

dolfim-ibm · 2025-10-13T08:47:57Z

@kensteele the current CI failures seem to be caused by the lock file picking up a new version of av, which only has wheels for Python 3.10. I suggest pinning av<16.0.0.


I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: a979a68
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 9827068
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: ebbeb45
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 2f6fd3c
Signed-off-by: Ken Steele <ksteele@gmail.com>


…els.py

- Move audio file extensions from CLI hardcoded set to FormatToExtensions[InputFormat.AUDIO]
- Add support for additional audio formats: m4a, aac, ogg, flac, mp4, avi, mov
- Update FormatToMimeType mapping to include MIME types for all audio formats
- Update CLI auto-detection to use centralized FormatToExtensions mapping
- Add comprehensive tests for audio file auto-detection and pipeline selection
- Ensure explicit pipeline choices are not overridden by auto-detection

Fixes issue where only .mp3 and .wav files were processed as audio despite CLI auto-detection working for all formats. The document converter now properly recognizes all audio formats through MIME type detection.

Addresses review comments:

- Centralizes audio extensions in base_models.py as suggested
- Maintains existing auto-detection behavior while using centralized data
- Adds proper test coverage for the audio detection functionality

All examples and tests pass with the new centralized approach. All audio formats (mp3, wav, m4a, aac, ogg, flac, mp4, avi, mov) now work correctly.

Signed-off-by: Ken Steele <ksteele@gmail.com>

@dolfim-ibm

…explicit model options

Review feedback addressed:

1. Fix CLI auto-detection to only switch to ASR pipeline when ALL files are audio
   - Previously switched if ANY file was audio, now requires ALL files to be audio
   - Added warning for mixed file types with guidance to use --pipeline asr
2. Add explicit WHISPER_X_MLX and WHISPER_X_NATIVE model options
   - Users can now force specific implementations if desired
   - Auto-selecting models (WHISPER_BASE, etc.) still choose best for hardware
   - Added 12 new explicit model options: _MLX and _NATIVE variants for each size

CLI now supports:

- Auto-selecting: whisper_tiny, whisper_base, etc. (choose best for hardware)
- Explicit MLX: whisper_tiny_mlx, whisper_base_mlx, etc. (force MLX)
- Explicit Native: whisper_tiny_native, whisper_base_native, etc. (force native)

Addresses reviewer comments from @dolfim-ibm

Signed-off-by: Ken Steele <ksteele@gmail.com>
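
The revised rule from this commit message (switch to ASR only when all inputs are audio, warn on mixed inputs) can be sketched as follows. Plain strings stand in for docling's pipeline enum, the warning text is invented for illustration, and `choose_pipeline` and `explicit_variants` are hypothetical helpers rather than the merged implementation.

```python
from pathlib import Path
from typing import List, Tuple

AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".aac", ".ogg", ".flac", ".mp4", ".avi", ".mov"}


def choose_pipeline(paths: List[Path]) -> Tuple[str, str]:
    """Return (pipeline, warning); the warning is empty unless inputs are mixed."""
    flags = [p.suffix.lower() in AUDIO_EXTENSIONS for p in paths]
    if flags and all(flags):
        return "asr", ""
    if any(flags):
        # Illustrative wording only; the real CLI message may differ.
        return "standard", "Mixed audio and non-audio inputs; pass --pipeline asr to force ASR."
    return "standard", ""


def explicit_variants(base: str) -> List[str]:
    """Names of the forced-implementation options for one auto-selecting model."""
    return [f"{base}_mlx", f"{base}_native"]


print(choose_pipeline([Path("a.mp3"), Path("b.wav")]))
print(choose_pipeline([Path("a.mp3"), Path("report.pdf")]))
print(explicit_variants("whisper_tiny"))
```

Requiring every input to be audio avoids silently sending a PDF through the ASR pipeline when a single recording happens to be in the same batch.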

I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: c60e72d
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 9480331
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 21905e8
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 96c669d
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 8371c06
Signed-off-by: Ken Steele <ksteele@gmail.com>

dolfim-ibm

lgtm

…ompts

- tests/test_asr_mlx_whisper.py: verify explicit MLX options (framework, repo ids)
- tests/test_asr_pipeline.py: cover _has_text/_determine_status and backend support with proper InputDocument/NoOpBackend wiring
- tests/test_interfaces.py: add BaseVlmPageModel.formulate_prompt tests (RAW/NONE/CHAT, invalid style), with minimal InlineVlmOptions scaffold

Improves reliability of ASR and VLM components by validating configuration paths and helper logic.

Signed-off-by: Ken Steele <ksteele@gmail.com>

PeterStaar-IBM

🎖️

dolfim-ibm · 2025-10-20T08:31:39Z

note: there seems to be some temporary issue with the HF artifacts. we will retry launching the CI in this PR (and others) later today.

kensteele · 2025-10-20T08:57:59Z

> note: there seems to be some temporary issue with the HF artifacts. we will retry launching the CI in this PR (and others) later today.

@dolfim-ibm @PeterStaar-IBM While we're waiting on the HF artifacts issue, can I get two quick reviews/approvals on the code coverage additions I just committed to pass the codecov test @ f3a2ba2?

PeterStaar-IBM requested review from PeterStaar-IBM, cau-git and dolfim-ibm October 3, 2025 05:16

PeterStaar-IBM previously approved these changes Oct 6, 2025


kensteele dismissed PeterStaar-IBM’s stale review via f114d45 October 7, 2025 22:22

kensteele requested a review from PeterStaar-IBM October 7, 2025 22:56

PeterStaar-IBM previously approved these changes Oct 10, 2025


kensteele dismissed PeterStaar-IBM’s stale review via 712b586 October 13, 2025 00:34

kensteele force-pushed the dev/add-mlx-whisper-support branch from f114d45 to 712b586 Compare October 13, 2025 00:34

dolfim-ibm reviewed Oct 14, 2025


kensteele added 9 commits October 18, 2025 11:00

add mlx-whisper support

c60e72d

added mlx-whisper example and test. update docling cli to use MLX aut…

9480331

…omatically if present.

fix pre-commit checks and added proper type safety

21905e8

fixed linter issue

96c669d

fix unit tests and code coverage for CI

8371c06

DCO Remediation Commit for Ken Steele <ksteele@gmail.com>

fd8fd44

I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 5e61bf1 Signed-off-by: Ken Steele <ksteele@gmail.com>

fix CI example test - mlx_whisper_example.py defaults to tests/data/a…

9f60a44

…udio/sample_10s.mp3 if no args specified. Signed-off-by: Ken Steele <ksteele@gmail.com>

kensteele force-pushed the dev/add-mlx-whisper-support branch from ffc1a57 to fec4f33 Compare October 18, 2025 18:02

kensteele requested review from PeterStaar-IBM and dolfim-ibm October 18, 2025 18:05

dolfim-ibm previously approved these changes Oct 20, 2025


PeterStaar-IBM previously approved these changes Oct 20, 2025


kensteele dismissed stale reviews from PeterStaar-IBM and dolfim-ibm via f3a2ba2 October 20, 2025 08:50
