kensteele
Description

This PR adds MLX Whisper support to the docling ASR pipeline, providing significant performance improvements on Apple Silicon devices through automatic hardware-aware model selection. The integration is transparent to users: they use the regular Whisper models and get MLX optimization automatically when both MPS and mlx-whisper are available.

Issue resolved by this Pull Request:
Resolves #2364

Key Features

  • Automatic Hardware Detection: Detects MPS (Apple Silicon) and MLX Whisper availability
  • Transparent Integration: Users use regular WHISPER_TURBO, WHISPER_BASE, etc.
  • Smart Fallback: Falls back to native Whisper on non-Apple Silicon systems
  • Complete Model Coverage: All Whisper model sizes support automatic MLX selection
  • CLI Enhancement: Automatic pipeline detection for audio files
  • Type Safety: Proper type annotations and MyPy compliance

Performance Results

Measured performance comparison on Apple Silicon (M1/M2/M3) using a 10-second audio sample:

Model         | Native Whisper (CPU) | MLX Whisper (MPS) | Speedup
whisper_tiny  | 1.24 sec             | 0.81 sec          | 1.5x faster
whisper_base  | 8.55 sec             | 0.45 sec          | 19.0x faster
whisper_turbo | 9.50 sec             | 1.26 sec          | 7.6x faster
Average       | 6.43 sec             | 0.84 sec          | 7.7x faster

Key insights:

  • MLX Whisper is faster for every model tested, from 1.5x (tiny) to 19.0x (base)
  • Larger models (base, turbo) show the most dramatic improvements
  • The 10-second audio sample completes in under 1 second with MLX Whisper tiny and base; turbo takes 1.26 seconds
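
The speedup and average figures follow from the two timing columns. A minimal sketch of that arithmetic, using the timings reported above; the names `timings` and `speedup` are illustrative and not part of docling. Because the reported timings are rounded, a recomputed ratio can differ from the printed column in the last digit.

```python
# Timings in seconds copied from the table above: (native CPU, MLX).
timings = {
    "whisper_tiny": (1.24, 0.81),
    "whisper_base": (8.55, 0.45),
    "whisper_turbo": (9.50, 1.26),
}


def speedup(native_sec: float, mlx_sec: float) -> float:
    """Return how many times faster the MLX run was than the native run."""
    return native_sec / mlx_sec


# Per-model ratio of native time to MLX time.
speedups = {name: speedup(*pair) for name, pair in timings.items()}

# Column averages, as in the "Average" row.
avg_native = sum(native for native, _ in timings.values()) / len(timings)
avg_mlx = sum(mlx for _, mlx in timings.values()) / len(timings)

for name, value in speedups.items():
    print(f"{name}: {value:.1f}x")
print(f"average: {avg_native:.2f} sec native, {avg_mlx:.2f} sec MLX")
```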

Technical Implementation

1. MLX Framework Integration

from enum import Enum


class InferenceAsrFramework(str, Enum):
    MLX = "mlx"  # Now enabled
    WHISPER = "whisper"

2. Automatic Model Selection

def _get_whisper_turbo_model():
    """Get the best Whisper Turbo model for the current hardware."""
    # Check if MPS is available (Apple Silicon)
    try:
        import torch
        has_mps = torch.backends.mps.is_built() and torch.backends.mps.is_available()
    except ImportError:
        has_mps = False
    
    # Check if mlx-whisper is available
    try:
        import mlx_whisper
        has_mlx_whisper = True
    except ImportError:
        has_mlx_whisper = False
    
    # Use MLX Whisper if both MPS and mlx-whisper are available
    if has_mps and has_mlx_whisper:
        return InlineAsrMlxWhisperOptions(
            repo_id="mlx-community/whisper-turbo",
            inference_framework=InferenceAsrFramework.MLX,
            # ... MLX-specific options
        )
    else:
        return InlineAsrNativeWhisperOptions(
            repo_id="turbo",
            inference_framework=InferenceAsrFramework.WHISPER,
            # ... Native Whisper options
        )
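
The decision rule above can also be separated from the hardware probing, which makes it easy to unit-test without Apple Silicon. The following is a sketch under assumptions, not the code in this PR: `module_available`, `detect_mps` and `choose_backend` are hypothetical helper names, and `importlib.util.find_spec` is used to check for a package without importing it.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def detect_mps() -> bool:
    """Best-effort MPS check; returns False when torch is not installed."""
    if not module_available("torch"):
        return False
    import torch  # imported lazily so the helper works without torch

    return torch.backends.mps.is_built() and torch.backends.mps.is_available()


def choose_backend(has_mps: bool, has_mlx_whisper: bool) -> str:
    """Mirror the rule above: use MLX only when both checks succeed."""
    return "mlx" if has_mps and has_mlx_whisper else "whisper"


if __name__ == "__main__":
    backend = choose_backend(detect_mps(), module_available("mlx_whisper"))
    print(f"selected backend: {backend}")
```

Keeping `choose_backend` a pure function of two booleans means the fallback path can be tested on any platform by passing the flags directly.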

3. CLI Auto-Detection

# Auto-detect pipeline based on input file formats
if pipeline == ProcessingPipeline.STANDARD:
    # Check if any input files are audio files by extension
    audio_extensions = {'.mp3', '.wav', '.m4a', '.aac', '.ogg', '.flac', '.mp4', '.avi', '.mov'}
    for path in input_doc_paths:
        if path.suffix.lower() in audio_extensions:
            pipeline = ProcessingPipeline.ASR
            _log.info(f"Auto-detected ASR pipeline for audio file: {path}")
            break
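
For illustration, the same extension check can be written as a standalone function that does not depend on the CLI state. This is a sketch only: `first_audio_path` is a hypothetical name, and the extension set is copied from the snippet above.

```python
from pathlib import Path
from typing import Iterable, Optional

# Extension set copied from the CLI snippet above.
AUDIO_EXTENSIONS = {
    ".mp3", ".wav", ".m4a", ".aac", ".ogg", ".flac", ".mp4", ".avi", ".mov",
}


def first_audio_path(paths: Iterable[Path]) -> Optional[Path]:
    """Return the first path with a known audio/video suffix, else None."""
    for path in paths:
        # Compare case-insensitively so "TALK.MP3" is also recognized.
        if path.suffix.lower() in AUDIO_EXTENSIONS:
            return path
    return None


if __name__ == "__main__":
    inputs = [Path("report.pdf"), Path("Recording1.MP3")]
    print(first_audio_path(inputs))
```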

Documentation and Examples

Updated Examples

  • docs/examples/minimal_asr_pipeline.py: Updated to show automatic model selection
  • docs/examples/mlx_whisper_example.py: New example demonstrating MLX Whisper usage
  • docs/examples/asr_pipeline_performance_comparison.py: New performance comparison script with --audio parameter

Usage Examples

Python API:

from docling.datamodel import asr_model_specs
from docling.datamodel.pipeline_options import AsrPipelineOptions

# Automatically uses MLX Whisper on Apple Silicon!
pipeline_options = AsrPipelineOptions()
pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO

CLI:

# Automatically detects audio files and uses ASR pipeline with MLX Whisper!
docling ~/Recording1.mp3 --asr-model whisper_turbo
docling ~/Recording1.mp3 --asr-model whisper_medium
docling ~/Recording1.mp3 --asr-model whisper_base

Performance Comparison:

# Use default test audio file
python docs/examples/asr_pipeline_performance_comparison.py

# Use your own audio file
python docs/examples/asr_pipeline_performance_comparison.py --audio ~/Recording1.mp3

# Show help
python docs/examples/asr_pipeline_performance_comparison.py --help

Testing

Comprehensive Test Coverage

  • MLX Whisper model initialization
  • Automatic model selection logic
  • Import error handling
  • Transcription functionality
  • Pipeline integration
  • CLI model selection

Test Results

$ python -m pytest tests/test_asr_mlx_whisper.py tests/test_asr_pipeline.py -v
============================= test session starts ==============================
tests/test_asr_mlx_whisper.py::TestMlxWhisperIntegration::test_mlx_whisper_options_creation PASSED
tests/test_asr_mlx_whisper.py::TestMlxWhisperIntegration::test_whisper_models_auto_select_mlx PASSED
tests/test_asr_mlx_whisper.py::TestMlxWhisperIntegration::test_mlx_whisper_model_initialization PASSED
tests/test_asr_mlx_whisper.py::TestMlxWhisperIntegration::test_mlx_whisper_model_import_error PASSED
tests/test_asr_mlx_whisper.py::TestMlxWhisperIntegration::test_mlx_whisper_transcribe PASSED
tests/test_asr_mlx_whisper.py::TestMlxWhisperIntegration::test_asr_pipeline_with_mlx_whisper PASSED
tests/test_asr_pipeline.py::test_asr_pipeline_conversion PASSED
============================== 7 passed in 3.49s ===============================

Dependencies

Added MLX Whisper Dependency

# pyproject.toml
asr = [
    'mlx-whisper>=0.4.3 ; python_version >= "3.10" and sys_platform == "darwin" and platform_machine == "arm64"',
    "openai-whisper>=20250625",
]

  • Platform Specific: Only installed on Apple Silicon (arm64) macOS systems
  • Python Version: Requires Python 3.10+ (MLX requirement)
  • Optional: Part of the asr extra, doesn't affect core functionality
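
To make the environment marker concrete, here is a rough Python equivalent of the condition on the mlx-whisper line. It is an approximation for illustration: `mlx_whisper_marker_applies` is a hypothetical helper, and real installers evaluate the marker with version-string semantics rather than tuple comparison.

```python
import platform
import sys
from typing import Tuple


def mlx_whisper_marker_applies(
    python_version: Tuple[int, int] = sys.version_info[:2],
    sys_platform: str = sys.platform,
    platform_machine: str = platform.machine(),
) -> bool:
    """Approximate the marker:
    python_version >= "3.10" and sys_platform == "darwin"
    and platform_machine == "arm64"
    """
    return (
        python_version >= (3, 10)
        and sys_platform == "darwin"
        and platform_machine == "arm64"
    )


if __name__ == "__main__":
    # With no arguments the current interpreter and machine are checked.
    print("mlx-whisper would be installed:", mlx_whisper_marker_applies())
```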

User Experience

ASR Pipeline

# Users just use regular Whisper models - MLX is automatic!
pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO  # Works everywhere

docling CLI

# Audio files automatically trigger ASR pipeline with MLX optimization
docling ~/Recording1.mp3 --asr-model whisper_turbo

Benefits

  1. Performance: Up to 19x faster ASR on Apple Silicon devices
  2. Transparency: No user configuration required
  3. Compatibility: Works on all platforms with appropriate fallbacks
  4. Scalability: Supports all Whisper model sizes
  5. Reliability: Comprehensive error handling and testing
  6. Documentation: Clear examples and usage patterns

Files Changed

  • docling/datamodel/pipeline_options_asr_model.py: Added MLX framework and options
  • docling/datamodel/asr_model_specs.py: Implemented automatic model selection
  • docling/pipeline/asr_pipeline.py: Added MLX Whisper model implementation
  • docling/cli/main.py: Added automatic pipeline detection and device configuration
  • docs/examples/minimal_asr_pipeline.py: Updated documentation
  • docs/examples/mlx_whisper_example.py: New MLX Whisper example
  • docs/examples/asr_pipeline_performance_comparison.py: New performance comparison script
  • tests/test_asr_mlx_whisper.py: Comprehensive test suite
  • pyproject.toml: Added MLX Whisper dependency
  • uv.lock: Updated dependency lock file

Checklist

  • Examples have been added
  • Tests have been added
  • Pre-commit checks pass (Ruff formatter, Ruff linter, MyPy, uv-lock)
  • All tests pass (7/7 tests successful)
  • Type safety ensured with proper annotations
  • Backward compatibility maintained
  • Platform-specific dependencies properly configured

Conclusion

This PR delivers a complete, transparent MLX Whisper integration that provides significant performance improvements on Apple Silicon while maintaining full backward compatibility. Users get the benefits of MLX optimization without any configuration changes, making it a true "just works" enhancement to the docling ASR pipeline.

Contributor

github-actions bot commented Oct 2, 2025

DCO Check Passed

Thanks @kensteele, all your commits are properly signed off. 🎉

dosubot bot commented Oct 2, 2025

Related Documentation

Checked 3 published document(s). No updates required.

mergify bot commented Oct 2, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

@PeterStaar-IBM
Contributor

@kensteele this is an awesome PR, thanks a ton!!

codecov bot commented Oct 3, 2025

Codecov Report

❌ Patch coverage is 71.02804% with 62 lines in your changes missing coverage. Please review.

Files with missing lines             | Patch % | Lines
docling/cli/main.py                  | 30.55%  | 25 Missing ⚠️
docling/datamodel/asr_model_specs.py | 78.94%  | 24 Missing ⚠️
docling/pipeline/asr_pipeline.py     | 75.92%  | 13 Missing ⚠️

@PeterStaar-IBM
Contributor

@kensteele I think you might need to add some restriction on the tests docs/examples/asr_pipeline_performance_comparison.py for the unit-tests to go through.

@kensteele
Author

@kensteele I think you might need to add some restriction on the tests docs/examples/asr_pipeline_performance_comparison.py for the unit-tests to go through.

@PeterStaar-IBM Hopefully that should clear the tests! 🤞

@kensteele
Author

@kensteele this is an awesome PR, thanks a ton!!

@PeterStaar-IBM You bet - happy to contribute more!

Looks like all the checks have passed. Is it possible to get a review and merge? @dolfim-ibm @cau-git

PeterStaar-IBM previously approved these changes Oct 6, 2025
@kensteele
Author

kensteele commented Oct 7, 2025

@PeterStaar-IBM Please review the latest commit f114d45 which addresses your comments in the previous review as well as:

Adds support for the following additional audio input formats:

  • m4a
  • aac
  • ogg
  • flac
  • mp4
  • avi
  • mov

Adds support for the following additional MIME types:

  • audio/mp4
  • audio/m4a
  • audio/aac
  • audio/ogg
  • audio/flac
  • audio/x-flac
  • video/mp4
  • video/avi
  • video/x-msvideo
  • video/quicktime

Adds additional sample audio files:

  • audio and video files added to tests/data/audio/ for ASR testing:
sample_10s_audio-aac.aac
sample_10s_audio-flac.flac
sample_10s_audio-m4a.m4a
sample_10s_audio-mp3.mp3
sample_10s_audio-mp4.m4a
sample_10s_audio-mpeg.mp3
sample_10s_audio-ogg.ogg
sample_10s_audio-wav.wav
sample_10s_audio-x-flac.flac
sample_10s_audio-x-wav.wav

sample_10s_video-avi.avi
sample_10s_video-mp4.mp4
sample_10s_video-quicktime.mov
sample_10s_video-x-msvideo.avi
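
As a rough illustration of how the MIME types listed above could be checked, the sketch below normalizes a media type string and looks it up in a set. The names `ADDED_MIME_TYPES`, `normalize_mime` and `is_added_media_type` are hypothetical; the actual mappings in docling live elsewhere and may be structured differently.

```python
# The ten MIME types listed in this comment, collected in one place.
ADDED_MIME_TYPES = frozenset(
    {
        "audio/mp4",
        "audio/m4a",
        "audio/aac",
        "audio/ogg",
        "audio/flac",
        "audio/x-flac",
        "video/mp4",
        "video/avi",
        "video/x-msvideo",
        "video/quicktime",
    }
)


def normalize_mime(value: str) -> str:
    """Lowercase a media type and drop parameters such as '; codecs=...'."""
    return value.split(";", 1)[0].strip().lower()


def is_added_media_type(value: str) -> bool:
    """Return True if the media type is one of the newly listed types."""
    return normalize_mime(value) in ADDED_MIME_TYPES


if __name__ == "__main__":
    for candidate in ("Audio/FLAC", "video/mp4; codecs=avc1", "application/pdf"):
        print(candidate, "->", is_added_media_type(candidate))
```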

@kensteele
Author

@PeterStaar-IBM @cau-git @dolfim-ibm Looks like I need two reviewers per the Mergify requirements: When test data is updated, we require two reviewers

PeterStaar-IBM previously approved these changes Oct 10, 2025
Contributor

@PeterStaar-IBM PeterStaar-IBM left a comment

🎖️

@dolfim-ibm
Contributor

@kensteele the current CI failures seem to be caused by the lock file picking up a new version of av, which only has wheels for Python 3.10. I suggest pinning av<16.0.0

I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: a979a68
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 9827068
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: ebbeb45
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 2f6fd3c

Signed-off-by: Ken Steele <ksteele@gmail.com>
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 5e61bf1

Signed-off-by: Ken Steele <ksteele@gmail.com>
…udio/sample_10s.mp3 if no args specified.

Signed-off-by: Ken Steele <ksteele@gmail.com>
…els.py

- Move audio file extensions from CLI hardcoded set to FormatToExtensions[InputFormat.AUDIO]
- Add support for additional audio formats: m4a, aac, ogg, flac, mp4, avi, mov
- Update FormatToMimeType mapping to include MIME types for all audio formats
- Update CLI auto-detection to use centralized FormatToExtensions mapping
- Add comprehensive tests for audio file auto-detection and pipeline selection
- Ensure explicit pipeline choices are not overridden by auto-detection

Fixes issue where only .mp3 and .wav files were processed as audio despite
CLI auto-detection working for all formats. The document converter now
properly recognizes all audio formats through MIME type detection.

Addresses review comments:
- Centralizes audio extensions in base_models.py as suggested
- Maintains existing auto-detection behavior while using centralized data
- Adds proper test coverage for the audio detection functionality

All examples and tests pass with the new centralized approach.
All audio formats (mp3, wav, m4a, aac, ogg, flac, mp4, avi, mov) now work correctly.

Signed-off-by: Ken Steele <ksteele@gmail.com>
…explicit model options

Review feedback addressed:
1. Fix CLI auto-detection to only switch to ASR pipeline when ALL files are audio
   - Previously switched if ANY file was audio, now requires ALL files to be audio
   - Added warning for mixed file types with guidance to use --pipeline asr

2. Add explicit WHISPER_X_MLX and WHISPER_X_NATIVE model options
   - Users can now force specific implementations if desired
   - Auto-selecting models (WHISPER_BASE, etc.) still choose best for hardware
   - Added 12 new explicit model options: _MLX and _NATIVE variants for each size

CLI now supports:
- Auto-selecting: whisper_tiny, whisper_base, etc. (choose best for hardware)
- Explicit MLX: whisper_tiny_mlx, whisper_base_mlx, etc. (force MLX)
- Explicit Native: whisper_tiny_native, whisper_base_native, etc. (force native)
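
The naming scheme listed above (a base name, optionally followed by `_mlx` or `_native`) can be illustrated with a small parser. This is a hypothetical helper for explanation only; `parse_asr_model_name` and the "auto" label are not taken from the docling code.

```python
from typing import Tuple


def parse_asr_model_name(name: str) -> Tuple[str, str]:
    """Split a CLI model name into (base model, backend choice).

    'whisper_base_mlx' becomes ('whisper_base', 'mlx'); a name without
    an explicit suffix is reported with the backend 'auto'.
    """
    for suffix in ("mlx", "native"):
        if name.endswith("_" + suffix):
            # Drop the underscore and the suffix to recover the base name.
            return name[: -(len(suffix) + 1)], suffix
    return name, "auto"


if __name__ == "__main__":
    for model in ("whisper_tiny", "whisper_tiny_mlx", "whisper_base_native"):
        print(model, "->", parse_asr_model_name(model))
```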

Addresses reviewer comments from @dolfim-ibm
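
The first review item in this commit message (switch to ASR only when ALL inputs are audio, warn on mixed inputs) can be sketched as a small classifier. This is an illustrative assumption about the shape of the logic, not the committed code; `classify_inputs` and its return labels are hypothetical.

```python
from pathlib import Path
from typing import Sequence

# Suffixes for the audio formats named in this PR.
AUDIO_SUFFIXES = {
    ".mp3", ".wav", ".m4a", ".aac", ".ogg", ".flac", ".mp4", ".avi", ".mov",
}


def classify_inputs(paths: Sequence[Path]) -> str:
    """Return 'all_audio', 'mixed', or 'no_audio' for the given inputs."""
    flags = [path.suffix.lower() in AUDIO_SUFFIXES for path in paths]
    if flags and all(flags):
        return "all_audio"  # safe to auto-select the ASR pipeline
    if any(flags):
        return "mixed"  # keep the current pipeline and warn the user
    return "no_audio"


if __name__ == "__main__":
    kind = classify_inputs([Path("talk.mp3"), Path("slides.pdf")])
    if kind == "mixed":
        print("Mixed file types detected; pass --pipeline asr to force ASR.")
```

Note the `flags and all(flags)` guard: `all([])` is True in Python, so an empty input list would otherwise be treated as all audio.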

Signed-off-by: Ken Steele <ksteele@gmail.com>
@kensteele kensteele force-pushed the dev/add-mlx-whisper-support branch from ffc1a57 to fec4f33 Compare October 18, 2025 18:02
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: c60e72d
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 9480331
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 21905e8
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 96c669d
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 8371c06

Signed-off-by: Ken Steele <ksteele@gmail.com>
dolfim-ibm previously approved these changes Oct 20, 2025
Contributor

@dolfim-ibm dolfim-ibm left a comment

lgtm

…ompts

- tests/test_asr_mlx_whisper.py: verify explicit MLX options (framework, repo ids)
- tests/test_asr_pipeline.py: cover _has_text/_determine_status and backend support with proper InputDocument/NoOpBackend wiring
- tests/test_interfaces.py: add BaseVlmPageModel.formulate_prompt tests (RAW/NONE/CHAT, invalid style), with minimal InlineVlmOptions scaffold

Improves reliability of ASR and VLM components by validating configuration paths and helper logic.

Signed-off-by: Ken Steele <ksteele@gmail.com>
PeterStaar-IBM previously approved these changes Oct 20, 2025
Contributor

@PeterStaar-IBM PeterStaar-IBM left a comment

🎖️

@dolfim-ibm
Contributor

note: there seems to be some temporary issue with the HF artifacts. we will retry launching the CI in this PR (and others) later today.

@kensteele kensteele dismissed stale reviews from PeterStaar-IBM and dolfim-ibm via f3a2ba2 October 20, 2025 08:50
@kensteele
Author

note: there seems to be some temporary issue with the HF artifacts. we will retry launching the CI in this PR (and others) later today.

@dolfim-ibm @PeterStaar-IBM While we're waiting on the HF artifacts issue, can I get two quick reviews/approvals on the code coverage additions I just committed to pass the codecov check @ f3a2ba2?

Development

Successfully merging this pull request may close these issues.

Feature Request: Add MLX Whisper Support for Apple Silicon ASR Pipeline

3 participants