-
Notifications
You must be signed in to change notification settings - Fork 3k
feat: Add MLX Whisper Support for Apple Silicon ASR Pipeline #2366
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: Add MLX Whisper Support for Apple Silicon ASR Pipeline #2366
Conversation
✅ DCO Check Passed Thanks @kensteele, all your commits are properly signed off. 🎉 |
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🔴 Require two reviewer for test updatesThis rule is failing.When test data is updated, we require two reviewers
🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
@kensteele this is an awesome PR, thanks a ton!! |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
@kensteele I think you might need to add some restriction on the tests |
@PeterStaar-IBM Hopefully that should clear the tests! 🤞 |
@PeterStaar-IBM You bet - happy to contribute more! Looks like all the the checks have passed - possible to get a review and merge? @dolfim-ibm @cau-git |
@PeterStaar-IBM Please review the latest commit f114d45 which addresses your comments in the previous review as well as: Adds comprehensive support the following additional audio input formats:
Adds support for the following additional MIME types:
Adds additional sample audio files:
|
@PeterStaar-IBM @cau-git @dolfim-ibm Looks like I need two reviewers per the Mergify requirements: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎖️
f114d45
to
712b586
Compare
@kensteele the current CI failures seem to be caused by the lock of |
…omatically if present.
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: a979a68 I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 9827068 I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: ebbeb45 I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 2f6fd3c Signed-off-by: Ken Steele <ksteele@gmail.com>
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 5e61bf1 Signed-off-by: Ken Steele <ksteele@gmail.com>
…udio/sample_10s.mp3 if no args specified. Signed-off-by: Ken Steele <ksteele@gmail.com>
…els.py - Move audio file extensions from CLI hardcoded set to FormatToExtensions[InputFormat.AUDIO] - Add support for additional audio formats: m4a, aac, ogg, flac, mp4, avi, mov - Update FormatToMimeType mapping to include MIME types for all audio formats - Update CLI auto-detection to use centralized FormatToExtensions mapping - Add comprehensive tests for audio file auto-detection and pipeline selection - Ensure explicit pipeline choices are not overridden by auto-detection Fixes issue where only .mp3 and .wav files were processed as audio despite CLI auto-detection working for all formats. The document converter now properly recognizes all audio formats through MIME type detection. Addresses review comments: - Centralizes audio extensions in base_models.py as suggested - Maintains existing auto-detection behavior while using centralized data - Adds proper test coverage for the audio detection functionality All examples and tests pass with the new centralized approach. All audio formats (mp3, wav, m4a, aac, ogg, flac, mp4, avi, mov) now work correctly. Signed-off-by: Ken Steele <ksteele@gmail.com>
…explicit model options Review feedback addressed: 1. Fix CLI auto-detection to only switch to ASR pipeline when ALL files are audio - Previously switched if ANY file was audio, now requires ALL files to be audio - Added warning for mixed file types with guidance to use --pipeline asr 2. Add explicit WHISPER_X_MLX and WHISPER_X_NATIVE model options - Users can now force specific implementations if desired - Auto-selecting models (WHISPER_BASE, etc.) still choose best for hardware - Added 12 new explicit model options: _MLX and _NATIVE variants for each size CLI now supports: - Auto-selecting: whisper_tiny, whisper_base, etc. (choose best for hardware) - Explicit MLX: whisper_tiny_mlx, whisper_base_mlx, etc. (force MLX) - Explicit Native: whisper_tiny_native, whisper_base_native, etc. (force native) Addresses reviewer comments from @dolfim-ibm Signed-off-by: Ken Steele <ksteele@gmail.com>
ffc1a57
to
fec4f33
Compare
I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: c60e72d I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 9480331 I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 21905e8 I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 96c669d I, Ken Steele <ksteele@gmail.com>, hereby add my Signed-off-by to this commit: 8371c06 Signed-off-by: Ken Steele <ksteele@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
…ompts - tests/test_asr_mlx_whisper.py: verify explicit MLX options (framework, repo ids) - tests/test_asr_pipeline.py: cover _has_text/_determine_status and backend support with proper InputDocument/NoOpBackend wiring - tests/test_interfaces.py: add BaseVlmPageModel.formulate_prompt tests (RAW/NONE/CHAT, invalid style), with minimal InlineVlmOptions scaffold Improves reliability of ASR and VLM components by validating configuration paths and helper logic. Signed-off-by: Ken Steele <ksteele@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎖️
note: there seems to be some temporary issue with the HF artifacts. we will retry launching the CI in this PR (and others) later today. |
f3a2ba2
@dolfim-ibm @PeterStaar-IBM While we're waiting on the HF artifacts issue, can I get two quick reviews/approvals on the code coverage additions I just committed to pass the codecov test @ f3a2ba2 |
Description
This PR adds comprehensive MLX Whisper support to the docling ASR pipeline, providing significant performance improvement on Apple Silicon devices through automatic hardware-aware model selection. The integration is completely transparent to users - they simply use regular Whisper models and get MLX optimization automatically when beneficial.
Issue resolved by this Pull Request:
Resolves #2364
Key Features
WHISPER_TURBO
,WHISPER_BASE
, etc.Performance Results
Actual performance comparison on Apple Silicon (M1/M2/M3) using 10-second audio sample:
Key insights:
Technical Implementation
1. MLX Framework Integration
2. Automatic Model Selection
3. CLI Auto-Detection
Documentation and Examples
Updated Examples
docs/examples/minimal_asr_pipeline.py
: Updated to show automatic model selectiondocs/examples/mlx_whisper_example.py
: New example demonstrating MLX Whisper usagedocs/examples/asr_pipeline_performance_comparison.py
: New performance comparison script with--audio
parameterUsage Examples
Python API:
CLI:
Performance Comparison:
Testing
Comprehensive Test Coverage
Test Results
Dependencies
Added MLX Whisper Dependency
asr
extra, doesn't affect core functionalityUser Experience
ASR Pipeline
docling CLI
Benefits
Files Changed
docling/datamodel/pipeline_options_asr_model.py
: Added MLX framework and optionsdocling/datamodel/asr_model_specs.py
: Implemented automatic model selectiondocling/pipeline/asr_pipeline.py
: Added MLX Whisper model implementationdocling/cli/main.py
: Added automatic pipeline detection and device configurationdocs/examples/minimal_asr_pipeline.py
: Updated documentationdocs/examples/mlx_whisper_example.py
: New MLX Whisper exampledocs/examples/asr_pipeline_performance_comparison.py
: New performance comparison scripttests/test_asr_mlx_whisper.py
: Comprehensive test suitepyproject.toml
: Added MLX Whisper dependencyuv.lock
: Updated dependency lock fileChecklist
Conclusion
This PR delivers a complete, transparent MLX Whisper integration that provides significant performance improvements on Apple Silicon while maintaining full backward compatibility. Users get the benefits of MLX optimization without any configuration changes, making it a true "just works" enhancement to the docling ASR pipeline.