mtmd : add support for Voxtral #14862
Merged
Commits (11)
- cd909c2 mtmd : add support for Voxtral (ngxson)
- 5fc3507 clean up (ngxson)
- 2da31ed fix python requirements (ngxson)
- 49045bd Merge branch 'master' into xsn/voxtral (ngxson)
- 97119dd add [BEGIN_AUDIO] token (ngxson)
- b828887 also support Devstral conversion (ngxson)
- 738be19 add docs and tests (ngxson)
- 8b2d72d fix regression for ultravox (ngxson)
- 01bf687 minor coding style improvement (ngxson)
- 4556b40 correct project activation fn (ngxson)
- 8c543f7 Apply suggestions from code review (ngxson)
Conversations
The fact that Mistral wants to delegate chat formatting to their Python library `mistral-common` makes this part a bit tricky: chat templates are no longer stored inside the HF repo, even with `transformers` models. Therefore, we have no better way than to store a Jinja version somewhere inside llama.cpp for compatibility reasons.
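To illustrate what "storing a Jinja version inside llama.cpp" implies, here is a minimal sketch of rendering a hand-stored chat template with the `jinja2` Python package. The template string is a simplified stand-in for illustration only, not the actual Voxtral/Mistral template:

```python
# Sketch: rendering a hand-stored Jinja chat template.
# CHAT_TEMPLATE below is a simplified illustration, NOT the real
# template shipped by mistral-common.
from jinja2 import Template

CHAT_TEMPLATE = (
    "{% for m in messages %}"
    "{% if m.role == 'user' %}[INST]{{ m.content }}[/INST]"
    "{% else %}{{ m.content }}</s>{% endif %}"
    "{% endfor %}"
)

def render_prompt(messages):
    """Apply the stored template to a list of {role, content} dicts."""
    return Template(CHAT_TEMPLATE).render(messages=messages)

print(render_prompt([
    {"role": "user", "content": "hello"},
    {"role": "assistant", "content": "hi"},
]))
```

The point is that once the template text lives inside llama.cpp rather than the model repo, it must be kept in sync with whatever `mistral-common` does upstream.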
A quick look at the tests in `mistral-common` seems to show that they've changed the chat template a bit for Voxtral to accommodate audio, including some new tokens to wrap the input audio (not sure if those are being added here or not? Not super familiar with `mtmd`). It also seems like Mistral adds a special token for transcription tasks, which appears to be an optional language followed by a `[TRANSCRIBE]` token, based on the tests.
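As a rough sketch of what "wrapping the input audio" could look like in the text prompt, assuming `[BEGIN_AUDIO]` precedes the audio embeddings (the placeholder marker and helper function here are hypothetical, invented for illustration):

```python
# Illustrative only: how [BEGIN_AUDIO] might wrap the audio segment
# in the prompt. The placeholder count depends on the encoder output
# length; AUDIO_PLACEHOLDER and build_audio_chunk are hypothetical.
AUDIO_PLACEHOLDER = "<audio>"  # hypothetical marker, replaced by embeddings

def build_audio_chunk(n_audio_tokens: int) -> str:
    # [BEGIN_AUDIO] marks the start of the audio segment, followed by
    # one placeholder position per audio embedding.
    return "[BEGIN_AUDIO]" + AUDIO_PLACEHOLDER * n_audio_tokens

prompt = "[INST]" + build_audio_chunk(4) + "What is said here?[/INST]"
print(prompt)
```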
Hmm, yeah, we're missing the `[BEGIN_AUDIO]` token. I have no idea if `[TRANSCRIBE]` is required, though; it seems like the model is also able to summarize the audio, not just transcribe it.
Fun thing I've just found out: without the `[BEGIN_AUDIO]` token, the model is still able to "understand" the audio, but it sometimes thinks the audio is a bunch of text.
From what I understand from the HF page, it basically has two modes: without `[TRANSCRIBE]`, it can understand and talk about the audio, like you've shown; in `[TRANSCRIBE]` mode, it only outputs the transcription, similar to Whisper, I believe.