Support intern-s1 #14875
Conversation
@CISC hi, could you tell me how to fix this error? It doesn't seem reasonable to me.

Referenced: llama.cpp/convert_hf_to_gguf.py, lines 3002 to 3005 in 5eba3e3
convert_hf_to_gguf.py (outdated):

```python
        self._set_vocab_gpt2()

    def _set_vocab_interns1(self):
        tokens, toktypes, tokpre = self.get_vocab_base()
```
This does not work because Intern-S1 requires custom code; you must handle that here instead of calling the base-class get_vocab_base().

The Intern-S1 tokenizer looks like it's fairly special, so I think it requires custom handling to work as intended, not just using AutoTokenizer.
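For the conversion-script side alone, a minimal sketch of what "handling it here" (rather than calling the base-class get_vocab_base()) might look like, assuming the Intern-S1 tokenizer can at least be loaded with trust_remote_code=True. This is not the PR's code, the pre-tokenizer name is a placeholder, and it only covers exporting the merged vocabulary; it does not solve the runtime sub-vocab behaviour discussed below:

```python
# Illustrative sketch only. It mirrors the shape of _set_vocab_gpt2() /
# get_vocab_base() in convert_hf_to_gguf.py, but loads the remote-code
# tokenizer itself; token-type classification is deliberately simplified.
from transformers import AutoTokenizer
import gguf


def _set_vocab_interns1(self):
    tokenizer = AutoTokenizer.from_pretrained(self.dir_model, trust_remote_code=True)
    vocab = tokenizer.get_vocab()
    vocab_size = max(vocab.values()) + 1
    reverse_vocab = {tok_id: tok for tok, tok_id in vocab.items()}

    tokens: list[str] = []
    toktypes: list[int] = []
    for i in range(vocab_size):
        if i not in reverse_vocab:
            tokens.append(f"[PAD{i}]")              # fill holes in the id space
            toktypes.append(gguf.TokenType.UNUSED)
        elif i in tokenizer.all_special_ids:
            tokens.append(reverse_vocab[i])
            toktypes.append(gguf.TokenType.CONTROL)
        else:
            tokens.append(reverse_vocab[i])
            toktypes.append(gguf.TokenType.NORMAL)

    self.gguf_writer.add_tokenizer_model("gpt2")
    self.gguf_writer.add_tokenizer_pre("default")   # placeholder pre-tokenizer name
    self.gguf_writer.add_token_list(tokens)
    self.gguf_writer.add_token_types(toktypes)

    special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=True)
    special_vocab.add_to_gguf(self.gguf_writer)
```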
@CISC Hi, thanks for the reminder. Indeed, the Intern-S1 tokenizer is special: it is based on the Qwen3 BPE tokenizer and extended with three SPM tokenizer models, using regex patterns to decide which sub-vocab to use when tokenizing. I don't know how to implement this in llama.cpp. Do you have any suggestions for this special case? Thanks!
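To illustrate the routing just described (the structure, names, and patterns here are hypothetical, not Intern-S1's actual implementation), a regex-driven sub-vocab dispatch could look roughly like this:

```python
import re
from typing import Callable

# A sub-tokenizer is anything that turns a string into token pieces,
# e.g. an SPM model for SMILES strings or the base Qwen3-style BPE.
SubTokenizer = Callable[[str], list[str]]


def route_and_tokenize(
    text: str,
    base_bpe: SubTokenizer,
    sub_vocabs: list[tuple[re.Pattern[str], SubTokenizer]],
) -> list[str]:
    """Spans matched by a sub-vocab pattern go to that sub-tokenizer;
    everything else falls through to the base BPE tokenizer."""
    # Collect all sub-vocab matches in order of appearance in the text.
    matches = sorted(
        ((m, tok) for pattern, tok in sub_vocabs for m in pattern.finditer(text)),
        key=lambda mt: mt[0].start(),
    )
    pieces: list[str] = []
    pos = 0
    for m, tok in matches:
        if m.start() < pos:
            continue  # skip matches overlapping an earlier one
        if m.start() > pos:
            pieces += base_bpe(text[pos:m.start()])
        pieces += tok(m.group(0))
        pos = m.end()
    if pos < len(text):
        pieces += base_bpe(text[pos:])
    return pieces
```

In llama.cpp itself, the equivalent of this dispatch would have to live in the C++ tokenizer (llama-vocab.cpp), which is the hard part noted below.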
No easy feat, I'm afraid, SMILES especially; you will have to add a special case for it to do the sub-vocab matching and implement the tokenizer in llama-vocab.cpp.
Agreed. After consideration, the sub-vocab feature will not be added in this PR.
Support internlm/Intern-S1