model: add hunyuan dense #14878
base: master
Conversation
Signed-off-by: stevenkuang <stevenkuang@tencent.com>

This reverts commit aa973ca.
Signed-off-by: stevenkuang <stevenkuang@tencent.com>
@@ -684,6 +684,9 @@ def get_vocab_base_pre(self, tokenizer) -> str:
         if chkhsh == "7e57df22b1fe23a7b1e1c7f3dc4e3f96d43a4eb0836d0c6bdc3436d7b2f1c664":
             # ref: https://huggingface.co/tencent/Hunyuan-A13B-Instruct
             res = "hunyuan"
+        if chkhsh == "bba3b3366b646dbdded5dbc42d59598b849371afc42f7beafa914afaa5b70aa6":
+            # ref: https://huggingface.co/tencent/Hunyuan-4B
+            res = "hunyuan"
Just checking; is it using the same pre-tokenizer regex as the MoE?
Hunyuan has two vocabulary types, regardless of whether the model is MoE or dense.
And are they all using the same regex, i.e. this one?
Lines 355 to 360 in 11dd5a4
case LLAMA_VOCAB_PRE_TYPE_HUNYUAN:
    regex_exprs = {
        // original regex from tokenizer.json
        // "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
        "(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
    };
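The expansion on the second line exists because llama.cpp's regex engine doesn't support the inline case-insensitive group `(?i:...)` from tokenizer.json. A quick way to sanity-check that the two forms split text identically, sketched in Python with the third-party `regex` package (which, unlike `re`, supports `\p{L}`/`\p{N}`):

```python
import regex  # third-party "regex" package; the stdlib `re` lacks \p{L}/\p{N}

original = (r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}"
            r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+")
expanded = (r"(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])"
            r"|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*"
            r"|\s*[\r\n]+|\s+(?!\S)|\s+")

# Mixed-case contractions, punctuation, whitespace runs, CJK, and digits.
sample = "He'S fine, isn't IT?  \n\n混元 tokenizer 123"
assert regex.findall(original, sample) == regex.findall(expanded, sample)
print(regex.findall(expanded, sample))
```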
Signed-off-by: stevenkuang <stevenkuang@tencent.com>
def set_vocab(self):
    if (self.dir_model / "tokenizer.json").is_file():
        self._set_vocab_gpt2()
        self.gguf_writer.add_add_bos_token(True)
It shouldn't be necessary to set this manually; a correctly configured model has this set in tokenizer_config.json, and it will be picked up from there by gguf.SpecialVocab (called from _set_vocab_gpt2).
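A sketch of the mechanism being referred to (file names and the arch string below are hypothetical): if the model ships `"add_bos_token": true` in its tokenizer_config.json, `gguf.SpecialVocab` reads it and writes the corresponding GGUF key, so the manual `add_add_bos_token(True)` call is redundant:

```python
from pathlib import Path
import gguf

model_dir = Path("models/Hunyuan-4B")  # hypothetical local model directory
writer = gguf.GGUFWriter("hunyuan-4b.gguf", arch="hunyuan")  # arch name illustrative

# Reads tokenizer_config.json (and related files) from model_dir and forwards
# special-token IDs plus flags such as add_bos_token to the writer.
special_vocab = gguf.SpecialVocab(model_dir, load_merges=True)
special_vocab.add_to_gguf(writer)  # emits tokenizer.ggml.add_bos_token, etc.
```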
@stevenkuang-tencent gentle ping
Update: