
model: add hunyuan dense #14878


Open
stevenkuang-tencent wants to merge 9 commits into master

Conversation

stevenkuang-tencent (Contributor) commented Jul 25, 2025

Update:

  • Support hunyuan_dense
  • Fix the hunyuan_moe chat template

Signed-off-by: stevenkuang <stevenkuang@tencent.com>
@github-actions github-actions bot added the python python script changes label Jul 25, 2025
@stevenkuang-tencent stevenkuang-tencent changed the title model: add hunyuan v1 dense model: add hunyuan dense Jul 25, 2025
@@ -684,6 +684,9 @@ def get_vocab_base_pre(self, tokenizer) -> str:
        if chkhsh == "7e57df22b1fe23a7b1e1c7f3dc4e3f96d43a4eb0836d0c6bdc3436d7b2f1c664":
            # ref: https://huggingface.co/tencent/Hunyuan-A13B-Instruct
            res = "hunyuan"
        if chkhsh == "bba3b3366b646dbdded5dbc42d59598b849371afc42f7beafa914afaa5b70aa6":
            # ref: https://huggingface.co/tencent/Hunyuan-4B
            res = "hunyuan"
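For context, the `chkhsh` values compared above are SHA-256 fingerprints of how a tokenizer splits a fixed probe text. A minimal sketch of the idea, assuming (as `convert_hf_to_gguf.py` does) that the hash is taken over the stringified token-id list; the helper name `tokenizer_fingerprint` is hypothetical:

```python
from hashlib import sha256

def tokenizer_fingerprint(token_ids: list[int]) -> str:
    # Hash the stringified token-id list produced by encoding a fixed
    # probe text. This mirrors how convert_hf_to_gguf.py derives chkhsh;
    # the actual probe string lives in the script itself.
    return sha256(str(token_ids).encode()).hexdigest()

# Tokenizers that split the probe text differently yield different
# fingerprints, which is what lets get_vocab_base_pre tell them apart
# even when two models both report res = "hunyuan".
a = tokenizer_fingerprint([101, 2023, 102])
b = tokenizer_fingerprint([101, 2024, 102])
```

Because both Hunyuan hashes map to the same `res = "hunyuan"`, the two vocabularies share one pre-tokenizer configuration downstream.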
Collaborator
Just checking; is it using the same pre-tokenizer regex as the MoE?

Contributor Author

> Just checking; is it using the same pre-tokenizer regex as the MoE?

There are two types of vocabulary in Hunyuan, regardless of whether the model is MoE or dense.

Collaborator

And are they all using the same regex, i.e. this one?

    case LLAMA_VOCAB_PRE_TYPE_HUNYUAN:
        regex_exprs = {
            // original regex from tokenizer.json
            // "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
            "(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
        };
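The commented-out original pattern uses a scoped case-insensitive group `(?i:...)` for the English contractions, while the active pattern expands it into explicit character classes. The two forms accept the same contraction strings, which can be spot-checked with Python's `re` module (the sample strings below are hypothetical test inputs, not from the tokenizer):

```python
import re

# Contraction part of the original tokenizer.json regex (scoped (?i:) flag)
# versus the expanded form used in llama.cpp's regex table.
original = r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"
expanded = r"(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])"

# Both patterns should agree on every sample: match all case variants of
# the contractions and reject anything else.
samples = ["'s", "'S", "'t", "'re", "'RE", "'Ll", "'d", "'D", "'x", "s"]
for s in samples:
    assert bool(re.fullmatch(original, s)) == bool(re.fullmatch(expanded, s))
```

The expansion avoids relying on scoped inline flags, which not every regex engine used by llama.cpp supports uniformly.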

@stevenkuang-tencent stevenkuang-tencent requested a review from CISC July 26, 2025 17:19
    def set_vocab(self):
        if (self.dir_model / "tokenizer.json").is_file():
            self._set_vocab_gpt2()
            self.gguf_writer.add_add_bos_token(True)
Collaborator

It shouldn't be necessary to set this manually; a correctly configured model has this set in tokenizer_config.json, and it will be picked up from there by gguf.SpecialVocab (called from _set_vocab_gpt2).
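A minimal sketch of the pickup path the reviewer describes, assuming a standard `tokenizer_config.json` carrying an `add_bos_token` flag; the helper name `wants_bos` is hypothetical, and the real logic lives in `gguf.SpecialVocab`:

```python
import json
from pathlib import Path

def wants_bos(dir_model: Path) -> bool:
    # gguf.SpecialVocab reads tokenizer_config.json and, when present,
    # forwards flags like add_bos_token to the GGUF writer, so the
    # converter does not need to call add_add_bos_token() by hand.
    cfg_path = dir_model / "tokenizer_config.json"
    if not cfg_path.is_file():
        return False
    cfg = json.loads(cfg_path.read_text(encoding="utf-8"))
    return bool(cfg.get("add_bos_token", False))
```

With this in place, the explicit `add_add_bos_token(True)` call in the diff above is redundant for well-formed model repos.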

@CISC (Collaborator) commented Jul 29, 2025

@stevenkuang-tencent gentle ping

Labels: python (python script changes)