model: Add support for GLM 4.5 family of models (#14921) #14939
Conversation
Force-pushed from 5da3811 to ec5c193
Just a few quick notes from a glance:
Will do a proper review when you are ready. :) |
Hey @CISC, no worries on the naming etc., will do. |
Force-pushed from c4dbf69 to b4c60e1
FYI when trying to run |
That's because converting FP8 weights isn't supported yet, see #14810 |
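For anyone curious what that support involves: FP8 checkpoints store E4M3 bytes plus per-block scales, so the converter has to dequantize before it can write BF16/F16 tensors. Below is a minimal C++ sketch of just the decode step, assuming a DeepSeek-style blockwise FP8 layout with `weight_scale_inv` scale tensors; that layout is an assumption here, and this is not code from this PR or from #14810.

```cpp
#include <cstdint>
#include <cmath>

// Decode one FP8 E4M3 value (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits).
// E4M3 has no infinities; exponent=15 with mantissa=7 encodes NaN.
static float fp8_e4m3_to_f32(uint8_t v) {
    const int sign = (v >> 7) & 1;
    const int exp  = (v >> 3) & 0xF;
    const int man  =  v       & 0x7;
    float out;
    if (exp == 0) {
        out = std::ldexp((float) man / 8.0f, -6);              // subnormal: (m/8) * 2^-6
    } else if (exp == 0xF && man == 0x7) {
        out = NAN;                                              // NaN encoding
    } else {
        out = std::ldexp(1.0f + (float) man / 8.0f, exp - 7);   // normal: (1 + m/8) * 2^(e-7)
    }
    return sign ? -out : out;
}

// Blockwise dequantization then amounts to (scale names are an assumption):
//   w_f32[i] = fp8_e4m3_to_f32(w_fp8[i]) * scale_for_block_of(i)   // from weight_scale_inv
```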
Force-pushed from 1957023 to 4397ccb
I'm close to having convert_hf_to_gguf.py and llama-quantize working (see the updated PR): conversion completes without error and I was then able to quantise to Q4_K_M. gguf-dump worked, but llama-server picked up a tensor mapping issue with token_embd.weight, so I've just put a fix into convert_hf_to_gguf.py. I'm going through the whole conversion-then-quantisation process again; it's getting late here (Hi from Melbourne 👋), so I'll come back and see if it's finished in ~20. |
The LLM_TYPE code is wrong, those models aren't (respectively) dense 12B and 32B models. You have to add new MoE constants for them (see Qwen3 and Ernie MoEs as examples). |
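For illustration, roughly what that looks like in src/llama-model.cpp; the constant names and the layer counts in the switch are assumptions for the sketch, not necessarily what this PR ends up merging:

```cpp
// Hypothetical additions to the llm_type enum, mirroring other MoE models:
//   LLM_TYPE_106B_A12B  -> GLM-4.5-Air (~106B total, ~12B active parameters)
//   LLM_TYPE_355B_A32B  -> GLM-4.5     (~355B total, ~32B active parameters)
// and in llama_model::load_hparams() for the GLM 4.5 MoE arch:
switch (hparams.n_layer) {
    case 46: type = LLM_TYPE_106B_A12B; break; // GLM-4.5-Air (layer count assumed)
    case 92: type = LLM_TYPE_355B_A32B; break; // GLM-4.5     (layer count assumed)
    default: type = LLM_TYPE_UNKNOWN;
}
```

The published sizes are roughly 355B total / 32B active for GLM-4.5 and 106B total / 12B active for GLM-4.5-Air, which is what the A-suffix naming convention conveys.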
Also, you might want to include the nextn tensors instead of throwing them out - MTP support is not there yet, but that way you won't have to reconvert and requantize if/when it arrives. |
Thanks @pwilkin, LLM_TYPE updated. I've added the nextn tensors into the conversion, skipping mapping to avoid errors. |
Note that preserving the nextn tensors does result in a larger GGUF (780 tensors -> 1184 & 214GB -> 221GB for the f16) |
I can't replicate that error @Thireus |
Obviously, but they won't get loaded since they're not supported 😄 Also, don't make my mistake: don't convert to f16, use --outtype bf16 or your model will probably have errors in the tensors. |
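The likely reason, as a hedged aside: f16 overflows to infinity above 65504, while bf16 keeps float32's full exponent range (at lower mantissa precision), so bf16-sourced weights with large magnitudes survive --outtype bf16 but not f16. A tiny self-contained illustration using ggml's conversion helpers:

```cpp
#include <cstdio>
#include "ggml.h"

int main() {
    const float big = 1.0e5f; // larger than the f16 maximum of 65504
    // f16 round-trip overflows to inf; bf16 round-trip stays close to the original value.
    printf("f16  round-trip: %g\n", ggml_fp16_to_fp32(ggml_fp32_to_fp16(big)));
    printf("bf16 round-trip: %g\n", ggml_bf16_to_fp32(ggml_fp32_to_bf16(big)));
    return 0;
}
```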
If you add unused tensors to the GGUF you must mark those tensors as unused. Just FYI, all other models with MTP so far have those tensors stripped. |
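A rough sketch of that idea, using the TENSOR_SKIP loader flag this PR later adds; the LLM_TENSOR_NEXTN_* names and the shapes below are illustrative assumptions, not the merged code:

```cpp
// Create the NextN/MTP tensors with a skip flag so they remain in the GGUF
// but are never allocated into the compute graph (MTP is not implemented yet).
if (hparams.num_nextn_predict_layers > 0) {
    const int flags = llama_model_loader::TENSOR_SKIP;
    layer.nextn_eh_proj = create_tensor(tn(LLM_TENSOR_NEXTN_EH_PROJ, "weight", il), {2*n_embd, n_embd}, flags);
    layer.nextn_enorm   = create_tensor(tn(LLM_TENSOR_NEXTN_ENORM,   "weight", il), {n_embd},           flags);
    // ... embed_tokens, hnorm, shared_head.head and shared_head.norm handled the same way
}
```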
Ah, that'd explain why I'm getting that. I'll have to come back to this in the morning as it's getting late here. If anyone is keen for this ASAP and has improvements, feel free to either raise a PR against my branch or pull my commits into a PR of your own if you have a better approach, and I'll review in the morning. |
I'll just put it out there right now; no-one should make GGUFs from this PR public yet, there will be changes! :) |
Absolutely, I hope people do not do that - it's very much in draft and I'm learning as I go. |
Force-pushed from 9d6ea41 to 7f026fb
@sammcj, 7f026fb#diff-4f653096980bd7d10518aa909cb648452cd3aa380ff93cb9fb642dca48536526 fixed the issue, thanks. |
the fix seems to work, still testing -> INFO:hf-to-gguf:Model successfully exported to models/glm-45-air-f16.gguf |
Is there a way to disable thinking on this model through a parameter? |
Yes, the template supports |
For people having the same question as I did: Make sure you use |
A big thank you to @CISC for all your hard work on this one! 🙇 |
I'm a bit confused now as @sammcj posted this on Reddit not long ago:
Is there a working jinja template somewhere? |
Thanks! |
@CISC I narrowed down the gibberish issue a bit. It requires setting --batch-size 4096 --ubatch-size 4096 and possibly having a long multi-turn chat going. When I removed the batch-size / ubatch-size, my 40k and 50k token chats began working again. Setting the sizes up to 2048 / 2048 also worked. Something about 4096 / 4096 combined with over 32k context across multiple turns leads to that gibberish edge case. I also tried a needle in a haystack test with a 35k token prompt with a direction to answer a question from the text as a one-shot and that worked. So I don't have a reproducible smoking gun, but batch-size / ubatch-size is involved and for now I'm just scaling them back to make it work. |
Ah, ok, so that means it's not a model issue then, that's great! Submit an issue though. :) |
Just FYI for anyone wanting to create i-quants: as the final layer will not get imatrix data until MTP is supported, it has to be overridden for lower quants to work, e.g. using |
I am getting over 45 t/s on three 3090s with the Unsloth Q4 quant of GLM-4.5-Air; here is the optimized command:
|
I can confirm it's not warming up. Manually setting
If I patch
```cpp
uint32_t llama_context::graph_max_nodes() const {
    //return std::max<uint32_t>(1024u, 8u*model.n_tensors());
    return std::max<uint32_t>(65536u, 8u*model.n_tensors());
}
```
and then run with
You then need to rerun without
I've got to go out so no more time to investigate until later. |
Actually, no it's still not warming up properly - it's just a lot quicker because it's got the experts mmapped I think... Will see if I can figure it out later if nobody else has by then. |
* model: Add GLM 4.5 (ggml-org#14921)
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Merge in PR suggestions
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* model: Add GLM 4.5 family of models (ggml-org#14921)
  1. Updated tensor_mapping.py with NextN tensor mappings
     - Added proper tensor mappings for all NextN/MTP tensors in /Users/samm/git/llama.cpp/gguf-py/gguf/tensor_mapping.py
     - Added mappings for: eh_proj, embed_tokens, enorm, hnorm, shared_head.head, shared_head.norm
  2. Added num_nextn_predict_layers configuration
     - Added LLM_KV_NUM_NEXTN_PREDICT_LAYERS constant to llama-arch.h and llama-arch.cpp
     - Added num_nextn_predict_layers field to llama_hparams struct
     - Updated GLM4_MOE parameter loading in llama-model.cpp to read this parameter
     - Modified tensor loading logic to conditionally load NextN tensors based on num_nextn_predict_layers
     - Added GGUF writer support in gguf_writer.py with add_num_nextn_predict_layers() method
     - Updated conversion script to extract and write this parameter from HuggingFace config
  3. Added FIM tokens for GLM4_MOE
     - Added GLM-4.5's FIM tokens to llama-vocab.cpp:
       - <|code_prefix|> for FIM_PRE
       - <|code_suffix|> for FIM_SUF
       - <|code_middle|> for FIM_MID
  4. Removed manual NextN tensor handling
     - Removed the special-case handling in convert_hf_to_gguf.py that manually mapped NextN tensors
     - NextN tensors are now handled automatically through the proper tensor mapping system
* glm 4.5 update tensors names
* model: glm 4.5 apply suggestions from code review
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* model: glm 4.5 apply suggestions from code review
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* model: glm 4.5 apply suggestions from code review
* Apply suggestions from code review
* patch broken chat template
* typings fix
* add TENSOR_SKIP flag
  Co-authored-by: Diego Devesa <slarengh@gmail.com>
* Update src/llama-model-loader.h
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
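As a concrete illustration of item 2 in the commit message above, the hparams wiring amounts to roughly the following; this is a sketch using the key and field names from the commit message, and the merged code may spell things differently:

```cpp
// llama-model.cpp, inside llama_model::load_hparams(), for the GLM 4.5 MoE arch:
// read the number of NextN/MTP prediction layers written by the converter.
ml.get_key(LLM_KV_NUM_NEXTN_PREDICT_LAYERS, hparams.num_nextn_predict_layers);

// later, during tensor setup: only create the NextN tensors if the model has them
if (hparams.num_nextn_predict_layers > 0) {
    // NextN tensor creation goes here (see the TENSOR_SKIP discussion above)
}
```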
I've found it:

```cpp
// MoE layer with shared experts
//const int64_t n_expert      = hparams.n_expert;
//const int64_t n_expert_used = hparams.n_expert_used;

// Process routed experts using existing MoE infrastructure
ggml_tensor * routed_out = build_moe_ffn(cur,
        model.layers[il].ffn_gate_inp,
        model.layers[il].ffn_up_exps,
        model.layers[il].ffn_gate_exps,
        model.layers[il].ffn_down_exps,
        model.layers[il].ffn_exp_probs_b,
        n_expert, n_expert_used,
        LLM_FFN_SILU, hparams.expert_weights_norm,
        true, hparams.expert_weights_scale,
        (llama_expert_gating_func_type) hparams.expert_gating_func,
        il);
cb(routed_out, "ffn_moe_out", il);
```

The local `n_expert` and `n_expert_used` were shadowing the members set up in the graph-context constructor, which is where warmup bumps `n_expert_used` up to `n_expert`:

```cpp
llm_graph_context::llm_graph_context(const llm_graph_params & params) :
    arch          (params.arch),
    hparams       (params.hparams),
    cparams       (params.cparams),
    ubatch        (params.ubatch),
    n_embd        (hparams.n_embd),
    n_layer       (hparams.n_layer),
    n_rot         (hparams.n_rot),
    n_ctx         (cparams.n_ctx),
    n_head        (hparams.n_head()),
    n_head_kv     (hparams.n_head_kv()),
    n_embd_head_k (hparams.n_embd_head_k),
    n_embd_k_gqa  (hparams.n_embd_k_gqa()),
    n_embd_head_v (hparams.n_embd_head_v),
    n_embd_v_gqa  (hparams.n_embd_v_gqa()),
    n_expert      (hparams.n_expert),
    n_expert_used (cparams.warmup ? hparams.n_expert : hparams.n_expert_used),
    freq_base     (cparams.rope_freq_base),
    freq_scale    (cparams.rope_freq_scale),
    ext_factor    (cparams.yarn_ext_factor),
    attn_factor   (cparams.yarn_attn_factor),
    beta_fast     (cparams.yarn_beta_fast),
    beta_slow     (cparams.yarn_beta_slow),
    norm_eps      (hparams.f_norm_eps),
    norm_rms_eps  (hparams.f_norm_rms_eps),
    n_tokens      (ubatch.n_tokens),
    n_outputs     (params.n_outputs),
    n_ctx_orig    (cparams.n_ctx_orig_yarn),
    pooling_type  (cparams.pooling_type),
    rope_type     (hparams.rope_type),
    sched         (params.sched),
    backend_cpu   (params.backend_cpu),
    cvec          (params.cvec),
    loras         (params.loras),
    mctx          (params.mctx),
    cross         (params.cross),
    cb_func       (params.cb),
    res           (params.res),
    ctx0          (res->get_ctx()),
    gf            (res->get_gf()) {
    res->set_params(params);
}
```
|
@jukofyork confirmed. This fixes warmup for me. It also restores GLM-4.5 to the performance levels I've come to expect from llama.cpp.

Startup command:

```bash
./build/bin/llama-server \
    --model /data/GLM-4.5-GGUF/q4_k_m/GLM-4.5-Q4_K_M.gguf \
    --alias GLM-4.5-GGUF:q4_k_m \
    --no-webui \
    --numa numactl \
    --threads 32 \
    --ctx-size 131072 \
    --n-gpu-layers 94 \
    -ot "blk\.(3|4|5|6|7|8|9|10|11|12|13|14|15|16|17)\.ffn_.*=CUDA0" \
    -ot exps=CPU \
    -ub 4096 -b 4096 \
    --seed 3407 \
    --temp 0.6 \
    --top-p 1.0 \
    --log-colors \
    --flash-attn \
    --host 0.0.0.0 \
    --jinja \
    --port 11434
```

I had GLM-4.5 write a poem for you:
|
No problem, and I can confirm it's running as expected for me now too (~6.5 tokens/s generation). I've managed to transplant the vocab into
so assuming it trains OK, we should have a draft model in a day or so. It actually looks to have transplanted very well: even the untrained draft is getting a high acceptance rate for refactoring tasks:
|
Yesterday a bug was found in vLLM for these models and it was patched; the PR in question is vllm-project/vllm#22203. Does anyone know if this implementation uses float32 for the self.gate module? If not, it might need a similar fix. |
isn't this related?
|
That's just for |
I am not sure how to see the difference. Should the perplexity change? I tried the following fix, but the perplexity stays the same
|
Just checked and it's the
```python
router_logits, _ = self.gate(hidden_states.to(dtype=torch.float32))
```
vs
```python
router_logits, _ = self.gate(hidden_states)
```
which I think is always kept as |
```python
# Conditions should closely match those in llama_model_quantize_internal in llama.cpp
# Some tensor types are always in float32
if data_qtype is False and (
    any(
        self.match_model_tensor_name(new_name, key, bid)
        for key in (
            gguf.MODEL_TENSOR.FFN_GATE_INP,
            gguf.MODEL_TENSOR.POS_EMBD,
            gguf.MODEL_TENSOR.TOKEN_TYPES,
            gguf.MODEL_TENSOR.SSM_CONV1D,
            gguf.MODEL_TENSOR.SHORTCONV_CONV,
            gguf.MODEL_TENSOR.TIME_MIX_FIRST,
            gguf.MODEL_TENSOR.TIME_MIX_W1,
            gguf.MODEL_TENSOR.TIME_MIX_W2,
            gguf.MODEL_TENSOR.TIME_MIX_DECAY_W1,
            gguf.MODEL_TENSOR.TIME_MIX_DECAY_W2,
            gguf.MODEL_TENSOR.TIME_MIX_LERP_FUSED,
            gguf.MODEL_TENSOR.POSNET_NORM1,
            gguf.MODEL_TENSOR.POSNET_NORM2,
            gguf.MODEL_TENSOR.V_ENC_EMBD_POS,
            gguf.MODEL_TENSOR.A_ENC_EMBD_POS,
            gguf.MODEL_TENSOR.ALTUP_CORRECT_COEF,
            gguf.MODEL_TENSOR.ALTUP_PREDICT_COEF,
        )
    )
    or not new_name.endswith(".weight")
):
    data_qtype = gguf.GGMLQuantizationType.F32
```

and the corresponding check on the llama.cpp side:

```cpp
// do not quantize expert gating tensors
// NOTE: can't use LLM_TN here because the layer number is not known
quantize &= name.find("ffn_gate_inp.weight") == std::string::npos;
```

then IIRC, in the backends any time a |
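To make that last point concrete, here is a minimal ggml-style sketch of the router path; it is an illustration of the behaviour described above rather than code from this PR:

```cpp
// ffn_gate_inp stays F32 (enforced by both the converter and the quantizer, as shown
// above), and ggml_mul_mat always produces an F32 result, so the router logits and
// the expert selection that follows run in full precision regardless of how the
// rest of the model is quantized.
ggml_tensor * router_logits = ggml_mul_mat(ctx0, model.layers[il].ffn_gate_inp, cur); // [n_expert, n_tokens]
ggml_tensor * router_probs  = ggml_sigmoid(ctx0, router_logits); // or softmax, per hparams.expert_gating_func
```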
@jukofyork Thanks for checking, then all is good. |
Add support for the newly released GLM 4.5 family of models.
Core Architecture
Model Loading (src/llama-model.cpp)
Conversion Support (convert_hf_to_gguf.py)
Technical Details
MoE Architecture
Model Variants
The NextN/MTP prediction tensors are preserved during conversion but marked as unused since llama.cpp does not yet support multi-token prediction.
Testing
CI scripts run locally (CPU only) have two failing tests that I believe are unrelated to this change (please tell me if this isn't the case!):
gguf-dump
Disclaimer:
Hopefully resolves #14921