model: Add support for GLM 4.5 family of models (#14921) #14939
Conversation
Force-pushed from 5da3811 to ec5c193
Just a few quick notes from a glance:
Will do a proper review when you are ready. :) |
Hey @CISC, no worries on the naming etc., will do. |
Force-pushed from c4dbf69 to b4c60e1
FYI when trying to run |
That's because converting FP8 weights isn't supported yet, see #14810 |
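For anyone curious what that support involves: FP8 checkpoints store E4M3 bytes plus per-block scales, so the converter has to dequantize before it can write BF16/F16 tensors. Below is a minimal C++ sketch of just the decode step, assuming a DeepSeek-style blockwise FP8 layout with `weight_scale_inv` scale tensors; that layout is an assumption here, and this is not code from this PR or from #14810.

```cpp
#include <cstdint>
#include <cmath>

// Decode one FP8 E4M3 value (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits).
// E4M3 has no infinities; exponent=15 with mantissa=7 encodes NaN.
static float fp8_e4m3_to_f32(uint8_t v) {
    const int sign = (v >> 7) & 1;
    const int exp  = (v >> 3) & 0xF;
    const int man  =  v       & 0x7;
    float out;
    if (exp == 0) {
        out = std::ldexp((float) man / 8.0f, -6);              // subnormal: (m/8) * 2^-6
    } else if (exp == 0xF && man == 0x7) {
        out = NAN;                                              // NaN encoding
    } else {
        out = std::ldexp(1.0f + (float) man / 8.0f, exp - 7);   // normal: (1 + m/8) * 2^(e-7)
    }
    return sign ? -out : out;
}

// Blockwise dequantization then amounts to (scale names are an assumption):
//   w_f32[i] = fp8_e4m3_to_f32(w_fp8[i]) * scale_for_block_of(i)   // from weight_scale_inv
```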
Force-pushed from 1957023 to 4397ccb
I'm close to having convert_hf_to_gguf.py and llama-quantize working (see the updated PR): conversion completes without error and I was then able to quantise to Q4_K_M. gguf-dump worked, but llama-server picked up a tensor mapping issue with token_embd.weight, so I've just put a fix into convert_hf_to_gguf.py. I'm going through the whole conversion-then-quantisation process again; it's getting late here (Hi from Melbourne 👋), so I'll come back and see if it's finished in ~20. |
The LLM_TYPE code is wrong, those models aren't (respectively) dense 12B and 32B models. You have to add new MoE constants for them (see Qwen3 and Ernie MoEs as examples). |
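For illustration, roughly what that looks like in src/llama-model.cpp; the constant names and the layer counts in the switch are assumptions for the sketch, not necessarily what this PR ends up merging:

```cpp
// Hypothetical additions to the llm_type enum, mirroring other MoE models:
//   LLM_TYPE_106B_A12B  -> GLM-4.5-Air (~106B total, ~12B active parameters)
//   LLM_TYPE_355B_A32B  -> GLM-4.5     (~355B total, ~32B active parameters)
// and in llama_model::load_hparams() for the GLM 4.5 MoE arch:
switch (hparams.n_layer) {
    case 46: type = LLM_TYPE_106B_A12B; break; // GLM-4.5-Air (layer count assumed)
    case 92: type = LLM_TYPE_355B_A32B; break; // GLM-4.5     (layer count assumed)
    default: type = LLM_TYPE_UNKNOWN;
}
```

The published sizes are roughly 355B total / 32B active for GLM-4.5 and 106B total / 12B active for GLM-4.5-Air, which is what the A-suffix naming convention conveys.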
Also, you might want to include the nextn tensors instead of throwing them out - MTP support is not there yet, but that way you won't have to reconvert and requantize if/when it arrives. |
Thanks @pwilkin, LLM_TYPE updated. I've added the nextn tensors into the conversion, skipping mapping to avoid errors. |
Note that preserving the nextn tensors does result in a larger GGUF (780 tensors -> 1184 & 214GB -> 221GB for the f16) |
I can't replicate that error @Thireus |
Obviously, but they won't get loaded since they're not supported 😄 Also, don't make my mistake: don't convert to f16, use --outtype bf16 or your model will probably have errors in the tensors. |
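The likely reason, as a hedged aside: f16 overflows to infinity above 65504, while bf16 keeps float32's full exponent range (at lower mantissa precision), so bf16-sourced weights with large magnitudes survive --outtype bf16 but not f16. A tiny self-contained illustration using ggml's conversion helpers:

```cpp
#include <cstdio>
#include "ggml.h"

int main() {
    const float big = 1.0e5f; // larger than the f16 maximum of 65504
    // f16 round-trip overflows to inf; bf16 round-trip stays close to the original value.
    printf("f16  round-trip: %g\n", ggml_fp16_to_fp32(ggml_fp32_to_fp16(big)));
    printf("bf16 round-trip: %g\n", ggml_bf16_to_fp32(ggml_fp32_to_bf16(big)));
    return 0;
}
```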
If you add unused tensors to the GGUF you must mark those tensors as unused. Just FYI, all other models with MTP so far have those tensors stripped. |
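A rough sketch of that idea, using the TENSOR_SKIP loader flag this PR later adds; the LLM_TENSOR_NEXTN_* names and the shapes below are illustrative assumptions, not the merged code:

```cpp
// Create the NextN/MTP tensors with a skip flag so they remain in the GGUF
// but are never allocated into the compute graph (MTP is not implemented yet).
if (hparams.num_nextn_predict_layers > 0) {
    const int flags = llama_model_loader::TENSOR_SKIP;
    layer.nextn_eh_proj = create_tensor(tn(LLM_TENSOR_NEXTN_EH_PROJ, "weight", il), {2*n_embd, n_embd}, flags);
    layer.nextn_enorm   = create_tensor(tn(LLM_TENSOR_NEXTN_ENORM,   "weight", il), {n_embd},           flags);
    // ... embed_tokens, hnorm, shared_head.head and shared_head.norm handled the same way
}
```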
Ah, that'd explain why I'm getting that. I'll have to come back to this in the morning as it's getting late here. If anyone is keen for this ASAP and has improvements, feel free to either raise a PR against my branch or pull my commits into a PR of your own if you have a better approach, and I'll review in the morning. |
I'll just put it out there right now; no-one should make GGUFs from this PR public yet, there will be changes! :) |
Absolutely, I hope people do not do that - it's very much in draft and I'm learning as I go. |
Force-pushed from 9d6ea41 to 7f026fb
@sammcj, 7f026fb#diff-4f653096980bd7d10518aa909cb648452cd3aa380ff93cb9fb642dca48536526 fixed the issue, thanks. |
the fix seems to work, still testing -> INFO:hf-to-gguf:Model successfully exported to models/glm-45-air-f16.gguf |
Is there a way to disable thinking on this model through a parameter? |
Yes, the template supports |
For people having the same question as I did: Make sure you use |
A big thank you to @CISC for all your hard work on this one! 🙇 |
I'm a bit confused now as @sammcj posted this on Reddit not long ago:
Is there a working jinja template somewhere? |
Thanks! |
@CISC I narrowed down the gibberish issue a bit. It requires setting --batch-size 4096 --ubatch-size 4096 and possibly having a long multi-turn chat going. When I removed the batch-size / ubatch-size, my 40k and 50k token chats began working again. Setting the sizes up to 2048 / 2048 also worked. Something about 4096 / 4096 combined with over 32k context across multiple turns leads to that gibberish edge case. I also tried a needle in a haystack test with a 35k token prompt with a direction to answer a question from the text as a one-shot and that worked. So I don't have a reproducible smoking gun, but batch-size / ubatch-size is involved and for now I'm just scaling them back to make it work. |
Ah, ok, so that means it's not a model issue then, that's great! Submit an issue though. :) |
Just FYI for anyone wanting to create i-quants: as the final layer will not get imatrix data until MTP is supported, it has to be overridden for lower quants to work, e.g. using |
I am getting over 45 t/s on three 3090s with the Unsloth Q4 quant of GLM-4.5-Air; here is the optimized command:
|
I can confirm it's not warming up. Manually setting
If I patch
```cpp
uint32_t llama_context::graph_max_nodes() const {
    //return std::max<uint32_t>(1024u, 8u*model.n_tensors());
    return std::max<uint32_t>(65536u, 8u*model.n_tensors());
}
```
and then run with
You then need to rerun without
I've got to go out so no more time to investigate until later. |
Actually, no it's still not warming up properly - it's just a lot quicker because it's got the experts mmapped I think... Will see if I can figure it out later if nobody else has by then. |
* model: Add GLM 4.5 (ggml-org#14921)
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Merge in PR suggestions
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* model: Add GLM 4.5 family of models (ggml-org#14921)
  1. Updated tensor_mapping.py with NextN tensor mappings
     - Added proper tensor mappings for all NextN/MTP tensors in /Users/samm/git/llama.cpp/gguf-py/gguf/tensor_mapping.py
     - Added mappings for: eh_proj, embed_tokens, enorm, hnorm, shared_head.head, shared_head.norm
  2. Added num_nextn_predict_layers configuration
     - Added LLM_KV_NUM_NEXTN_PREDICT_LAYERS constant to llama-arch.h and llama-arch.cpp
     - Added num_nextn_predict_layers field to llama_hparams struct
     - Updated GLM4_MOE parameter loading in llama-model.cpp to read this parameter
     - Modified tensor loading logic to conditionally load NextN tensors based on num_nextn_predict_layers
     - Added GGUF writer support in gguf_writer.py with add_num_nextn_predict_layers() method
     - Updated conversion script to extract and write this parameter from HuggingFace config
  3. Added FIM tokens for GLM4_MOE
     - Added GLM-4.5's FIM tokens to llama-vocab.cpp:
       - <|code_prefix|> for FIM_PRE
       - <|code_suffix|> for FIM_SUF
       - <|code_middle|> for FIM_MID
  4. Removed manual NextN tensor handling
     - Removed the special-case handling in convert_hf_to_gguf.py that manually mapped NextN tensors
     - NextN tensors are now handled automatically through the proper tensor mapping system
* glm 4.5 update tensors names
* model: glm 4.5 apply suggestions from code review
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* model: glm 4.5 apply suggestions from code review
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* model: glm 4.5 apply suggestions from code review
* Apply suggestions from code review
* patch broken chat template
* typings fix
* add TENSOR_SKIP flag
  Co-authored-by: Diego Devesa <slarengh@gmail.com>
* Update src/llama-model-loader.h
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
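As a concrete illustration of item 2 in the commit message above, the hparams wiring amounts to roughly the following; this is a sketch using the key and field names from the commit message, and the merged code may spell things differently:

```cpp
// llama-model.cpp, inside llama_model::load_hparams(), for the GLM 4.5 MoE arch:
// read the number of NextN/MTP prediction layers written by the converter.
ml.get_key(LLM_KV_NUM_NEXTN_PREDICT_LAYERS, hparams.num_nextn_predict_layers);

// later, during tensor setup: only create the NextN tensors if the model has them
if (hparams.num_nextn_predict_layers > 0) {
    // NextN tensor creation goes here (see the TENSOR_SKIP discussion above)
}
```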
I've found it:

```cpp
// MoE layer with shared experts
//const int64_t n_expert      = hparams.n_expert;
//const int64_t n_expert_used = hparams.n_expert_used;

// Process routed experts using existing MoE infrastructure
ggml_tensor * routed_out = build_moe_ffn(cur,
        model.layers[il].ffn_gate_inp,
        model.layers[il].ffn_up_exps,
        model.layers[il].ffn_gate_exps,
        model.layers[il].ffn_down_exps,
        model.layers[il].ffn_exp_probs_b,
        n_expert, n_expert_used,
        LLM_FFN_SILU, hparams.expert_weights_norm,
        true, hparams.expert_weights_scale,
        (llama_expert_gating_func_type) hparams.expert_gating_func,
        il);
cb(routed_out, "ffn_moe_out", il);
```

The local `n_expert` and `n_expert_used` were shadowing the members set up in the graph-context constructor, which is where warmup bumps `n_expert_used` up to `n_expert`:

```cpp
llm_graph_context::llm_graph_context(const llm_graph_params & params) :
    arch          (params.arch),
    hparams       (params.hparams),
    cparams       (params.cparams),
    ubatch        (params.ubatch),
    n_embd        (hparams.n_embd),
    n_layer       (hparams.n_layer),
    n_rot         (hparams.n_rot),
    n_ctx         (cparams.n_ctx),
    n_head        (hparams.n_head()),
    n_head_kv     (hparams.n_head_kv()),
    n_embd_head_k (hparams.n_embd_head_k),
    n_embd_k_gqa  (hparams.n_embd_k_gqa()),
    n_embd_head_v (hparams.n_embd_head_v),
    n_embd_v_gqa  (hparams.n_embd_v_gqa()),
    n_expert      (hparams.n_expert),
    n_expert_used (cparams.warmup ? hparams.n_expert : hparams.n_expert_used),
    freq_base     (cparams.rope_freq_base),
    freq_scale    (cparams.rope_freq_scale),
    ext_factor    (cparams.yarn_ext_factor),
    attn_factor   (cparams.yarn_attn_factor),
    beta_fast     (cparams.yarn_beta_fast),
    beta_slow     (cparams.yarn_beta_slow),
    norm_eps      (hparams.f_norm_eps),
    norm_rms_eps  (hparams.f_norm_rms_eps),
    n_tokens      (ubatch.n_tokens),
    n_outputs     (params.n_outputs),
    n_ctx_orig    (cparams.n_ctx_orig_yarn),
    pooling_type  (cparams.pooling_type),
    rope_type     (hparams.rope_type),
    sched         (params.sched),
    backend_cpu   (params.backend_cpu),
    cvec          (params.cvec),
    loras         (params.loras),
    mctx          (params.mctx),
    cross         (params.cross),
    cb_func       (params.cb),
    res           (params.res),
    ctx0          (res->get_ctx()),
    gf            (res->get_gf()) {
    res->set_params(params);
}
```
|
@jukofyork confirmed. This fixes warmup for me. It also restores GLM-4.5 to the performance levels I've come to expect from llama.cpp.

Startup command:

```bash
./build/bin/llama-server \
    --model /data/GLM-4.5-GGUF/q4_k_m/GLM-4.5-Q4_K_M.gguf \
    --alias GLM-4.5-GGUF:q4_k_m \
    --no-webui \
    --numa numactl \
    --threads 32 \
    --ctx-size 131072 \
    --n-gpu-layers 94 \
    -ot "blk\.(3|4|5|6|7|8|9|10|11|12|13|14|15|16|17)\.ffn_.*=CUDA0" \
    -ot exps=CPU \
    -ub 4096 -b 4096 \
    --seed 3407 \
    --temp 0.6 \
    --top-p 1.0 \
    --log-colors \
    --flash-attn \
    --host 0.0.0.0 \
    --jinja \
    --port 11434
```

I had GLM-4.5 write a poem for you:
|
No problem, and I can confirm it's running as expected for me now too (~6.5 tokens/s generation). I've managed to transplant the vocab into
so assuming it trains OK, we should have a draft model in a day or so. It actually looks to have transplanted very well: even the untrained draft is getting a high acceptance rate for refactoring tasks:
|
Yesterday a bug was found in vLLM for these models and it was patched; the PR in question is vllm-project/vllm#22203. Does anyone know if this implementation uses float32 for the self.gate module? If not, it might need a similar fix. |
isn't this related?
|
That's just for |
I am not sure how to see the difference. Should the perplexity change? I tried the following fix, but the perplexity stays the same
|
Just checked and it's the
```python
router_logits, _ = self.gate(hidden_states.to(dtype=torch.float32))
```
vs
```python
router_logits, _ = self.gate(hidden_states)
```
which I think is always kept as |
```python
# Conditions should closely match those in llama_model_quantize_internal in llama.cpp
# Some tensor types are always in float32
if data_qtype is False and (
    any(
        self.match_model_tensor_name(new_name, key, bid)
        for key in (
            gguf.MODEL_TENSOR.FFN_GATE_INP,
            gguf.MODEL_TENSOR.POS_EMBD,
            gguf.MODEL_TENSOR.TOKEN_TYPES,
            gguf.MODEL_TENSOR.SSM_CONV1D,
            gguf.MODEL_TENSOR.SHORTCONV_CONV,
            gguf.MODEL_TENSOR.TIME_MIX_FIRST,
            gguf.MODEL_TENSOR.TIME_MIX_W1,
            gguf.MODEL_TENSOR.TIME_MIX_W2,
            gguf.MODEL_TENSOR.TIME_MIX_DECAY_W1,
            gguf.MODEL_TENSOR.TIME_MIX_DECAY_W2,
            gguf.MODEL_TENSOR.TIME_MIX_LERP_FUSED,
            gguf.MODEL_TENSOR.POSNET_NORM1,
            gguf.MODEL_TENSOR.POSNET_NORM2,
            gguf.MODEL_TENSOR.V_ENC_EMBD_POS,
            gguf.MODEL_TENSOR.A_ENC_EMBD_POS,
            gguf.MODEL_TENSOR.ALTUP_CORRECT_COEF,
            gguf.MODEL_TENSOR.ALTUP_PREDICT_COEF,
        )
    )
    or not new_name.endswith(".weight")
):
    data_qtype = gguf.GGMLQuantizationType.F32
```

and the corresponding check on the llama.cpp side:

```cpp
// do not quantize expert gating tensors
// NOTE: can't use LLM_TN here because the layer number is not known
quantize &= name.find("ffn_gate_inp.weight") == std::string::npos;
```

then IIRC, in the backends any time a |
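To make that last point concrete, here is a minimal ggml-style sketch of the router path; it is an illustration of the behaviour described above rather than code from this PR:

```cpp
// ffn_gate_inp stays F32 (enforced by both the converter and the quantizer, as shown
// above), and ggml_mul_mat always produces an F32 result, so the router logits and
// the expert selection that follows run in full precision regardless of how the
// rest of the model is quantized.
ggml_tensor * router_logits = ggml_mul_mat(ctx0, model.layers[il].ffn_gate_inp, cur); // [n_expert, n_tokens]
ggml_tensor * router_probs  = ggml_sigmoid(ctx0, router_logits); // or softmax, per hparams.expert_gating_func
```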
@jukofyork Thanks for checking, then all is good. |
Add support for the newly released GLM 4.5 family of models.
Core Architecture
Model Loading (src/llama-model.cpp)
Conversion Support (convert_hf_to_gguf.py)
Technical Details
MoE Architecture
Model Variants
The NextN/MTP prediction tensors are preserved during conversion but marked as unused since llama.cpp does not yet support multi-token prediction.
Testing
CI scripts run locally (CPU only) have two failing tests that I believe are unrelated to this change (please tell me if this isn't the case!):
gguf-dump
Disclaimer:
Hopefully resolves #14921