Conversation

@alyosha-swamy
Contributor

Summary

This PR adds support for the AFMoE (Arcee Foundational Mixture of Experts) model architecture for the upcoming Trinity-Mini and Trinity-Nano releases. AFMoE is a decoder-only transformer with a sparse Mixture of Experts (MoE) design that combines token-choice routing, shared experts, and several architectural refinements for efficient inference and improved performance.

Model Description

AFMoE features the following key architectural components:

  • Mixture of Experts with Shared Experts: Combines routed experts (activated per-token via learned routing) with always-active shared experts for stable base computation

  • Token-Choice Routing: Uses sigmoid- or softmax-based routing with normalization and scaling for expert selection (see the sketch after this list)

  • Q/K Normalization and Gating: Applies RMSNorm to query and key projections and uses sigmoid gating on attention outputs for improved training stability

  • Hybrid Attention Patterns: Alternates between sliding window attention and full attention across layers for efficiency with long contexts

  • Dual Normalization: Uses pre- and post-normalization around both attention and MLP blocks for training stability

  • Configurable Dense Layers: Allows initial layers to use dense MLPs before transitioning to sparse MoE layers (num_dense_layers)
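To make the routing description concrete, here is a minimal, self-contained sketch of token-choice routing combined with an always-active shared expert, as referenced in the list above. It is illustrative only: the class name, layer sizes, SiLU activation, epsilon, and the sigmoid/normalization choices are assumptions for the sketch, not the PR's actual AfmoeMoE or AfmoeTokenChoiceRouter code.

```python
import torch
import torch.nn as nn


class TokenChoiceMoESketch(nn.Module):
    """Illustrative only: per-token routed experts plus an always-active shared expert."""

    def __init__(self, hidden_size=64, intermediate_size=128, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)  # router logits
        make_ffn = lambda: nn.Sequential(
            nn.Linear(hidden_size, intermediate_size), nn.SiLU(), nn.Linear(intermediate_size, hidden_size)
        )
        self.experts = nn.ModuleList(make_ffn() for _ in range(num_experts))
        self.shared_expert = make_ffn()  # runs on every token: a stable dense path next to the sparse one

    def forward(self, hidden_states):
        batch, seq_len, hidden = hidden_states.shape
        flat = hidden_states.reshape(-1, hidden)
        scores = torch.sigmoid(self.gate(flat))                 # sigmoid routing (softmax is the other option)
        top_scores, selected = torch.topk(scores, k=self.top_k, dim=-1)
        top_scores = top_scores / (top_scores.sum(dim=-1, keepdim=True) + 1e-9)  # normalize the selected scores
        routed = torch.zeros_like(flat)
        for expert_idx, expert in enumerate(self.experts):
            token_idx, slot = (selected == expert_idx).nonzero(as_tuple=True)
            if token_idx.numel():
                routed[token_idx] += top_scores[token_idx, slot, None] * expert(flat[token_idx])
        return (routed + self.shared_expert(flat)).reshape(batch, seq_len, hidden)


# quick smoke test
out = TokenChoiceMoESketch()(torch.randn(2, 5, 64))
print(out.shape)  # torch.Size([2, 5, 64])
```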

Implementation Details

  • Modular implementation leveraging transformers' modular architecture:

    • Efficient AfmoeRMSNorm for layer normalization

    • AfmoeRotaryEmbedding for positional encoding

    • AfmoeAttention class implementing Q/K normalization and output gating

    • AfmoeTokenChoiceRouter for expert selection

    • AfmoeMoE class implementing shared + routed experts architecture

    • AfmoeDecoderLayer integrating attention and MoE blocks with dual normalization (sketched below)
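The dual-normalization layout of the decoder layer, sketched below as referenced above, is the "normalize in, normalize out, then residual add" pattern around both sub-blocks. This is not the PR's AfmoeDecoderLayer: it assumes torch.nn.RMSNorm (PyTorch ≥ 2.4) as a stand-in for AfmoeRMSNorm, and the attribute names and plug-in attention/mlp modules are invented for illustration.

```python
import torch
import torch.nn as nn


class DualNormDecoderLayerSketch(nn.Module):
    """Illustrative only: pre- and post-normalization around both the attention and the MLP/MoE block."""

    def __init__(self, hidden_size, attention, mlp):
        super().__init__()
        self.pre_attn_norm = nn.RMSNorm(hidden_size)
        self.post_attn_norm = nn.RMSNorm(hidden_size)
        self.pre_mlp_norm = nn.RMSNorm(hidden_size)
        self.post_mlp_norm = nn.RMSNorm(hidden_size)
        self.attention = attention  # e.g. an attention module with Q/K norm and output gating
        self.mlp = mlp              # dense MLP in the first num_dense_layers layers, MoE block afterwards

    def forward(self, hidden_states, **attn_kwargs):
        # Attention block: normalize in, attend, normalize out, then add the residual.
        residual = hidden_states
        hidden_states = self.attention(self.pre_attn_norm(hidden_states), **attn_kwargs)
        hidden_states = residual + self.post_attn_norm(hidden_states)
        # MLP / MoE block with the same dual-norm pattern.
        residual = hidden_states
        hidden_states = self.mlp(self.pre_mlp_norm(hidden_states))
        return residual + self.post_mlp_norm(hidden_states)


# quick smoke test with identity stand-ins for attention and mlp
layer = DualNormDecoderLayerSketch(64, attention=nn.Identity(), mlp=nn.Identity())
print(layer(torch.randn(2, 5, 64)).shape)  # torch.Size([2, 5, 64])
```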

Testing

  • Added comprehensive test suite following standard transformers test patterns
  • Tests for core functionality:
    • Model initialization and weight loading
    • Forward and backward passes
    • Attention mechanism (sliding window + full attention patterns)
    • MoE routing and expert selection
    • RoPE embeddings
    • KV cache compatibility
  • Integration tests with example checkpoints
  • Verified compatibility with existing transformer infrastructure
  • Model loading and inference verified with arcee-ai/Trinity-Mini

Documentation

  • Comprehensive model documentation in docs/source/en/model_doc/afmoe.md
  • Detailed architecture descriptions and usage examples
  • All configuration parameters documented with clear descriptions
  • Example code for both Pipeline and AutoModel usage patterns (a brief sketch follows below)
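The authoritative snippets live in docs/source/en/model_doc/afmoe.md; the sketch below only indicates the general shape of the two usage patterns, using the arcee-ai/Trinity-Mini checkpoint mentioned in the Testing section (prompt text and generation arguments are placeholders).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Pipeline usage
generator = pipeline("text-generation", model="arcee-ai/Trinity-Mini")
print(generator("Mixture-of-experts models work by", max_new_tokens=64)[0]["generated_text"])

# AutoModel usage
tokenizer = AutoTokenizer.from_pretrained("arcee-ai/Trinity-Mini")
model = AutoModelForCausalLM.from_pretrained("arcee-ai/Trinity-Mini")
inputs = tokenizer("Mixture-of-experts models work by", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```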

@alyosha-swamy alyosha-swamy force-pushed the add_afmoe_model branch 4 times, most recently from 6b08d17 to e3ad5e9 Compare November 12, 2025 19:23
Collaborator

@ArthurZucker ArthurZucker left a comment

nice work!

Comment on lines 47 to 92
def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)


def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)


def apply_rotary_pos_emb(q, k, cos, sin, position_ids: Optional[torch.Tensor] = None, unsqueeze_dim: int = 1):
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed


def eager_attention_forward(
    module: nn.Module,
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    attention_mask: Optional[torch.Tensor],
    scaling: float,
    dropout: float = 0.0,
    **kwargs,
):
    key_states = repeat_kv(key, module.num_key_value_groups)
    value_states = repeat_kv(value, module.num_key_value_groups)

    attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
    if attention_mask is not None:
        causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
        attn_weights = attn_weights + causal_mask

    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
    attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
    attn_output = torch.matmul(attn_weights, value_states)
    attn_output = attn_output.transpose(1, 2).contiguous()

    return attn_output, attn_weights
Collaborator

these can also be imported from Llama! 😉
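For reference, one way to follow this suggestion would be to import the helpers in modular_afmoe.py rather than redefine them; these names exist in transformers' Llama implementation, though the exact import the PR settled on is not shown here.

```python
# Hedged sketch: reuse the Llama helpers instead of redefining them in modular_afmoe.py.
from transformers.models.llama.modeling_llama import (
    apply_rotary_pos_emb,
    eager_attention_forward,
    repeat_kv,
    rotate_half,
)
```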

top_scores, selected_experts = self.router(hidden_states, self.expert_bias)

# Process through shared experts
if self.shared_experts is not None:
Collaborator

same comment, is this used by the released model or not?

Collaborator

not addressed

Contributor Author

the first layer is a standard dense FFN, and all subsequent layers use the MoE block

Collaborator

In that case the arch should be different! Use a normal MLP for mlp and experts for the experts! 🤗
You can set the first layer, then just do +=; we want to avoid code paths as much as possible.

Comment on lines +339 to +344
# MoE or dense FFN
self.moe_enabled = layer_idx >= config.num_dense_layers
if self.moe_enabled:
    self.mlp = AfmoeMoE(config)
else:
    self.mlp = AfmoeMLP(config)
Collaborator

is moe disabled on any of the released ckpts? 🤗

Collaborator

again here

@alyosha-swamy alyosha-swamy force-pushed the add_afmoe_model branch 2 times, most recently from 8c6bdb4 to 045776d Compare November 21, 2025 16:42
This mirrors the Experts pattern used across other MoE models to ease checkpoint conversion.
"""

_checkpoint_conversion_mapping = {"experts": "experts"}
Collaborator

Suggested change
_checkpoint_conversion_mapping = {"experts": "experts"}

top_scores, selected_experts = self.router(hidden_states, self.expert_bias)

# Process through shared experts
if self.shared_experts is not None:
Collaborator

not addressed

key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)

if self.is_local_attention:
Collaborator

I did not get an answer

Comment on lines +339 to +344
# MoE or dense FFN
self.moe_enabled = layer_idx >= config.num_dense_layers
if self.moe_enabled:
    self.mlp = AfmoeMoE(config)
else:
    self.mlp = AfmoeMLP(config)
Collaborator

again here

tests_output.txt Outdated
Collaborator

to remove

Collaborator

@ArthurZucker ArthurZucker left a comment

Thanks, we tend to try and remove code paths as much as possible; if not done here we'll do it post-release!

_, selected_experts = torch.topk(scores + expert_bias, k=self.top_k, dim=1)
top_scores = scores.gather(dim=1, index=selected_experts)

if self.route_norm:
Collaborator

is this always True or False? (cf removing code path :)
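For context, route_norm presumably gates a renormalization of the gathered top-k scores; a tiny illustration of that kind of step follows (the PR's exact formula and epsilon are not reproduced here).

```python
import torch

# Hypothetical route_norm-style step: rescale each token's top-k routing scores so they sum to 1.
top_scores = torch.tensor([[0.6, 0.3], [0.2, 0.2]])  # [num_tokens, top_k] gathered scores
top_scores = top_scores / (top_scores.sum(dim=-1, keepdim=True) + 1e-20)
print(top_scores)  # each row now sums to 1
```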

        return top_scores, selected_experts


class AfmoeExperts(nn.ModuleList):
Collaborator

you could just inherit from Mixtral or Qwen2Moe, it should be the same, no?

Contributor Author

The checkpoint weight structure is different in AFMoE

Collaborator

We have an online weight converter, but no worries :)

Comment on lines 418 to 427
if isinstance(module, nn.Linear):
    module.weight.normal_(mean=0.0, std=self.config.initializer_range)
    if module.bias is not None:
        module.bias.zero_()
elif isinstance(module, nn.Embedding):
    module.weight.normal_(mean=0.0, std=self.config.initializer_range)
    if module.padding_idx is not None:
        module.weight[module.padding_idx].zero_()
elif isinstance(module, AfmoeRMSNorm):
    module.weight.fill_(1.0)
Collaborator

these should not be used, can you use nn.init instead please! One of the CI checks will fail, as we require this for inits!
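A hedged sketch of an nn.init-based version of the block above (AfmoeRMSNorm and self.config come from the PR's modeling file; the final implementation may differ):

```python
from torch import nn
from torch.nn import init


def _init_weights(self, module):
    # Same scheme as the snippet above, expressed through torch.nn.init helpers.
    std = self.config.initializer_range
    if isinstance(module, nn.Linear):
        init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        init.normal_(module.weight, mean=0.0, std=std)
        if module.padding_idx is not None:
            module.weight.data[module.padding_idx].zero_()
    elif isinstance(module, AfmoeRMSNorm):  # RMSNorm class defined in the PR
        init.ones_(module.weight)
```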

@ArthurZucker
Collaborator

  • FAILED tests/models/afmoe/test_modeling_afmoe.py::AfmoeModelTest::test_attention_outputs - TypeError: object of type 'NoneType' has no len()
  • FAILED tests/models/afmoe/test_modeling_afmoe.py::AfmoeModelTest::test_prompt_lookup_decoding_matches_greedy_search - TypeError: 'NoneType' object is not subscriptable
  • FAILED tests/models/afmoe/test_modeling_afmoe.py::AfmoeModelTest::test_sample_generate_dict_output - AssertionError: Lists differ: [False, False, False] != [True, True, True]

I think the output attention recorder is wrong. Once fixed we can merge.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Remove shared-expert if/else as it defaults to 2
Remove `route_norm` as it defaults to `True`.

Make tests smaller and faster
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: afmoe, auto

@ArthurZucker ArthurZucker merged commit cac0a28 into huggingface:main Nov 29, 2025
16 of 21 checks passed
@LysandreJik
Member

It seems like this model wasn't added to src/transformers/models/__init__.py?

@Rocketknight1
Member

Seems like it - how did the CI pass?
