
Commit 2d9e584

[megatron] support mlp_padding_free (#5066)
1 parent e9d0456 commit 2d9e584

4 files changed, +37 -2 lines changed

docs/source/Instruction/Megatron-SWIFT训练.md

Lines changed: 3 additions & 1 deletion

````diff
@@ -46,7 +46,7 @@ modelscope-registry.us-west-1.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu2
 This section introduces a quick-start example of self-awareness fine-tuning of the Qwen2.5-7B-Instruct model using two 80GiB A100 GPUs; the following best practice can be completed within 10 minutes.
 
 First, we need to convert the weights from HF format to Megatron format:
-- If OOM occurs, simply remove `CUDA_VISIBLE_DEVICES=0`. If you run out of memory, remove `--test_convert_precision true`.
+- If OOM occurs, simply remove `CUDA_VISIBLE_DEVICES=0`; multiple GPUs will then be used automatically. If you run out of memory, remove `--test_convert_precision true`.
 ```shell
 CUDA_VISIBLE_DEVICES=0 \
 swift export \
@@ -429,6 +429,8 @@ Megatron training parameters inherit from Megatron parameters and basic parameters
 
 - add_version: Adds an extra directory `'<version>-<timestamp>'` under `save` to prevent overwriting weights. Defaults to True.
 - padding_free: Flattens the data in a batch to avoid padding, thereby reducing GPU memory usage and speeding up training. Defaults to True.
+  - If you want to customize the attention_mask, you can set `--padding_free false`.
+- mlp_padding_free: Defaults to None; if padding_free is True, it defaults to False, otherwise to True. It applies padding-free optimization to the MLP when padding_free is set to false, which improves training speed and reduces GPU memory usage while still allowing a custom attention_mask.
 - 🔥packing: Whether to use sequence packing. Defaults to False. Currently supports `megatron pt/sft`.
 - packing_cache: Specifies the packing cache directory. Defaults to `None`, meaning the cache is stored in the path given by the `$MODELSCOPE_CACHE` environment variable. When using packing across multiple nodes, make sure the packing cache path is shared and identical on all nodes; you can do this by setting the `MODELSCOPE_CACHE` environment variable or by adding `--packing_cache <shared_path>` on the command line.
   - Note: this parameter will be removed in "ms-swift>=3.7"; multi-node packing will no longer require setting packing_cache.
````

docs/source_en/Instruction/Megatron-SWIFT-Training.md

Lines changed: 3 additions & 1 deletion

````diff
@@ -47,7 +47,7 @@ The training module in the dependent library Megatron-LM will be cloned and installed
 This section introduces a quick start example for fine-tuning the self-awareness of the Qwen2.5-7B-Instruct model using two 80GiB A100 GPUs. The following best practices can be completed within 10 minutes.
 
 First, we need to convert the weights from HF (Hugging Face) format to Megatron format:
-- If OOM (Out of Memory) occurs, simply remove `CUDA_VISIBLE_DEVICES=0`. If you encounter insufficient memory, please remove `--test_convert_precision true`.
+- If OOM (Out of Memory) occurs, simply remove `CUDA_VISIBLE_DEVICES=0`; the system will automatically use multiple GPUs. If you encounter insufficient memory, please remove `--test_convert_precision true`.
 ```shell
 CUDA_VISIBLE_DEVICES=0 \
 swift export \
@@ -449,6 +449,8 @@ Megatron training parameters inherit from Megatron parameters and basic parameters
 
 - add_version: Adds a directory `<version>-<timestamp>` to `save` to prevent overwriting weights. Default is True.
 - padding_free: Flattens the data in a batch to avoid padding, thereby reducing memory usage and accelerating training. Default is True.
+  - If you wish to customize the attention_mask, you can set `--padding_free false`.
+- mlp_padding_free: The default is None; if padding_free is True, it defaults to False, otherwise to True. It applies padding-free optimization to the MLP when padding_free is set to false, improving training speed and reducing memory usage while still allowing a custom attention_mask.
 - 🔥packing: Whether to use sequence packing, defaults to False. Currently supports `megatron pt/sft`.
 - packing_cache: Specifies the directory for the packing cache. The default value is `None`, which means the cache will be stored in the path defined by the environment variable `$MODELSCOPE_CACHE`. When using the packing feature across multiple nodes, ensure that all nodes share the same packing cache directory. You can achieve this by setting the `MODELSCOPE_CACHE` environment variable or by adding the `--packing_cache <shared_path>` argument on the command line.
   - Note: This parameter will be removed in "ms-swift>=3.7". The `packing_cache` setting will no longer be required for multi-node packing.
````
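The default resolution of `mlp_padding_free` described above can be expressed in a few lines. Below is a minimal standalone sketch mirroring that rule; the helper function name is our own and not part of ms-swift:

```python
from typing import Optional


def resolve_mlp_padding_free(padding_free: bool, mlp_padding_free: Optional[bool] = None) -> bool:
    """Illustrative only: mirrors the documented default rule for mlp_padding_free."""
    if mlp_padding_free is None:
        # Defaults to the opposite of padding_free: when the whole batch is already
        # flattened, the MLP-only optimization has nothing extra to remove.
        return not padding_free
    return mlp_padding_free


assert resolve_mlp_padding_free(padding_free=True) is False
# A custom attention_mask (--padding_free false) still gets the padding-free MLP by default:
assert resolve_mlp_padding_free(padding_free=False) is True
assert resolve_mlp_padding_free(padding_free=False, mlp_padding_free=False) is False
```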

swift/megatron/argument/megatron_args.py

Lines changed: 3 additions & 0 deletions

```diff
@@ -77,6 +77,7 @@ class ExtraMegatronArguments(RLHFMegatronArgumentsMixin, MegatronTunerMixin):
     rope_scaling: Optional[Union[dict, str]] = None
     torch_dtype: Optional[torch.dtype] = None
     padding_free: bool = True
+    mlp_padding_free: Optional[bool] = None
     # streaming dataloader
     dataloader_persistent_workers: bool = True
     dataloader_prefetch_factor: int = 10
@@ -268,6 +269,8 @@ class MegatronArguments(ExtraMegatronArguments):
     extra_megatron_kwargs: Optional[Union[dict, str]] = None
 
     def _set_default(self):
+        if self.mlp_padding_free is None:
+            self.mlp_padding_free = not self.padding_free
         if self.lr is None:
             if self.train_type == 'full':
                 self.lr = 1e-5
```

swift/megatron/init.py

Lines changed: 28 additions & 0 deletions

```diff
@@ -686,11 +686,39 @@ def sharded_state_dict(
     peft_module.ModulesToSaveWrapper = NewModulesToSaveWrapper
 
 
+def _patch_TransformerLayer():
+    from megatron.core.transformer import TransformerLayer
+
+    def forward(self, *args, **kwargs):
+        """
+        Perform a forward pass through the transformer layer.
+
+        This method calls the core computation of a transformer layer, including
+        self-attention, cross-attention (if applicable), and feed-forward operations.
+        """
+        from megatron.training import get_args
+        hidden_states, context = self._forward_attention(*args, **kwargs)
+        args = get_args()
+        mlp_padding_free = args.mlp_padding_free and 'attention_mask' in kwargs
+        if mlp_padding_free:
+            mask = (kwargs['attention_mask'].sum(dim=(1, 3)) > 0).t()
+            hidden_states = hidden_states[mask][:, None]
+        output = self._forward_mlp(hidden_states, kwargs.get('inference_context', None))
+        if mlp_padding_free:
+            new_output = hidden_states.new_zeros((*mask.shape, output.shape[-1]))
+            new_output[mask] = output.squeeze(1)
+            output = new_output
+        return output, context
+
+    TransformerLayer.forward = forward
+
+
 def _patch_megatron():
     _patch_transformer_engine()
     _patch__batched_p2p_ops()
     _patch_mla_attention()
     _patch_TEGroupedLinear()
+    _patch_TransformerLayer()
     from swift.megatron import tuners  # patch lora
     try:
         _patch_peft_BaseTuner()
```
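To make the tensor shapes in `_patch_TransformerLayer` easier to follow, here is a small self-contained sketch of the same gather/scatter trick (our own illustration, not code from this commit): positions whose attention-mask row is entirely zero are packed away before the MLP and zero-filled back afterwards. The toy sizes, the `torch.nn.Linear` stand-in for the layer's MLP, and the mask convention (True = may attend) are assumptions made for the example.

```python
import torch

# Megatron-style layouts: hidden_states is [seq, batch, hidden],
# attention_mask is [batch, 1, seq, seq] (assumed here: True = may attend).
seq, batch, hidden = 6, 2, 4
hidden_states = torch.randn(seq, batch, hidden)

attention_mask = torch.ones(batch, 1, seq, seq, dtype=torch.bool)
attention_mask[1, :, 4:, :] = False  # last two positions of sample 1 are padding

# Same reduction as the patch: a position is kept if its mask row has any True.
# sum over dims (1, 3) -> [batch, seq]; .t() -> [seq, batch] to match hidden_states.
mask = (attention_mask.sum(dim=(1, 3)) > 0).t()

packed = hidden_states[mask][:, None]   # [num_kept_tokens, 1, hidden]
mlp = torch.nn.Linear(hidden, hidden)   # stand-in for the transformer layer's MLP
out = mlp(packed)

# Scatter back to [seq, batch, hidden]; padded positions stay zero.
restored = hidden_states.new_zeros((*mask.shape, out.shape[-1]))
restored[mask] = out.squeeze(1)

print(mask.sum().item(), "kept tokens out of", seq * batch)  # 10 kept tokens out of 12
print(restored.shape)                                        # torch.Size([6, 2, 4])
```

Only the MLP input is compressed; self-attention still runs on the full sequence with the custom attention_mask, which is why this optimization can coexist with `--padding_free false`.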
