
v3.7.0

Jintao-Huang released this on 07 Aug 07:05 · 179 commits to main since this release

New Features

  1. GRPO
    a. Added support for the GSPO algorithm: pass --importance_sampling_level sequence during GRPO training (see the command sketch after this list). Docs: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/GSPO.html
    b. GRPO server mode now supports multi-node rollout; pass multiple vllm_server_host/vllm_server_port values, one per rollout server (see the sketch after this list). Example script: https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/multi_node/server_multi_node.sh
    c. GRPO rollout is now compatible with the GYM environment interface (thanks to contributor Mouse). Docs: https://swift.readthedocs.io/en/latest/Instruction/GRPO/DeveloperGuide/gym_env.html
    d. Added entropy_mask to filter low-entropy tokens out of the loss computation; the logger now also tracks entropy dynamics. Docs: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/entropy_mask.html
    e. Added support for the multi-turn DeepEyes algorithm. Docs: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/deepeyes.html
    f. GRPO supports --truncation_strategy delete, which discards samples whose input length exceeds max_length and resamples (see the sketch after this list).
  2. Megatron-SWIFT
    a. Added LoRA training (currently CPT/SFT/DPO), which significantly accelerates MoE training.
    - Docs: https://swift.readthedocs.io/en/latest/Instruction/Megatron-SWIFT-Training.html#lora-training
    - Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/lora
    b. Added loss_scale support to facilitate agent training. Script: https://github.com/modelscope/ms-swift/blob/main/examples/train/megatron/lora/loss_scale.sh
    c. Default megatron-core upgraded to 0.13.
    d. Added bshd tensor format to facilitate custom attention_mask.
    e. Logging improvements: prints GPU memory, estimated remaining time, and writes logging.jsonl.
    f. Faster model loading & conversion plus a progress bar.
  3. Training
    a. Added Flash-Attention-3 support (including Megatron-SWIFT). Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/flash_attention_3
    b. New --new_special_tokens flag for adding special tokens (see the sketch after this list). Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/new_special_tokens
    c. New --cached_dataset flag for offline tokenization in CPT/SFT (see the sketch after this list). Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/export/cached_dataset
    d. Re-implemented the sequence-packing module for faster packing and better multimodal disk I/O.
    e. Qwen2.5-VL mixed-modality data (multiple modalities in a single sample) can now be trained with DeepSpeed.
    f. Multimodal training now supports loss-scaling.
    g. rope_scaling now accepts a dict; setting max_model_len will automatically adjust the rope_scaling factor (see the sketch after this list).
    h. Added DeepSpeed-AutoTP (not compatible with LoRA).
    i. Multimodal packing is compatible with transformers ≥ 4.53; sequence parallelism with transformers ≥ 4.52.
    j. With resume_only_model, data skipping is enabled by default; control via ignore_data_skip.
    k. MoE training supports router_aux_loss_coef.
    l. The template now applies a max_length truncation safeguard that never truncates image/video tokens.
    m. tuner_backend unsloth now supports MoE models, device_map, and DDP.
    n. Embedding training supports liger_kernel.
  4. RLHF
    a. Added MPO training. Script: https://github.com/modelscope/ms-swift/blob/main/examples/train/rlhf/mpo.sh
    b. Multimodal DPO now supports rejected image inputs: add a rejected_images column to the dataset (see the sketch after this list).
  5. Inference & Deployment
    a. Added inference and deployment for embedding models with the pt/vllm/sglang infer_backend options (see the deployment sketch after this list). Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/deploy/embedding
    b. InferEngine supports return_details to output prompt_token_ids and token_ids.
    c. vLLM back-end now supports more multimodal models: ovis2, glm4_1v, keye-vl, kimi-vl, glm4v, phi4-multimodal, llama4.
    d. vLLM arguments refactored: all are now prefixed with vllm_. The GRPO module reuses the same vLLM arguments.
  6. Export
    a. QLoRA now supports Merge-LoRA. Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/qlora
    b. Added FP8 / BNB quantization for MoE and multimodal models. Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/export/quantize
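
The command sketches below illustrate several of the items above. They are sketches only: flags and values not named in these notes (models, datasets, reward settings, and so on) are placeholders modelled on the repository's linked examples, not prescribed configurations. First, GSPO-style sequence-level importance sampling (item 1a) together with the new delete truncation strategy (item 1f):

```shell
# Minimal GRPO sketch; the model, dataset, and reward settings are placeholders.
# --importance_sampling_level sequence switches to GSPO-style sequence-level
# importance ratios; --truncation_strategy delete drops over-length samples and
# resamples them instead of truncating.
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset AI-MO/NuminaMath-TIR \
    --reward_funcs accuracy \
    --num_generations 8 \
    --max_completion_length 2048 \
    --importance_sampling_level sequence \
    --truncation_strategy delete
```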
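
For multi-node rollout in server mode (item 1b), the pattern is one rollout server per node, with the trainer given one host/port pair per server. Only the multi-value vllm_server_host/vllm_server_port behaviour is stated in these notes; the swift rollout command and the use_vllm/vllm_mode flags below are assumptions based on the existing server-mode examples, and the linked script is the authoritative reference.

```shell
# On each rollout node, start a rollout server (sketch; hosts are placeholders).
swift rollout --model Qwen/Qwen2.5-7B-Instruct

# On the training node, pass one host/port entry per rollout server.
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset AI-MO/NuminaMath-TIR \
    --use_vllm true \
    --vllm_mode server \
    --vllm_server_host 10.0.0.1 10.0.0.2 \
    --vllm_server_port 8000 8000
```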
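
For --new_special_tokens (item 3b), a sketch of adding domain-specific tokens during SFT; the token strings are made up, and passing them as space-separated values is an assumption, so check the linked examples for the exact argument format.

```shell
# Add new special tokens and fine-tune (sketch; tokens and dataset are placeholders).
swift sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset my_corpus.jsonl \
    --new_special_tokens '<obs_start>' '<obs_end>' \
    --train_type lora
```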
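
For --cached_dataset (item 3c), the intended flow is to tokenize once offline and then train from the cached result. Only --cached_dataset itself is named in these notes; the export step and its --to_cached_dataset flag are assumptions inferred from the examples/export/cached_dataset directory.

```shell
# Step 1 (assumed export flag): tokenize offline and cache the result to disk.
swift export \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset my_corpus.jsonl \
    --to_cached_dataset true \
    --output_dir ./cached_dataset

# Step 2: train from the cached dataset, skipping online tokenization.
swift sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --cached_dataset ./cached_dataset \
    --train_type lora
```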
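
For rope_scaling as a dict (item 3g), a sketch that passes a JSON dict on the command line; the keys follow the transformers rope_scaling convention and the values are illustrative only. Per item 3g, setting max_model_len can instead adjust the factor automatically.

```shell
# Extend the context window via a rope_scaling dict (sketch; values are illustrative).
swift sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset my_corpus.jsonl \
    --rope_scaling '{"type": "yarn", "factor": 4.0}' \
    --max_length 65536 \
    --train_type lora
```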
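
For rejected_images in multimodal DPO (item 4b), a sketch of one dataset row; only the rejected_images column is named in these notes, and the surrounding field names (messages, images, rejected_response) are assumptions based on the usual ms-swift DPO dataset format.

```shell
# One JSONL row for multimodal DPO with a rejected image (field names other than
# rejected_images are assumptions; verify against the custom-dataset docs).
cat > dpo_mm.jsonl <<'EOF'
{"messages": [{"role": "user", "content": "<image>Describe the image."}, {"role": "assistant", "content": "A cat sitting on a windowsill."}], "images": ["cat.jpg"], "rejected_response": "A dog running in a park.", "rejected_images": ["dog.jpg"]}
EOF
```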
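
For embedding-model deployment (item 5a), a deployment sketch that also shows the new vllm_ argument prefix (item 5d); --infer_backend vllm is stated above, while the model id and the task_type and vllm_gpu_memory_utilization flags are assumptions based on the linked deployment examples.

```shell
# Deploy an embedding model on the vLLM backend (sketch; note the vllm_ prefix
# on vLLM options after the argument refactor).
swift deploy \
    --model Qwen/Qwen3-Embedding-0.6B \
    --task_type embedding \
    --infer_backend vllm \
    --vllm_gpu_memory_utilization 0.9
```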

New Models

  1. Text-only
    a. Qwen/Qwen3-235B-A22B-[Instruct/Thinking]-2507, Qwen/Qwen3-Coder-480B-A35B-Instruct, and Qwen/Qwen3-4B-[Instruct/Thinking]-2507 (Megatron-SWIFT supported). Training script: #5033
    b. openai-mirror/gpt-oss-20b family. Best-practice: #5277
    c. ZhipuAI/GLM-4.5 family (Megatron-SWIFT supported). Training script: https://github.com/modelscope/ms-swift/blob/main/examples/train/megatron/lora/glm4_5_106b.sh
    d. Hunyuan-7B-Instruct family. Best-practice: #5236
    e. mistralai/Devstral-Small-2505
  2. Multimodal
    a. OpenBMB/MiniCPM-V-4. Training script: https://github.com/modelscope/ms-swift/blob/main/examples/models/minicpmv/train.sh

Full Changelog: v3.6.0...v3.7.0