
DiffSynth-Studio


Switch to Chinese

Introduction

Welcome to the magic world of Diffusion models! DiffSynth-Studio is an open-source Diffusion model engine developed and maintained by the ModelScope team. We aim to foster technical innovation through framework development, bring together the power of the open-source community, and explore the limits of generative models!

DiffSynth currently includes two open-source projects:

  • DiffSynth-Studio: focused on aggressive technical exploration, targeted at academia, and providing support for cutting-edge model capabilities.
  • DiffSynth-Engine: focused on stable model deployment, targeted at industry, and offering higher computational performance and more stable features.

Installation

Install from source (recommended):

git clone https://github.com/modelscope/DiffSynth-Studio.git  
cd DiffSynth-Studio
pip install -e .
Other installation methods

Install from PyPI (version updates may be delayed; for latest features, install from source)

pip install diffsynth

If you encounter problems during installation, they may be caused by upstream dependencies. Please check the documentation of the corresponding upstream packages.
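
A quick way to check the installation is to import the classes used in the examples below; this is a minimal sanity check and downloads no model weights:

# Minimal sanity check: these imports should succeed after installation.
from diffsynth import save_video
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig
from diffsynth.pipelines.wan_video_new import WanVideoPipeline

print("DiffSynth-Studio imports OK")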

Basic Framework

DiffSynth-Studio redesigns the inference and training pipelines for mainstream Diffusion models (including FLUX, Wan, etc.), enabling efficient memory management and flexible model training.

FLUX Series

Detail page: ./examples/flux/


Quick Start
import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig

# Load the FLUX.1-dev components: DiT weights, the two text encoders, and the VAE.
pipe = FluxImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors"),
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors"),
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/"),
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors"),
    ],
)

# Generate an image from a text prompt and save it.
image = pipe(prompt="a cat", seed=0)
image.save("image.jpg")
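
For low-VRAM inference, the same offloading pattern used in the Wan quick start below can be applied. The following is a sketch only, assuming FluxImagePipeline supports the offload_device option and enable_vram_management() in the same way as the Wan pipeline; see the Low VRAM Inference examples in ./examples/flux/ for the exact usage.

import torch
from diffsynth.pipelines.flux_image_new import FluxImagePipeline, ModelConfig

# Sketch of low-VRAM inference: keep weights on the CPU and page them in on demand.
# Assumes FluxImagePipeline supports the same offload_device / enable_vram_management()
# options shown in the Wan quick start below.
pipe = FluxImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", offload_device="cpu"),
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", offload_device="cpu"),
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/", offload_device="cpu"),
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", offload_device="cpu"),
    ],
)
pipe.enable_vram_management()  # assumed to mirror the Wan pipeline's API

image = pipe(prompt="a cat", seed=0)
image.save("image_low_vram.jpg")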
Model Overview

| Model ID | Extra Parameters | Inference | Low VRAM Inference | Full Training | Validate After Full Training | LoRA Training | Validate After LoRA Training |
|---|---|---|---|---|---|---|---|
| FLUX.1-dev | | code | code | code | code | code | code |
| FLUX.1-Kontext-dev | kontext_images | code | code | code | code | code | code |
| FLUX.1-dev-Controlnet-Inpainting-Beta | controlnet_inputs | code | code | code | code | code | code |
| FLUX.1-dev-Controlnet-Union-alpha | controlnet_inputs | code | code | code | code | code | code |
| FLUX.1-dev-Controlnet-Upscaler | controlnet_inputs | code | code | code | code | code | code |
| FLUX.1-dev-IP-Adapter | ipadapter_images, ipadapter_scale | code | code | code | code | code | code |
| FLUX.1-dev-InfiniteYou | infinityou_id_image, infinityou_guidance, controlnet_inputs | code | code | code | code | code | code |
| FLUX.1-dev-EliGen | eligen_entity_prompts, eligen_entity_masks, eligen_enable_on_negative, eligen_enable_inpaint | code | code | - | - | | |
| FLUX.1-dev-LoRA-Encoder | lora_encoder_inputs, lora_encoder_scale | code | code | code | code | - | - |
| Step1X-Edit | step1x_reference_image | code | code | code | code | code | code |
| FLEX.2-preview | flex_inpaint_image, flex_inpaint_mask, flex_control_image, flex_control_strength, flex_control_stop | code | code | code | code | code | code |
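
The Extra Parameters column lists the additional inputs each model variant accepts. As a hypothetical illustration only, assuming these are passed as keyword arguments to the pipeline call and that the FLUX.1-Kontext-dev weights have been loaded into pipe, an image-guided edit might look like the sketch below; the examples in ./examples/flux/ are authoritative.

from PIL import Image

# Hypothetical illustration: extra parameters from the table are assumed to be
# keyword arguments of the pipeline call. This reuses `pipe` from the quick start,
# loaded with FLUX.1-Kontext-dev weights; see ./examples/flux/ for exact usage.
reference = Image.open("reference.jpg")  # image to be edited / referenced
image = pipe(
    prompt="turn the cat into a tiger",
    kontext_images=reference,            # extra parameter for FLUX.1-Kontext-dev
    seed=0,
)
image.save("kontext_edit.jpg")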

Wan Series

Detail page: ./examples/wanvideo/

Quick Start
import torch
from diffsynth import save_video
from diffsynth.pipelines.wan_video_new import WanVideoPipeline, ModelConfig

# Load the Wan2.1-T2V-1.3B components (DiT, UMT5 text encoder, VAE);
# offload_device="cpu" keeps weights in system memory until they are needed.
pipe = WanVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="diffusion_pytorch_model*.safetensors", offload_device="cpu"),
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="models_t5_umt5-xxl-enc-bf16.pth", offload_device="cpu"),
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="Wan2.1_VAE.pth", offload_device="cpu"),
    ],
)
pipe.enable_vram_management()  # enable fine-grained VRAM management for low-memory inference

video = pipe(
    prompt="A documentary photography style scene: a lively puppy rapidly running on green grass. The puppy has brown-yellow fur, upright ears, and looks focused and joyful. Sunlight shines on its body, making the fur appear soft and shiny. The background is an open field with occasional wildflowers, and faint blue sky and clouds in the distance. Strong sense of perspective captures the motion of the puppy and the vitality of the surrounding grass. Mid-shot side-moving view.",
    negative_prompt="Bright colors, overexposed, static, blurry details, subtitles, style, artwork, image, still, overall gray, worst quality, low quality, JPEG compression artifacts, ugly, deformed, extra fingers, poorly drawn hands, poorly drawn face, malformed limbs, fused fingers, still frame, messy background, three legs, crowded background people, walking backwards",
    seed=0, tiled=True,
)
save_video(video, "video1.mp4", fps=15, quality=5)
Model Overview

| Model ID | Extra Parameters | Inference | Full Training | Validate After Full Training | LoRA Training | Validate After LoRA Training |
|---|---|---|---|---|---|---|
| Wan-AI/Wan2.1-T2V-1.3B | | code | code | code | code | code |
| Wan-AI/Wan2.1-T2V-14B | | code | code | code | code | code |
| Wan-AI/Wan2.1-I2V-14B-480P | input_image | code | code | code | code | code |
| Wan-AI/Wan2.1-I2V-14B-720P | input_image | code | code | code | code | code |
| Wan-AI/Wan2.1-FLF2V-14B-720P | input_image, end_image | code | code | code | code | code |
| PAI/Wan2.1-Fun-1.3B-InP | input_image, end_image | code | code | code | code | code |
| PAI/Wan2.1-Fun-1.3B-Control | control_video | code | code | code | code | code |
| PAI/Wan2.1-Fun-14B-InP | input_image, end_image | code | code | code | code | code |
| PAI/Wan2.1-Fun-14B-Control | control_video | code | code | code | code | code |
| PAI/Wan2.1-Fun-V1.1-1.3B-Control | control_video, reference_image | code | code | code | code | code |
| PAI/Wan2.1-Fun-V1.1-14B-Control | control_video, reference_image | code | code | code | code | code |
| PAI/Wan2.1-Fun-V1.1-1.3B-InP | input_image, end_image | code | code | code | code | code |
| PAI/Wan2.1-Fun-V1.1-14B-InP | input_image, end_image | code | code | code | code | code |
| PAI/Wan2.1-Fun-V1.1-1.3B-Control-Camera | control_camera_video, input_image | code | code | code | code | code |
| PAI/Wan2.1-Fun-V1.1-14B-Control-Camera | control_camera_video, input_image | code | code | code | code | code |
| iic/VACE-Wan2.1-1.3B-Preview | vace_control_video, vace_reference_image | code | code | code | code | code |
| Wan-AI/Wan2.1-VACE-1.3B | vace_control_video, vace_reference_image | code | code | code | code | code |
| Wan-AI/Wan2.1-VACE-14B | vace_control_video, vace_reference_image | code | code | code | code | code |
| DiffSynth-Studio/Wan2.1-1.3b-speedcontrol-v1 | motion_bucket_id | code | code | code | code | code |
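
As in the FLUX table, the extra parameters are assumed to be keyword arguments of the pipeline call. The sketch below illustrates image-to-video generation with input_image, assuming pipe was built as in the quick start but with the Wan-AI/Wan2.1-I2V-14B-480P checkpoints; see ./examples/wanvideo/ for the exact ModelConfig entries and parameter types.

from PIL import Image

# Hypothetical illustration: assumes `pipe` holds the Wan2.1-I2V-14B-480P weights
# (see the linked examples for the exact ModelConfig entries) and that extra
# parameters from the table are passed as keyword arguments.
first_frame = Image.open("first_frame.jpg")
video = pipe(
    prompt="a lively puppy running on green grass",
    input_image=first_frame,   # extra parameter for image-to-video models
    seed=0, tiled=True,
)
save_video(video, "video_i2v.mp4", fps=15, quality=5)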

More Models

Image Generation Models

Detail page: ./examples/image_synthesis/

Supported image generation models include FLUX, Stable Diffusion 3, Kolors, Hunyuan-DiT, Stable Diffusion, and Stable Diffusion XL.
Video Generation Models
Image Quality Assessment Models

We have integrated a series of image quality assessment models. These models can be used for evaluating image generation models, alignment training, and similar tasks.

Detail page: ./examples/image_quality_metric/

Innovative Achievements

DiffSynth-Studio is not just an engineering-oriented model framework, but also a platform for incubating innovative research results.

Nexus-Gen: Unified Architecture for Image Understanding, Generation, and Editing

ArtAug: Aesthetic Enhancement for Image Generation Models
Comparison: FLUX.1-dev vs. FLUX.1-dev + ArtAug LoRA
EliGen: Precise Image Region Control
Example: entity control mask and the corresponding generated image
ExVideo: Extended Training for Video Generation Models
Diffutoon: High-Resolution Anime-Style Video Rendering
DiffSynth: The Initial Version of This Project

Update History

  • July 11, 2025 🔥🔥🔥 We propose Nexus-Gen, a unified model that synergizes the language reasoning capabilities of LLMs with the image synthesis power of diffusion models. This framework enables seamless image understanding, generation, and editing tasks.

  • June 15, 2025 ModelScope's official evaluation framework, EvalScope, now supports text-to-image generation evaluation. Try it with the Best Practices guide.

  • March 31, 2025 We support InfiniteYou, an identity-preserving method for FLUX. Please refer to ./examples/InfiniteYou/ for more details.

  • March 25, 2025 Our new open-source project, DiffSynth-Engine, is now available! It focuses on stable model deployment, is geared toward industry, and offers better engineering support, higher computational performance, and more stable functionality.

  • March 13, 2025 We support HunyuanVideo-I2V, the image-to-video generation version of HunyuanVideo open-sourced by Tencent. Please refer to ./examples/HunyuanVideo/ for more details.

  • February 25, 2025 We support Wan-Video, a collection of SOTA video synthesis models open-sourced by Alibaba. See ./examples/wanvideo/.

  • February 17, 2025 We support StepVideo, a state-of-the-art video synthesis model! See ./examples/stepvideo.

  • December 31, 2024 We propose EliGen, a novel framework for precise entity-level controlled text-to-image generation, complemented by an inpainting fusion pipeline to extend its capabilities to image inpainting tasks. EliGen seamlessly integrates with existing community models, such as IP-Adapter and In-Context LoRA, enhancing its versatility. For more details, see ./examples/EntityControl.

  • December 19, 2024 We implement advanced VRAM management for HunyuanVideo, making it possible to generate videos at a resolution of 129x720x1280 using 24GB of VRAM, or at 129x512x384 resolution with just 6GB of VRAM. Please refer to ./examples/HunyuanVideo/ for more details.

  • December 18, 2024 We propose ArtAug, an approach designed to improve text-to-image synthesis models through synthesis-understanding interactions. We have trained an ArtAug enhancement module for FLUX.1-dev in the format of LoRA. This model integrates the aesthetic understanding of Qwen2-VL-72B into FLUX.1-dev, leading to an improvement in the quality of generated images.

  • October 25, 2024 We provide extensive FLUX ControlNet support. This project supports many different ControlNet models that can be freely combined, even if their structures differ. Additionally, ControlNet models are compatible with high-resolution refinement and partition control techniques, enabling very powerful controllable image generation. See ./examples/ControlNet/.

  • October 8, 2024. We release the extended LoRA based on CogVideoX-5B and ExVideo. You can download this model from ModelScope or HuggingFace.

  • August 22, 2024. CogVideoX-5B is supported in this project. See here. We provide several interesting features for this text-to-video model, including

    • Text to video
    • Video editing
    • Self-upscaling
    • Video interpolation
  • August 22, 2024. We have implemented an interesting painter that supports all text-to-image models. Now you can create stunning images using the painter, with assistance from AI!

  • August 21, 2024. FLUX is supported in DiffSynth-Studio.

    • Enable CFG and highres-fix to improve visual quality. See here
    • LoRA, ControlNet, and additional models will be available soon.
  • June 21, 2024. We propose ExVideo, a post-tuning technique aimed at enhancing the capability of video generation models. We have extended Stable Video Diffusion to achieve the generation of long videos up to 128 frames.

  • June 13, 2024. DiffSynth-Studio has been transferred to ModelScope. The developers have transitioned from "I" to "we". Of course, I will still participate in development and maintenance.

  • Jan 29, 2024. We propose Diffutoon, a fantastic solution for toon shading.

    • Project Page
    • The source codes are released in this project.
    • The technical report (IJCAI 2024) is released on arXiv.
  • Dec 8, 2023. We decided to develop a new project aimed at unlocking the potential of diffusion models, especially in video synthesis. The development of this project has started.

  • Nov 15, 2023. We propose FastBlend, a powerful video deflickering algorithm.

  • Oct 1, 2023. We release an early version of this project, namely FastSDXL, a first attempt at building a diffusion engine.

    • The source codes are released on GitHub.
    • FastSDXL includes a trainable OLSS scheduler for efficiency improvement.
      • The original repo of OLSS is here.
      • The technical report (CIKM 2023) is released on arXiv.
      • A demo video is shown on Bilibili.
      • Since OLSS requires additional training, we don't implement it in this project.
  • Aug 29, 2023. We propose DiffSynth, a video synthesis framework.
