Free almost all of your GPU for what matters: Maximum latent space processing
- Universal .safetensors Support: Native DisTorch2 distribution for all `.safetensors` models.
- Up to 10% Faster GGUF Inference: The new DisTorch2 logic provides potential speedups for GGUF models versus the DisTorch V1 method.
- Bespoke WanVideoWrapper Integration: Tightly integrated, stable support for WanVideoWrapper with eight bespoke MultiGPU nodes.
- New Model-Driven Allocation Options: Two new intuitive Expert Modes, `bytes` and `ratio`, facilitate exact model placement across all available devices.
What is DisTorch? Short for "distributed torch", the DisTorch nodes in this custom_node move the static parts of your main image-generation model (the UNet) off your main compute card and onto somewhere slower, so they no longer take up space that could be better used for longer videos or more concurrent images. By selecting one or more donor devices - main CPU DRAM or another cuda/xpu device's VRAM - you choose how much of the model is loaded on those devices instead of your main compute card. Just set how much VRAM you want to free up, and DisTorch handles the rest.
- Two Modes:
  - Normal Mode: The standard `virtual_vram_gb` slider continues to let you select one donor device (like your system's RAM) to offload to. The more virtual VRAM you add, the more of the model is pushed to the donor device. Simple and effective.
  - Expert Mode: For connoisseurs of performance, two Expert Modes, `bytes` and `ratio`, let you specify exactly how the model is split across all your available devices, alongside the legacy `fraction` method for exact per-device allocations. All three modes are driven by a single, flexible text string:
    - Bytes (Recommended): The most direct way to slice up your model. Inspired by Huggingface's `device_map`, you can specify the exact number of gigabytes or megabytes for each device. The wildcard `*` assigns the remainder of the model to a device, making it easy to offload. (The CPU acts as the default wildcard if none is specified.) A small parsing sketch follows this list.
      - Example: `cuda:0,2.5gb;cpu,*` will load the first 2.50GB of the model onto `cuda:0` and the rest onto the `cpu`.
      - Example: `cuda:0,500mb;cuda:1,3.0g;cpu,5gb*` will put 0.50GB on `cuda:0`, 3.00GB on `cuda:1`, and 5.00GB (or the remainder) on the `cpu`.
    - Ratio: Love the simplicity of `llama.cpp`'s `tensor_split`? This mode is for you. Specify a ratio to distribute the model across devices.
      - Example: `cuda:0,25%;cpu,75%` will split the model in a 1:3 ratio, loading 25% onto `cuda:0` and 75% onto the `cpu`.
      - Example: `cuda:0,8%;cuda:1,8%;cpu,4%` uses an 8:8:4 ratio, putting 40% of the model on `cuda:0`, 40% on `cuda:1`, and 20% on the `cpu`.
    - Fraction: The original DisTorch expert mode. This mode splits the model based on the fraction of each device's total VRAM to be used.
      - Example: `cuda:0,0.1;cpu,0.5` will use 10% of `cuda:0`'s VRAM and 50% of the `cpu`'s RAM to hold the model.
      - Example: `cuda:0,0.0207;cuda:1,0.1273;cpu,0.0808` will use 2.1% of `cuda:0`'s VRAM, 12.7% of `cuda:1`'s VRAM, and 8.1% of the `cpu`'s RAM to hold the model.
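To make the `bytes` syntax concrete, here is a minimal, illustrative Python sketch of how such an allocation string could be turned into per-device byte budgets. It is not the extension's actual implementation; the function name and parsing details are hypothetical, and only the string format and the CPU-as-default-wildcard behavior come from the description above.

```python
# Illustrative sketch only -- not ComfyUI-MultiGPU's actual code.
# Interprets a "bytes"-style expert allocation string such as
# "cuda:0,2.5gb;cpu,*" as per-device byte budgets.

UNITS = [("gb", 1024**3), ("mb", 1024**2), ("g", 1024**3), ("m", 1024**2)]

def parse_bytes_allocation(alloc: str, model_bytes: int) -> dict:
    """Map each device name to the number of model bytes it should hold."""
    budgets = {}
    wildcard_device = None
    for entry in alloc.split(";"):
        device, amount = (part.strip().lower() for part in entry.split(","))
        if amount == "*":
            wildcard_device = device
            continue
        for suffix, scale in UNITS:
            if amount.endswith(suffix):
                budgets[device] = int(float(amount[: -len(suffix)]) * scale)
                break
    # Whatever is left over goes to the wildcard device; per the description
    # above, the CPU is the default destination when no wildcard is given.
    remainder = model_bytes - sum(budgets.values())
    if remainder > 0:
        target = wildcard_device or "cpu"
        budgets[target] = budgets.get(target, 0) + remainder
    return budgets

# Example: a hypothetical 12 GB model split per "cuda:0,2.5gb;cpu,*".
print(parse_bytes_allocation("cuda:0,2.5gb;cpu,*", 12 * 1024**3))
# -> {'cuda:0': 2684354560, 'cpu': 10200547328}
```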
- Free up GPU VRAM instantly without complex settings
- Run larger models by offloading layers to system RAM or another device's VRAM
- Use all your main GPU's VRAM for actual compute / latent processing, or fill it just enough to suit your needs and keep the rest for quick-access model blocks
- Seamlessly distribute .safetensors and GGUF layers across multiple GPUs if available
- Easily shift from on-device speed to open-device latent-space capability with a simple one-number change
DisTorch nodes use one simple number to tune their Virtual VRAM to your needs.
Works with all .safetensors and GGUF-quantized models.
⚙️ Expert users: Just like .gguf or exl2/3 LLM loaders, use `expert_mode_allocation` for exact allocations of model shards on as many devices as your setup has!
The new Virtual VRAM even lets you offload ALL of the model and still run compute on your CUDA device!
Installation via ComfyUI-Manager is preferred. Simply search for ComfyUI-MultiGPU in the list of nodes and follow the installation instructions.
Alternatively, clone this repository inside `ComfyUI/custom_nodes/`.
The extension automatically creates MultiGPU versions of loader nodes. Each MultiGPU node has the same functionality as its original counterpart but adds a `device` parameter that allows you to specify the GPU to use.
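As a rough illustration of that pattern (not the extension's real code), a MultiGPU variant can inherit the original loader's interface and simply prepend a device selector. The helper name and device list in this sketch are hypothetical; `INPUT_TYPES` and `CATEGORY` are standard ComfyUI node conventions.

```python
# Hypothetical sketch -- not ComfyUI-MultiGPU's actual implementation.
# Shows the general idea: reuse an existing loader node's inputs and behavior,
# and add a required "device" dropdown in front of them.

def make_multigpu_version(original_cls, devices=("cpu", "cuda:0", "cuda:1")):
    class MultiGPUNode(original_cls):
        CATEGORY = "multigpu"  # MultiGPU nodes are grouped under this menu category

        @classmethod
        def INPUT_TYPES(cls):
            inputs = original_cls.INPUT_TYPES()
            # Keep every original input and add a device selector.
            inputs.setdefault("required", {})["device"] = (list(devices),)
            return inputs

    MultiGPUNode.__name__ = original_cls.__name__ + "MultiGPU"
    return MultiGPUNode
```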
Currently supported nodes (automatically detected if available):
- Standard ComfyUI model loaders:
- CheckpointLoaderSimpleMultiGPU/CheckpointLoaderSimpleDistorch2MultiGPU
- CLIPLoaderMultiGPU
- ControlNetLoaderMultiGPU
- DualCLIPLoaderMultiGPU
- TripleCLIPLoaderMultiGPU
- UNETLoaderMultiGPU/UNETLoaderDisTorch2MultiGPU
- VAELoaderMultiGPU
- WanVideoWrapper (requires ComfyUI-WanVideoWrapper):
- WanVideoModelLoaderMultiGPU & WanVideoModelLoaderMultiGPU_2
- WanVideoVAELoaderMultiGPU
- LoadWanVideoT5TextEncoderMultiGPU
- LoadWanVideoClipTextEncoderMultiGPU
- WanVideoTextEncodeMultiGPU
- WanVideoBlockSwapMultiGPU
- WanVideoSamplerMultiGPU
- GGUF loaders (requires ComfyUI-GGUF):
- UnetLoaderGGUFMultiGPU/UnetLoaderGGUFDisTorch2MultiGPU
- UnetLoaderGGUFAdvancedMultiGPU
- CLIPLoaderGGUFMultiGPU
- DualCLIPLoaderGGUFMultiGPU
- TripleCLIPLoaderGGUFMultiGPU
- XLabAI FLUX ControlNet (requires x-flux-comfy):
- LoadFluxControlNetMultiGPU
- Florence2 (requires ComfyUI-Florence2):
- Florence2ModelLoaderMultiGPU
- DownloadAndLoadFlorence2ModelMultiGPU
- LTX Video Custom Checkpoint Loader (requires ComfyUI-LTXVideo):
- LTXVLoaderMultiGPU
- NF4 Checkpoint Format Loader (requires ComfyUI_bitsandbytes_NF4):
- CheckpointLoaderNF4MultiGPU
- HunyuanVideoWrapper (requires ComfyUI-HunyuanVideoWrapper):
- HyVideoModelLoaderMultiGPU
- HyVideoVAELoaderMultiGPU
- DownloadAndLoadHyVideoTextEncoderMultiGPU
All MultiGPU nodes available for your install can be found in the "multigpu" category in the node menu.
All workflows have been tested on a 2x 3090 + 1060ti Linux setup, a 4070 Windows 11 setup, and a 3090/1070ti Linux setup.
- Default DisTorch2 Workflow
- FLUX.1-dev Example
- Hunyuan GGUF Example
- LTX Video Text-to-Video
- Qwen Image Basic Example
- WanVideo 2.2 Example
- FLUX.1-dev 2-GPU GGUF
- Hunyuan 2-GPU GGUF
- Hunyuan CPU+GPU GGUF
- Hunyuan GGUF DisTorch
- Hunyuan GGUF MultiGPU
If you encounter problems, please open an issue. Attach the workflow if possible.
Currently maintained by pollockjj. Originally created by Alexander Dzhoganov. With deepest thanks to City96.