
Reddit DisTorch2 Article


Hello again, ComfyUI community! This is the maintainer of the ComfyUI-MultiGPU custom_node, back with another update.

About seven months ago, I shared the first iteration of DisTorch (Distributed Torch), a method for taking GGUF-quantized UNets (like FLUX or Wan Video) and spreading their GGML layers across multiple devices (secondary GPUs, system RAM) to free up your main compute device. This static mapping of tensors is an alternative to Comfy's internal --lowvram solution: tensors are assigned to devices once, in a "MultiGPU aware" fashion, allowing both DRAM and other cards' VRAM to act as donors. I appreciate all the feedback on the .gguf version and believe it has helped many of you achieve the lowest VRAM footprint possible for your workflows.

But if you're anything like me, you immediately started thinking, "Okay, that works for .gguf. . . what about everything else?"

I'm excited to announce that this release moves beyond city96's .gguf loaders. Enter DisTorch 2.0. This update expands the memory management toolset for Core loaders in ComfyUI - making them MultiGPU aware as before, but now additionally offering powerful new static model allocation tools for both high-end multi-GPU rigs and those struggling with low-VRAM setups.

There’s an article ahead detailing the new features, but for those of you eager to jump in:

TL;DR?

DisTorch 2.0 is here, and the biggest news is Universal .safetensor Support. You can now split any standard, Comfy-loader-supported FP16/BF16/FP8 .safetensor model across your devices, just like ComfyUI-MultiGPU did before with GGUFs. This isn't model-specific; it’s universal support for Comfy Core loaders. Furthermore, I took what I learned while optimizing the .gguf analysis code, and the underlying logic for all models now uses that optimized core, offering up to 10% faster GGUF inference for offloaded models compared to DisTorch V1. I’ve also introduced new, intuitive Expert Allocation Modes ('bytes' and 'ratio') inspired by HuggingFace and llama.cpp, and added bespoke integration for WanVideoWrapper, allowing you, among other things, to block swap to other VRAM in your system. The goal for this custom_node remains the same: stop using your expensive compute card for model storage and unleash it on as much latent space as it can handle. Have fun!

What’s New in V2?

The core concept remains the same: move the static parts of the UNet off your main card so you can use that precious VRAM for computation. However, we've implemented four key advancements.

1. Universal .safetensors Support (The Big One)

The biggest limitation of the previous DisTorch release was its reliance on the GGUF format. While GGUF is fantastic, the vast majority of models we use daily are standard .safetensors.

DisTorch 2.0 changes that.

Why does this matter? Previously, if you wanted to run a 25GB FP16 model on a 24GB card (looking at you, 3090 owners trying to run full-quality Hunyuan Video or FLUX.1-dev), you had to use quantization or rely on ComfyUI’s standard --lowvram mode. Let me put in a plug here for comfyanon and the excellent code that team has implemented for low-VRAM folks: I don't see the DisTorch2 method replacing --lowvram for the many users who rely on it and see great results. That said, --lowvram is a dynamic method, meaning that depending on what else is going on in your ComfyUI session, more or less of the model may be shuffling between DRAM and VRAM. In cases where LoRAs interact with lower-precision models (i.e. .fp8), I have personally seen inconsistent results with LoRA application (due to how --lowvram stores the patched layers back in .fp8 precision on CPU for a .fp8 base model).

The solution to the potentially non-deterministic nature of --lowvram mode that I offer in ComfyUI-MultiGPU is to follow the Load-Patch-Distribute (LPD) method. In short (a code sketch follows the list):

  1. Load each new tensor for the first time on the compute device,
  2. Patch the tensor with all applicable LoRA patches on compute,
  3. Distribute that new FP16 tensor to either another VRAM device or CPU at the FP16 level.
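
To make that concrete, here is a minimal Python sketch of the per-tensor flow, assuming a plain dictionary of tensors and a hypothetical lora_patches mapping of additive FP16 deltas; the actual DisTorch2 implementation works through ComfyUI's model-patching machinery rather than raw state dicts.

```python
import torch

def load_patch_distribute(state_dict, lora_patches, compute_device="cuda:0",
                          donor_device="cpu"):
    """Hypothetical sketch of Load-Patch-Distribute (LPD) over raw tensors."""
    distributed = {}
    for name, tensor in state_dict.items():
        # 1. Load: materialize the tensor on the compute device first.
        t = tensor.to(compute_device, dtype=torch.float16)

        # 2. Patch: apply any LoRA deltas for this tensor on compute, at FP16.
        for delta in lora_patches.get(name, []):
            t = t + delta.to(compute_device, dtype=torch.float16)

        # 3. Distribute: park the patched FP16 tensor on its donor device.
        distributed[name] = t.to(donor_device)
    return distributed
```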

This new method, implemented as DisTorch2, allows you to use the new CheckpointLoaderSimpleDistorch2MultiGPU or UNETLoaderDisTorch2MultiGPU nodes to load any standard checkpoint and distribute its layers. You can take that 25GB .safetensor file and say, "Put 5GB on my main GPU, and the remaining 20GB in system RAM, and patch these LoRAs." It loads, and it just works.

(ComfyUI is well-written code, and when expanding DisTorch to .safetensors in Comfy Core, it was mostly a matter of figuring out how to work with or for Comfy's core tools instead of against or outside of them. Failing to do so usually resulted in something too janky to move forward with, even when it technically worked. I am happy to say that I believe I've found the best, most stable way to offer static model sharding, and I am excited for all of you to try it out.)

2. Faster GGUF Inference

While implementing the .safetensor support, I refactored the core DisTorch logic. This new implementation (DisTorch2) isn't just more flexible; it’s faster. When using the new GGUF DisTorch2 nodes, my own n=1 testing showed improvements up to 10% in inference speed compared to the legacy DisTorch V1 nodes. If you were already using DisTorch for GGUFs, this update should give you a nice little boost.

3. New Model-Driven Allocation (Expert Modes Evolved)

The original DisTorch used a "fraction" method in expert mode, where you specified what fraction of your device's VRAM to use. This was functional but often unintuitive.

DisTorch 2.0 introduces two new, model-centric Expert Modes: bytes and ratio. These let you define how the model itself is split, regardless of the hardware it's running on.

Bytes Mode (Recommended)

Inspired by Huggingface's device_map, this is the most direct way to slice up your model. You specify the exact amount (in GB or MB) to load onto each device.

  • Example: cuda:0,2.5gb;cpu,*
  • This loads the first 2.50GB of the model onto cuda:0 and the remainder (* wildcard) onto the cpu.
  • Example: cuda:0,500mb;cuda:1,3.0g;cpu,*
  • This puts 0.50GB on cuda:0, 3.00GB on cuda:1, and the rest on cpu.

Ratio Mode

If you've used llama.cpp's tensor_split, this will feel familiar. You distribute the model based on a ratio.

  • Example: cuda:0,25%;cpu,75%
  • A 1:3 split. 25% of the model layers on cuda:0, 75% on cpu.

These new modes give you the granular control needed to balance the trade-off between keeping layers on-device for speed and freeing up your compute device for as much latent space as possible.
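
As a rough illustration of how such allocation strings might break down, here is a sketch parser based only on the syntax shown above; it is not the node's actual parser, just an assumption-laden example.

```python
def parse_allocation(spec: str, model_bytes: int) -> dict[str, int]:
    """Sketch parser for strings like 'cuda:0,2.5gb;cpu,*' or 'cuda:0,25%;cpu,75%'.
    Hypothetical: the real DisTorch2 nodes may tokenize and validate differently."""
    units = {"gb": 1024**3, "g": 1024**3, "mb": 1024**2, "m": 1024**2}
    budgets, wildcard = {}, None

    for entry in spec.split(";"):
        device, amount = (part.strip().lower() for part in entry.split(","))
        if amount == "*":                      # wildcard: takes whatever is left over
            wildcard = device
        elif amount.endswith("%"):             # ratio mode: percentage of the model
            budgets[device] = int(model_bytes * float(amount[:-1]) / 100)
        else:                                  # bytes mode: explicit size with a unit
            for suffix, scale in units.items():
                if amount.endswith(suffix):
                    budgets[device] = int(float(amount[:-len(suffix)]) * scale)
                    break

    if wildcard is not None:                   # remainder of the model goes to the wildcard device
        budgets[wildcard] = max(model_bytes - sum(budgets.values()), 0)
    return budgets

# parse_allocation("cuda:0,2.5gb;cpu,*", model_bytes=25 * 1024**3)
# -> {'cuda:0': 2684354560, 'cpu': 24159191040}
```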

4. Bespoke WanVideoWrapper Integration

The WanVideoWrapper nodes by kijai are excellent, offering specific optimizations and memory management. Ensuring MultiGPU plays nicely with these specialized wrappers is always a priority. In this release, we've added eight bespoke MultiGPU nodes specifically for WanVideoWrapper, ensuring tight integration and stability when distributing those heavy video models; the most significant of them lets kijai's native block swapping target other VRAM devices in your system.

The Goal: Maximum Latent Space for Everyone


The core philosophy behind ComfyUI-MultiGPU remains the same: Use the entirety of your compute card for latent processing.

This update is designed to help two distinct groups of users:

1. The Low-VRAM Community

If you're struggling with OOM errors on an older or smaller card, DisTorch 2.0 lets you push almost the entire model off your main device.

Yes, there is a speed penalty when transferring layers from system RAM—there's no free lunch. But this trade-off is about capability. It allows you to generate images or videos at resolutions or batch sizes that were previously impossible. You can even go all the way down to a "Zero-Load" configuration.


The new Virtual VRAM even lets you offload ALL of the model and still run compute on your CUDA device!

2. The Multi-GPU Power Users

If you have multiple GPUs, the new expert modes allow you to treat your secondary cards as high-speed attached storage. By using bytes mode, you can fine-tune the distribution to maximize the throughput of your PCIe bus or NVLink, ensuring your main compute device is never waiting for the next layer, while still freeing up gigabytes of VRAM for massive video generations or huge parallel batches.
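
For example, on a dual-GPU rig you might try something along the lines of cuda:0,4gb;cuda:1,12gb;cpu,* in bytes mode (the numbers here are purely illustrative, not a recommendation): keep a few gigabytes of layers on the compute card, park the bulk of the model on the second card, and let system RAM catch whatever is left.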

Conclusion and Call for Testing

With native .safetensor splitting, faster GGUF processing, and granular allocation controls, I hope DisTorch 2.0 represents a significant step forward in managing large diffusion models in ComfyUI.

While I've tested this extensively on my own setups (Linux and Win11, mixed GPU configurations), ComfyUI runs on a massive variety of hardware, from potato:0 to Threadripper systems. I encourage everyone to update the custom_node, try out the new DisTorch2 loaders (look for DisTorch2 in the name), and experiment with the new allocation modes.

Please continue to provide feedback and report issues on the GitHub repository. Let's see what you can generate!