Skip to content

Conversation

@jataylo
Copy link
Collaborator

@jataylo jataylo commented Dec 30, 2025

From pytorch#168073

It has been observed that in the case of heavy contended atomics poor performance is being achieved.

To solve this problem while minimizing kernel overhead this PR proposes an fx pass which will replace the index_put operation with an alternative scatter approach.

Algorithm:

  1. Enumerate scatter operations: operation_id = [0, 1, 2, ..., N-1]
  2. Assign to partitions: partition_id = operation_id % num_partitions
  3. Create expanded buffers along scatter_dim: size = num_partitions × dim_size
  4. Adjust indices: adjusted_idx = original_idx + (partition_id × dim_size)
  5. Perform partitioned scatter with reduced contention
  6. Reduce across partitions: sum(partitions, dim=scatter_dim)

This will reduce atomic contention at the cost of memory usage. In order to combat this we have built heuristics around the total number of partitions for the expanded buffer, as well as setting a cap on how large these expanded tensors can be (currently 10% of GPU memory)

Note the heuristic cannot be perfect as we do not know the true indices data at compile time, in real world models the indices will have duplicates and not be uniformly distributed which increases atomic contention, currently this cannot be modelled and we have to estimate contention based on input and output buffer sizes.

Benchmark code: https://gist.github.com/jataylo/dd3a6353ad2859efd65fa87b28aa3ebd
This code executes 3 index_add ops to 3 seperate buffers.
N = 1000000
D = 100
n = 501

values = float32 [N,D]
indices = int64 [N]
output = float32 [n, D]

For each run we modify the range of randint to simulate various levels of atomic contention

Gathered two sets of results, one with partitioned_scatter_enabled=True, the other partitioned_scatter_enabled=False

MI300

uniform_range no_compile_ms compile_ms (partitioned_scatter_enabled=False) compile_ms (partitioned_scatter_enabled=True) speedup
0-0 85.52 28.50 3.55 8.03
0-1 46.99 15.66 2.47 6.33
0-3 25.16 8.31 2.20 3.78
0-7 12.92 4.32 1.63 2.66
0-15 6.66 4.24 1.60 2.66
0-31 3.43 3.19 1.33 2.40
0-63 1.79 1.62 1.32 1.23
0-127 1.76 1.59 1.24 1.28
0-255 1.73 1.32 1.24 1.07
0-500 1.61 1.27 1.23 1.04

H100

uniform_range no_compile_ms compile_ms   (partitioned_scatter_enabled=False) compile_ms   (partitioned_scatter_enabled=True)
0-0 19.842156 5.504691 0.756135
0-1 19.516249 5.526914 0.779779
0-3 10.450396 3.079321 0.702327
0-7 5.417728 1.906823 0.683553
0-15 3.545023 1.570733 0.655154
0-31 2.631531 1.223358 0.641384
0-63 2.629384 0.832651 0.63534
0-127 2.629735 0.726054 0.768287
0-255 2.62846 0.625181 1.317291
0-500 2.629526 0.548981 1.724292

We can see this could potentially benefit H100 on worst-case examples but would degrade perf in the best case, the atomic add cost on MI300 is heavier meaning this is more beneficial.

On MI300 we can see a mixed bag of e2e model improvements
https://hud.pytorch.org/benchmark/v3/dashboard/compiler_inductor?renderGroupId=main&time.start=2025-11-05T00%3A00%3A00.000Z&time.end=2025-12-04T02%3A00%3A00.000Z&filters.repo=pytorch%2Fpytorch&filters.benchmarkName=compiler&filters.mode=training&filters.dtype=amp&filters.deviceName=rocm+%28mi300x%29&filters.device=rocm&filters.suite=all&filters.compiler=default&lcommit.commit=38c42c575d342a7ea6f4a555bf845071e03b5f35&lcommit.workflow_id=19635538449&lcommit.date=2025-11-24T14%3A00%3A00Z&lcommit.branch=refs%2Ftags%2Fciflow%2Finductor-perf-test-nightly-rocm-mi300%2F168073&rcommit.commit=fedb7f15d177a259bf25c94e888137e0a9a69a81&rcommit.workflow_id=19856622912&rcommit.date=2025-12-02T12%3A00%3A00Z&rcommit.branch=refs%2Ftags%2Fciflow%2Finductor-perf-test-nightly-rocm-mi300%2F168073&lbranch=refs%2Ftags%2Fciflow%2Finductor-perf-test-nightly-rocm-mi300%2F168073&rbranch=refs%2Ftags%2Fciflow%2Finductor-perf-test-nightly-rocm-mi300%2F168073&maxSampling=110

Due to mixed-bag of results we will initially enable this as non default feature but testing passed CI with this enabled here
https://hud.pytorch.org/pytorch/pytorch/pull/168073?sha=fedb7f15d177a259bf25c94e888137e0a9a69a81

Note there are improvements to make after this lands:

  1. Add dynamic shape support, needs to be conservative here to not explode memory usage.
  2. Update IR and codegen directly to avoid iota op and needing to update indices via torch ops, we can likely do this in store codegen itself.
  3. Develop new implementations for memory constrained environments

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @dllehr-amd @chenyang78

@jataylo jataylo changed the title Squashed commit: Add partitioned scatter approach with optimizations … [release/2.10] [Upstream cherry-pick] Add partitioned scatter approach with optimizations Dec 30, 2025
@rocm-repo-management-api
Copy link

rocm-repo-management-api bot commented Dec 30, 2025

Jenkins build for 913cce10b0702def5f47d91a9f217f70ad1ae339 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant