[release/2.10] [Upstream cherry-pick] Add partitioned scatter approach with optimizations #2894

jataylo · 2025-12-30T09:55:04Z

It has been observed that heavily contended atomics lead to poor performance.

To solve this problem while minimizing kernel overhead, this PR proposes an fx pass that replaces the index_put operation with an alternative scatter approach.

Algorithm:

1. Enumerate scatter operations: operation_id = [0, 1, 2, ..., N-1]
2. Assign to partitions: partition_id = operation_id % num_partitions
3. Create expanded buffers along scatter_dim: size = num_partitions × dim_size
4. Adjust indices: adjusted_idx = original_idx + (partition_id × dim_size)
5. Perform partitioned scatter with reduced contention
6. Reduce across partitions: sum(partitions, dim=scatter_dim)
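
The steps above can be sketched in plain Python for a 1-D scatter-add. This is only an illustration of the index arithmetic, not the fx pass itself: the function name `partitioned_scatter_add` and its arguments are hypothetical, and a list stands in for the GPU buffer.

```python
def partitioned_scatter_add(dim_size, indices, values, num_partitions):
    """Scatter-add `values` into a buffer of length `dim_size` via partitions.

    Hypothetical helper for illustration only; the real pass rewrites
    index_put in the fx graph rather than looping in Python.
    """
    # Enumerate the scatter operations and assign each one to a partition.
    partition_ids = [op_id % num_partitions for op_id in range(len(indices))]

    # Create the expanded buffer: num_partitions copies of the output.
    expanded = [0.0] * (num_partitions * dim_size)

    # Adjust indices so each operation lands in its own partition's slice.
    adjusted = [idx + pid * dim_size for idx, pid in zip(indices, partition_ids)]

    # Partitioned scatter: writes to the same original index are now spread
    # over num_partitions distinct slots, which is what reduces contention.
    for adj, val in zip(adjusted, values):
        expanded[adj] += val

    # Reduce across partitions to recover the final output.
    return [
        sum(expanded[p * dim_size + i] for p in range(num_partitions))
        for i in range(dim_size)
    ]


# Three of the four updates target index 0; with 2 partitions they are split
# across slots 0 and 2 of the expanded buffer before the final reduction.
result = partitioned_scatter_add(
    dim_size=2, indices=[0, 0, 0, 1], values=[1.0, 2.0, 3.0, 4.0], num_partitions=2
)
```

With `num_partitions=1` the function degenerates to an ordinary scatter-add, so the result should be the same for any partition count.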

This reduces atomic contention at the cost of memory usage. To combat this, we have built heuristics around the total number of partitions for the expanded buffer and set a cap on how large these expanded tensors can be (currently 10% of GPU memory).
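
As a rough sketch of how such a memory cap could interact with the partition count: the function below halves a desired partition count until the expanded buffer fits within a fraction of device memory. The name `cap_num_partitions`, the halving strategy, and all parameters are assumptions made for illustration; only the 10% figure comes from this PR, and the actual heuristic may differ.

```python
def cap_num_partitions(desired, output_numel, element_size,
                       total_memory_bytes, max_fraction=0.10):
    """Shrink `desired` until the expanded buffer fits the memory budget.

    Hypothetical helper: the halving strategy is an assumption, not the
    heuristic implemented in this PR.
    """
    budget = total_memory_bytes * max_fraction
    num_partitions = desired
    # The expanded buffer holds num_partitions copies of the output tensor.
    while num_partitions > 1 and num_partitions * output_numel * element_size > budget:
        num_partitions //= 2
    return num_partitions


# float32 output of shape [501, 100], as in the benchmark in this PR.
output_numel = 501 * 100
# Made-up 10 MB device, so the 10% budget is 1 MB and 8 partitions do not fit.
capped = cap_num_partitions(8, output_numel, 4, 10_000_000)
```

Returning 1 corresponds to falling back to a single, unpartitioned buffer.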

Note that the heuristic cannot be perfect, as we do not know the true index data at compile time. In real-world models the indices will contain duplicates and will not be uniformly distributed, which increases atomic contention. Currently this cannot be modelled, so we have to estimate contention based on input and output buffer sizes.

Benchmark code: https://gist.github.com/jataylo/dd3a6353ad2859efd65fa87b28aa3ebd
This code executes 3 index_add ops to 3 separate buffers.
N = 1000000
D = 100
n = 501

values = float32 [N,D]
indices = int64 [N]
output = float32 [n, D]

For each run we modify the range of randint to simulate various levels of atomic contention.
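
To give a feel for what each range means, the arithmetic below estimates how many of the N updates land on each output row when indices are drawn uniformly. It assumes the `uniform_range` labels used in the results are inclusive on both ends (so `0-500` covers all n = 501 rows); the helper name is hypothetical.

```python
N = 1_000_000  # number of scatter updates in the benchmark


def expected_writes_per_row(num_updates, upper):
    """Average updates per targeted row for uniform indices in [0, upper]."""
    return num_updates / (upper + 1)


# 0-0 sends every update to a single row (worst-case contention), while
# 0-500 spreads the updates across all 501 output rows.
worst_case = expected_writes_per_row(N, 0)
best_case = expected_writes_per_row(N, 500)
```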

We gathered two sets of results: one with partitioned_scatter_enabled=True, the other with partitioned_scatter_enabled=False.

MI300

uniform_range	no_compile_ms	compile_ms (partitioned_scatter_enabled=False)	compile_ms (partitioned_scatter_enabled=True)	speedup
0-0	85.52	28.50	3.55	8.03
0-1	46.99	15.66	2.47	6.33
0-3	25.16	8.31	2.20	3.78
0-7	12.92	4.32	1.63	2.66
0-15	6.66	4.24	1.60	2.66
0-31	3.43	3.19	1.33	2.40
0-63	1.79	1.62	1.32	1.23
0-127	1.76	1.59	1.24	1.28
0-255	1.73	1.32	1.24	1.07
0-500	1.61	1.27	1.23	1.04

H100

uniform_range	no_compile_ms	compile_ms (partitioned_scatter_enabled=False)	compile_ms (partitioned_scatter_enabled=True)
0-0	19.842156	5.504691	0.756135
0-1	19.516249	5.526914	0.779779
0-3	10.450396	3.079321	0.702327
0-7	5.417728	1.906823	0.683553
0-15	3.545023	1.570733	0.655154
0-31	2.631531	1.223358	0.641384
0-63	2.629384	0.832651	0.63534
0-127	2.629735	0.726054	0.768287
0-255	2.62846	0.625181	1.317291
0-500	2.629526	0.548981	1.724292

On H100 this helps in the high-contention cases (ranges up to 0-63) but degrades performance in the low-contention cases (0-127 and wider). The atomic add cost on MI300 is heavier, so the approach is more beneficial there.

On MI300 we see a mixed bag of e2e model improvements:
https://hud.pytorch.org/benchmark/v3/dashboard/compiler_inductor?renderGroupId=main&time.start=2025-11-05T00%3A00%3A00.000Z&time.end=2025-12-04T02%3A00%3A00.000Z&filters.repo=pytorch%2Fpytorch&filters.benchmarkName=compiler&filters.mode=training&filters.dtype=amp&filters.deviceName=rocm+%28mi300x%29&filters.device=rocm&filters.suite=all&filters.compiler=default&lcommit.commit=38c42c575d342a7ea6f4a555bf845071e03b5f35&lcommit.workflow_id=19635538449&lcommit.date=2025-11-24T14%3A00%3A00Z&lcommit.branch=refs%2Ftags%2Fciflow%2Finductor-perf-test-nightly-rocm-mi300%2F168073&rcommit.commit=fedb7f15d177a259bf25c94e888137e0a9a69a81&rcommit.workflow_id=19856622912&rcommit.date=2025-12-02T12%3A00%3A00Z&rcommit.branch=refs%2Ftags%2Fciflow%2Finductor-perf-test-nightly-rocm-mi300%2F168073&lbranch=refs%2Ftags%2Fciflow%2Finductor-perf-test-nightly-rocm-mi300%2F168073&rbranch=refs%2Ftags%2Fciflow%2Finductor-perf-test-nightly-rocm-mi300%2F168073&maxSampling=110

Due to the mixed bag of results, we will initially enable this as a non-default feature, but CI passed with it enabled here:
https://hud.pytorch.org/pytorch/pytorch/pull/168073?sha=fedb7f15d177a259bf25c94e888137e0a9a69a81

Note there are improvements to make after this lands:

Add dynamic shape support; this needs to be conservative so that memory usage does not explode.
Update IR and codegen directly to avoid the iota op and the need to update indices via torch ops; we can likely do this in store codegen itself.
Develop new implementations for memory-constrained environments.

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @dllehr-amd @chenyang78

…and linting fixes (cherry picked from commit 4cb344f)

rocm-repo-management-api · 2025-12-30T10:07:30Z

Jenkins build for 913cce10b0702def5f47d91a9f217f70ad1ae339 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

Squashed commit: Add partitioned scatter approach with optimizations …

913cce1

…and linting fixes (cherry picked from commit 4cb344f)

jataylo changed the title ~~Squashed commit: Add partitioned scatter approach with optimizations …~~ [release/2.10] [Upstream cherry-pick] Add partitioned scatter approach with optimizations Dec 30, 2025
