GEM: A Deep-Learning Workload Optimizer

Overview

Deep learning workloads are dominated by matrix multiplications followed by element-wise operations (bias addition, activations). Standard libraries like rocBLAS and cuBLAS optimize GEMM independently but don't fuse subsequent operations, resulting in:

  • Multiple kernel launches (overhead)
  • Intermediate results written to global memory (300+ cycles latency)
  • Wasted memory bandwidth reading those intermediates back

The Solution: Fuse GEMM + Bias + ReLU into a single Triton kernel, keeping intermediate results in registers/shared memory (<20 cycles latency) and eliminating redundant memory traffic.
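Below is a minimal sketch of what such a fused kernel can look like in Triton. It is illustrative rather than the repository's exact implementation: the kernel and launcher names, masking details, and float32 assumption are mine, and the tile size mirrors the BLOCK_M = BLOCK_N = BLOCK_K = 64 configuration listed in the Setup section.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def fused_gemm_bias_relu_kernel(
    a_ptr, b_ptr, bias_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # Each program instance computes one BLOCK_M x BLOCK_N tile of C.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    # The accumulator stays in registers for the entire K loop.
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a_mask = (offs_m[:, None] < M) & ((offs_k[None, :] + k) < K)
        b_mask = ((offs_k[:, None] + k) < K) & (offs_n[None, :] < N)
        a = tl.load(a_ptrs, mask=a_mask, other=0.0)
        b = tl.load(b_ptrs, mask=b_mask, other=0.0)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk

    # Fused epilogue: bias add + ReLU on the register-resident tile, so the
    # matmul intermediate never makes a round trip through global memory.
    bias = tl.load(bias_ptr + offs_n, mask=offs_n < N, other=0.0)
    acc = tl.maximum(acc + bias[None, :], 0.0)

    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    c_mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(c_ptrs, acc, mask=c_mask)


def fused_gemm_bias_relu(a, b, bias, block=64):
    # Assumes float32, 2-D row-major inputs; one kernel launch covers the whole grid.
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = (triton.cdiv(M, block), triton.cdiv(N, block))
    fused_gemm_bias_relu_kernel[grid](
        a, b, bias, c, M, N, K,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_M=block, BLOCK_N=block, BLOCK_K=block,
    )
    return c
```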

Running Triton on my specific GPU (7900 GRE) posed unique compatibility challenges, which I resolved through hours of system-level tuning and compatibility fixes (noted in the Setup section further down).


Kernels Benchmarked

Two versions of the workload were benchmarked:

  1. GEMM + Bias + ReLU (PyTorch native, unfused)
  2. Fused GEMM + Bias + ReLU (single Triton kernel)

Each was tested across three matrix shapes:

  • (M, K, N) = (64, 256, 128)
  • (M, K, N) = (128, 256, 256)
  • (M, K, N) = (1024, 1024, 1024)

Motivation

Deep learning workloads often involve repeated GEMM operations followed by bias addition and an activation (ReLU). Standard libraries like rocBLAS optimize the GEMM itself but don't fuse these layers, leading to extra memory traffic. Think of it this way: after multiplying A and B, the program keeps referring back to the result matrix C. That matrix is normally written to global memory, which can take hundreds of cycles per access. What if we could instead keep C in registers or shared memory, where the program can reach it in roughly 20 cycles or less? (The register file alone isn't big enough to hold everything, so shared memory is part of the picture too.)
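For a concrete picture of the unfused path, here is a hypothetical PyTorch-native baseline (shapes and names are illustrative). Each step launches its own kernel, and the intermediate C bounces through global memory between them:

```python
import torch

M, K, N = 1024, 1024, 1024
A = torch.randn(M, K, device="cuda")      # on ROCm, PyTorch exposes the GPU as "cuda"
B = torch.randn(K, N, device="cuda")
bias = torch.randn(N, device="cuda")

C = A @ B            # kernel 1: GEMM writes the M x N intermediate to global memory
C = C + bias         # kernel 2: reads it back, adds the bias, writes it out again
C = torch.relu(C)    # kernel 3: one more read/write round trip for the activation
```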

Triton allows building custom GPU kernels with fine-grained control over memory layout and fusion, unlocking the potential to reduce memory bandwidth consumption by keeping intermediate results in registers or shared memory.


Setup

  • Hardware: AMD Radeon RX 7900 GRE
  • OS: Ubuntu 22.04.5 LTS, kernel 6.8.x (important for 7900 GRE compatibility)
  • Software: ROCm 6.0.2, Triton, PyTorch, Python 3.10 (important for 7900 GRE compatibility; Secure Boot must also be disabled)
  • Tile size: BLOCK_M = BLOCK_N = BLOCK_K = 64
  • Metric: execution time (ms), averaged over ~100 runs per kernel
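The timing helper below is a sketch of that methodology, not the repository's script: warm up first, then average GPU-event time over ~100 runs (triton.testing.do_bench provides similar functionality out of the box):

```python
import torch

def time_kernel_ms(fn, iters=100, warmup=10):
    """Average execution time of fn() in milliseconds over `iters` runs."""
    for _ in range(warmup):              # warm-up covers JIT compilation and caching
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()             # wait for all queued kernels to finish
    return start.elapsed_time(end) / iters
```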


Results

Benchmark: M = 64, K = 256, N = 128

  • GEMM + Bias + ReLU: 0.165 ms
  • FUSED GEMM + Bias + ReLU: 0.101 ms

Benchmark: M = 128, K = 256, N = 256

  • GEMM + Bias + ReLU: 0.197 ms
  • FUSED GEMM + Bias + ReLU: 0.131 ms

Benchmark: M = 1024, K = 1024, N = 1024

  • GEMM + Bias + ReLU: 0.460 ms
  • FUSED GEMM + Bias + ReLU: 0.327 ms

Analysis

On average, fusing Bias + ReLU into the GEMM kernel reduced execution time by 33.7%:

Matrix Size (M × K × N)     Time Reduction
64 × 256 × 128              38.8%
128 × 256 × 256             33.5%
1024 × 1024 × 1024          28.9%
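Each percentage is the reduction in execution time, (t_unfused - t_fused) / t_unfused; for example, the 1024 × 1024 × 1024 case gives (0.460 - 0.327) / 0.460 ≈ 28.9%.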

Takeaways & Tradeoffs

  • Operator fusion helps reduce global memory traffic.
  • The GPU memory hierarchy closely mirrors the CPU cache hierarchy; a key difference is that GPUs have a much larger register file.
  • Fusion is promising, but the relative gains may shrink on much larger workloads (dimensions in the millions), where the GEMM itself dominates and the saved kernel-launch overhead and memory traffic matter proportionally less.

Conclusion

This experiment demonstrates that Triton operator fusion can yield tangible performance gains, especially in inference-heavy pipelines with small batch sizes.

Future work I'm hoping to do:

  • Testing larger tile sizes
  • Profiling register/shared memory usage and finding the break-even point
