> `Experimental`

## Why eBPF on GPU?

GPUs are widely used for ML workloads and are typically SIMT (Single Instruction, Multiple Thread) accelerators: threads execute in warps on streaming multiprocessors (SMs), are grouped into blocks, and are launched as kernels. These kernels work against a complex multi-level memory hierarchy of registers, shared memory/LDS (Local Data Share), L2 cache, and device memory, and GPUs have only limited preemption capabilities compared to CPUs. This architectural complexity creates rich but hard-to-observe behavior, particularly when diagnosing performance bottlenecks, memory access patterns, warp divergence, or resource contention issues.

### The Problem with Current GPU Observability Tools

Today's GPU tracing and profiling landscape suffers from two major limitations:

**1. CPU-Boundary Tools Treat the GPU as a Black Box**

Many tracing tools operate at the CPU boundary, placing probes on CUDA userspace libraries (such as `libcuda.so` and `libcudart.so`) or on kernel drivers. While these tools can capture host-side events such as kernel launches, memory copies, and API calls, they treat the GPU device as a black box. This approach provides:

- No visibility into what happens *inside* a running kernel
- Weak or no linkage to device-side events like warp stalls, bank conflicts, or memory traffic patterns
- No ability to safely adapt or modify kernel behavior in-flight based on runtime conditions
- Limited correlation between host actions and device-side performance issues

**2. GPU-Specific Profilers Are Siloed**

Device-side profilers like NVIDIA's CUPTI and NVBit, Intel's GTPin, and Neutrino do provide detailed device-side visibility, including instruction-level profiling, memory traces, and warp execution analysis. However, they suffer from:

- **Vendor lock-in**: Each tool is typically tied to a specific GPU vendor (NVIDIA, AMD, Intel)
- **Isolation from eBPF ecosystems**: These tools don't integrate with Linux's eBPF infrastructure, making it difficult to correlate GPU events with system-wide observability data from kprobes, uprobes, tracepoints, or network events
- **Limited programmability**: Most provide fixed metrics rather than user-programmable instrumentation
- **High overhead**: Binary instrumentation tools can introduce significant performance overhead (e.g., NVBit can be 10-100x slower)

### bpftime's Unified Approach

**bpftime bridges this gap** by offloading eBPF programs directly into GPU device contexts, bringing the same programmability model that revolutionized kernel observability to GPUs. The implementation includes:

**GPU-Side Attach Points:**

- Device function entry/exit for profiling kernel execution
- Block begin/end for tracking thread block lifecycle
- Barrier/synchronization points for analyzing warp coordination
- Memory operation hooks for capturing access patterns
- Stream operation events for tracking asynchronous execution
39
+
40
+
**eBPF-to-GPU Compilation Pipeline:**
41
+
- Compiles standard eBPF bytecode into GPU-native instruction sets (PTX for NVIDIA, SPIR-V for AMD)
42
+
- Includes full verifier support to ensure safety and prevent crashes
43
+
- Provides GPU-optimized helper functions for timing, thread identification, and map operations
44
+
- Supports standard eBPF maps (hash, array, ringbuf) with GPU-resident variants for zero-copy access

This unified approach enables:

- **3-10x faster performance** than binary instrumentation tools like NVBit
- **Vendor-neutral design** that works across NVIDIA and AMD GPUs
- **Unified observability** with Linux kernel eBPF programs (kprobes, uprobes)
- **Fine-grained profiling** at the warp or instruction level
- **Adaptive GPU kernel memory optimization** and programmable scheduling across SMs
- **Accelerated eBPF applications** by leveraging GPU compute power