> `Experimental`

## Why eBPF on GPU?

GPUs are widely used for ML workloads and are typically SIMT (Single Instruction, Multiple Thread) accelerators: threads execute in warps on streaming multiprocessors (SMs), are grouped into blocks, and are launched as kernels. These kernels work against a complex multi-level memory hierarchy of registers, shared memory/LDS (Local Data Share), L2 cache, and device memory, and GPUs have only limited preemption capabilities compared to CPUs. This architectural complexity creates rich but hard-to-observe behavior, particularly when diagnosing performance bottlenecks, memory access patterns, warp divergence, or resource contention issues.

### The Problem with Current GPU Observability Tools

Today's GPU tracing and profiling landscape suffers from two major limitations:

**1. CPU-Boundary Tools Treat the GPU as a Black Box**

Many tracing tools operate at the CPU boundary, placing probes on CUDA userspace libraries (such as `libcuda.so` and `libcudart.so`) or on kernel drivers. While these tools can capture host-side events such as kernel launches, memory copies, and API calls, they treat the GPU device as a black box. This approach provides:

- No visibility into what happens *inside* a running kernel
- Weak or no linkage to device-side events like warp stalls, bank conflicts, or memory traffic patterns
- No ability to safely adapt or modify kernel behavior in-flight based on runtime conditions
- Limited correlation between host actions and device-side performance issues

**2. GPU-Specific Profilers Are Siloed**

Device-side profilers like NVIDIA's CUPTI and NVBit, Intel's GTPin, and Neutrino do provide detailed device-side visibility, including instruction-level profiling, memory traces, and warp execution analysis. However, they suffer from:

- **Vendor lock-in**: Each tool is typically tied to a specific GPU vendor (NVIDIA, AMD, Intel)
- **Isolation from eBPF ecosystems**: These tools don't integrate with Linux's eBPF infrastructure, making it difficult to correlate GPU events with system-wide observability data from kprobes, uprobes, tracepoints, or network events
- **Limited programmability**: Most provide fixed metrics rather than user-programmable instrumentation
- **High overhead**: Binary instrumentation tools can introduce significant performance overhead (e.g., NVBit can be 10-100x slower)

### bpftime's Unified Approach

**bpftime bridges this gap** by offloading eBPF programs directly into GPU device contexts, bringing the same programmability model that revolutionized kernel observability to GPUs. The implementation includes:

**GPU-Side Attach Points:**

- Device function entry/exit for profiling kernel execution
- Block begin/end for tracking thread block lifecycle
- Barrier/synchronization points for analyzing warp coordination
- Memory operation hooks for capturing access patterns
- Stream operation events for tracking asynchronous execution
39
+
40
+
**eBPF-to-GPU Compilation Pipeline:**
41
+
- Compiles standard eBPF bytecode into GPU-native instruction sets (PTX for NVIDIA, SPIR-V for AMD)
42
+
- Includes full verifier support to ensure safety and prevent crashes
43
+
- Provides GPU-optimized helper functions for timing, thread identification, and map operations
44
+
- Supports standard eBPF maps (hash, array, ringbuf) with GPU-resident variants for zero-copy access

This unified approach enables:

- **3-10x faster performance** than binary instrumentation tools like NVBit
- **Vendor-neutral design** that works across NVIDIA and AMD GPUs
- **Unified observability** with Linux kernel eBPF programs (kprobes, uprobes)
- **Fine-grained profiling** at the warp or instruction level
- **Adaptive GPU kernel memory optimization** and programmable scheduling across SMs
- **Accelerated eBPF applications** by leveraging GPU compute power