
Commit 12a4f1e

yunwei37 and Copilot authored

Enhance CUDA benchmarking and add memtrace (#453)
* docs: Revise GPU README to enhance clarity on eBPF observability challenges and bpftime's unified approach
* docs: Update installation instructions to include new CUDA/GPU attach support and additional build options
* docs: Update NVBit installation instructions and streamline Makefile for auto-download
* docs: Update Makefile and README examples for CUDA benchmarks and correct paths
* Add CUDA vector addition benchmark with eBPF and NVBit instrumentation
  - Introduced a new benchmark for CUDA vector addition that utilizes eBPF and NVBit for performance monitoring.
  - Created Makefile for building the benchmark and its components.
  - Implemented eBPF program (cuda_probe.bpf.c) to trace CUDA kernel invocations.
  - Developed userspace loader (cuda_probe.c) to load and attach the eBPF program.
  - Added NVBit instrumentation (nvbit_vec_add.cu) to measure kernel execution time.
  - Included vector addition kernel (vec_add.cu) for benchmarking.
  - Updated README.md with instructions for building and running the benchmark.
  - Added mem_trace example to trace CUDA kernel invocations using eBPF.
  - Created Makefile and README.md for mem_trace example.
  - Implemented eBPF program (mem_trace.bpf.c) to monitor CUDA kernel invocations.
  - Developed userspace loader (mem_trace.c) for the mem_trace example.
  - Added vector addition kernel (vec_add.cu) for the mem_trace example.
* feat: Add CUDA benchmark runner script for performance comparison
* Update benchmark/gpu/run_cuda_bench.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
1 parent edf049a commit 12a4f1e
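The commit adds a benchmark runner (`run_cuda_bench.py`) for comparing kernel timings with and without instrumentation. The core of such a comparison can be sketched in a few lines of Python; the function names and sample numbers below are illustrative, not the script's actual API:

```python
import statistics

def summarize(times_ms):
    """Mean and standard deviation of per-run kernel timings, in ms."""
    return statistics.mean(times_ms), statistics.stdev(times_ms)

def overhead_ratio(baseline_ms, instrumented_ms):
    """How many times slower the instrumented runs are than the baseline."""
    return summarize(instrumented_ms)[0] / summarize(baseline_ms)[0]

# Hypothetical timings: plain vec_add vs. the same kernel under an NVBit tool.
baseline = [1.0, 1.1, 0.9, 1.0]
instrumented = [10.2, 9.8, 10.0, 10.4]
print(f"overhead: {overhead_ratio(baseline, instrumented):.1f}x")  # → overhead: 10.1x
```

A real runner would collect these samples by invoking the benchmark binaries under each configuration rather than hard-coding them.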

File tree

23 files changed: +948 −341 lines

.gitignore

Lines changed: 1 addition & 0 deletions

```diff
@@ -61,3 +61,4 @@ example/attach_implementation/benchmark/wasm-micro-runtime/
 
 build
 target
+
```

README.md

Lines changed: 5 additions & 18 deletions

```diff
@@ -27,8 +27,8 @@ bpftime is not `userspace eBPF VM`, it's a userspace runtime framework includes
 
 ## Key Features
 
-- **Dynamic Binary rewriting**: Run eBPF programs in userspace, attaching them to `Uprobes` and `Syscall tracepoints`: **No manual instrumentation or restart required!**. It can `trace` or `change` the execution of a function, `hook` or `filter` all syscalls of a process safely, and efficiently with an eBPF userspace runtime. Can inject eBPF runtime into any running process without the need for a restart or manual recompilation.
-- **Performance**: Experience up to a `10x` speedup in Uprobe overhead compared to kernel uprobe and uretprobe. Read/Write userspace memory is also faster than kernel eBPF.
+- **Dynamic Binary rewriting**: Run eBPF programs in userspace, attaching them to `Uprobes`, `Syscall tracepoints` and inside `GPU` kernel: **No manual instrumentation or restart required!**. It can `trace` or `change` the execution of a function, `hook` or `filter` all syscalls of a process safely, and efficiently with an eBPF userspace runtime. Can inject eBPF runtime into any running process without the need for a restart or manual recompilation.
+- **Performance**: Experience up to a `10x` speedup in Uprobe overhead compared to kernel uprobe and uretprobe, up to a 10x faster than `NVbit`. Read/Write userspace memory is also faster than kernel eBPF.
 - **Interprocess eBPF Maps**: Implement userspace `eBPF maps` in shared userspace memory for summary aggregation or control plane communication.
 - **Compatibility**: use `existing eBPF toolchains` like clang, libbpf and bpftrace to develop userspace eBPF application without any modifications. Supporting CO-RE via BTF, and offering userspace `ufunc` access.
 - **Multi JIT Support**: Support [llvmbpf](https://github.com/eunomia-bpf/llvmbpf), a high-speed `JIT/AOT` compiler powered by LLVM, or using `ubpf JIT` and INTERPRETER. The vm can be built as `a standalone library` like ubpf.
@@ -112,20 +112,7 @@ See [eunomia.dev/bpftime/documents/usage](https://eunomia.dev/bpftime/documents/
 
 ## Examples & Use Cases
 
-For more examples and details, please refer to [eunomia.dev/bpftime/documents/examples/](https://eunomia.dev/bpftime/documents/examples/) webpage.
-
-Examples including:
-
-- [Minimal examples](https://github.com/eunomia-bpf/bpftime/tree/master/example/minimal) of eBPF programs.
-- eBPF `Uprobe/USDT` tracing and `syscall tracing`:
-  - [sslsniff](https://github.com/eunomia-bpf/bpftime/tree/master/example/sslsniff) for trace SSL/TLS unencrypted data.
-  - [opensnoop](https://github.com/eunomia-bpf/bpftime/tree/master/example/opensnoop) for trace file open syscalls.
-  - More [bcc/libbpf-tools](https://github.com/eunomia-bpf/bpftime/tree/master/example/libbpf-tools).
-  - Run with [bpftrace](https://github.com/eunomia-bpf/bpftime/tree/master/example/bpftrace) commands or scripts.
-- [error injection](https://github.com/eunomia-bpf/bpftime/tree/master/example/error-inject): change function behavior with `bpf_override_return`.
-- Use the eBPF LLVM JIT/AOT vm as [a standalone library](https://github.com/eunomia-bpf/llvmbpf/tree/main/example).
-- Userspace [XDP with DPDK and AF_XDP](https://github.com/userspace-xdp/userspace-xdp)
-- [CUDA eBPF Probe/Retprobe Example](https://github.com/eunomia-bpf/bpftime/tree/master/example/cuda-counter)
+For more examples and details, please refer to [eunomia.dev/bpftime/documents/examples/](https://eunomia.dev/bpftime/documents/examples/) webpage and [example](https://github.com/eunomia-bpf/bpftime/tree/master/example/) dir.
 
 ## In-Depth
 
@@ -153,7 +140,7 @@ Current hook implementation is based on binary rewriting and the underly techniq
 
 - Userspace function hook: [frida-gum](https://github.com/frida/frida-gum)
 - Syscall hooks: [zpoline](https://www.usenix.org/conference/atc23/presentation/yasukata) and [pmem/syscall_intercept](https://github.com/pmem/syscall_intercept).
-- GPU hooks: our new implement by convert eBPF into PTX and inject into GPU kernel. See [attach/nv_attach_impl](https://github.com/eunomia-bpf/bpftime/tree/master/attach/nv_attach_impl) for more details.
+- GPU hooks: our new implementation by converting eBPF into PTX and injecting into GPU kernels. See [attach/nv_attach_impl](https://github.com/eunomia-bpf/bpftime/tree/master/attach/nv_attach_impl) for more details.
 - XDP with DPDK. See the [uXDP paper](https://dl.acm.org/doi/10.1145/3748355.3748360) for more details.
 
 The hook can be easily replaced with other DBI methods or frameworks, to make it a general extension framework. See our OSDI '25 paper [Extending Applications Safely and Efficiently](https://www.usenix.org/conference/osdi25/presentation/zheng-yusheng) for details.
@@ -180,7 +167,7 @@ This project is licensed under the MIT License.
 
 ## Contact and citations
 
-Have any questions or suggestions on future development? Free free to open an issue or contact
+Have any questions or suggestions on future development? Feel free to open an issue or contact
 <yunwei356@gmail.com> !
 
 Our OSDI '25 paper: <https://www.usenix.org/conference/osdi25/presentation/zheng-yusheng>
```
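The README's "Interprocess eBPF Maps" feature describes maps kept in shared userspace memory so that multiple processes can aggregate counters into them. A conceptual analogue of that pattern in Python follows; this is not bpftime's implementation, and a real map would use atomic updates rather than this unsynchronized read-modify-write:

```python
from multiprocessing import shared_memory
import struct

# A tiny "array map": NSLOTS unsigned 64-bit counters in a named
# shared-memory segment that any cooperating process could open by name.
NSLOTS = 4
shm = shared_memory.SharedMemory(create=True, size=8 * NSLOTS)

def bump(slot):
    """Increment one counter, as a probe would on each traced event."""
    off = 8 * slot
    (val,) = struct.unpack_from("<Q", shm.buf, off)
    struct.pack_into("<Q", shm.buf, off, val + 1)

def read(slot):
    (val,) = struct.unpack_from("<Q", shm.buf, 8 * slot)
    return val

for _ in range(3):
    bump(0)
count = read(0)
print(count)  # → 3

shm.close()
shm.unlink()
```

A second process would attach with `shared_memory.SharedMemory(name=...)` and see the same counters, which is the essence of using a shared map for summary aggregation.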

benchmark/cuda/README.md

Lines changed: 0 additions & 244 deletions
This file was deleted.
File renamed without changes.

benchmark/cuda/Makefile renamed to benchmark/gpu/Makefile

Lines changed: 24 additions & 8 deletions

```diff
@@ -81,8 +81,10 @@ else
 NVCC_OPT=-O3
 endif
 
-# Path to NVBit - adjust this to your system's NVBit location
-NVBIT_PATH=$(HOME)/yunwei37/nvbit_release_x86_64/core
+# Path to NVBit - NVBit is NOT part of CUDA Toolkit, will be auto-downloaded if needed
+# Adjust this to your system's NVBit installation location
+NVBIT_PATH?=$(HOME)/nvbit_release_x86_64/core
+NVBIT_RELEASE_DIR?=$(HOME)/nvbit_release_x86_64
 NVBIT_INCLUDES=-I$(NVBIT_PATH)
 
 NVBIT_LIBS=-L$(NVBIT_PATH) -lnvbit
@@ -92,11 +94,23 @@ NVBIT_SOURCES=nvbit_vec_add.cu nvbit_timing_funcs.cu
 NVBIT_OBJECTS=$(NVBIT_SOURCES:.cu=.o)
 CUDA_ARCH?=all
 
-nvbit_vec_add.so: $(NVBIT_OBJECTS) $(NVBIT_PATH)/libnvbit.a
-	$(NVCC) -arch=$(CUDA_ARCH) $(NVCC_OPT) $(NVBIT_OBJECTS) $(NVBIT_LIBS) $(NVCC_PATH) -lcuda -lcudart_static -shared -o $@
+# Auto-download and install NVBit if not found
+$(NVBIT_PATH)/libnvbit.a:
+	@echo "NVBit not found, downloading version 1.7.6 with SM_120 support..."
+	@mkdir -p $(HOME)
+	@cd $(HOME) && \
+		rm -rf nvbit_release_x86_64 && \
+		wget -q https://github.com/NVlabs/NVBit/releases/download/v1.7.6/nvbit-Linux-x86_64-1.7.6.tar.bz2 && \
+		tar xjf nvbit-Linux-x86_64-1.7.6.tar.bz2 && \
+		rm nvbit-Linux-x86_64-1.7.6.tar.bz2
+	@echo "NVBit 1.7.6 installed to $(NVBIT_RELEASE_DIR)"
 
-%.o: %.cu
-	$(NVCC) -dc -c -std=c++11 $(NVBIT_INCLUDES) -Xptxas -cloning=no -Xcompiler -Wall -arch=$(CUDA_ARCH) $(NVCC_OPT) -Xcompiler -fPIC $< -o $@
+nvbit_vec_add.so: vec_add nvbit_vec_add.o nvbit_timing_funcs.o $(NVBIT_PATH)/libnvbit.a
+	g++ -shared -fPIC nvbit_vec_add.o nvbit_timing_funcs.o \
+		$(NVBIT_LIBS) $(NVCC_PATH) -lcuda -lcudart_static -lpthread -ldl -o $@
+
+nvbit_vec_add.o: nvbit_vec_add.cu
+	$(NVCC) -c -std=c++11 $(NVBIT_INCLUDES) -Xcompiler -Wall $(NVCC_OPT) -Xcompiler -fPIC $< -o $@
 
 nvbit_timing_funcs.o: nvbit_timing_funcs.cu
 	$(NVCC) $(NVBIT_INCLUDES) -Xptxas -astoolspatch --keep-device-functions -arch=$(CUDA_ARCH) -Xcompiler -Wall -Xcompiler -fPIC -c $< -o $@
@@ -166,11 +180,13 @@ $(APPS): %: $(OUTPUT)/%.o $(LIBBPF_OBJ) | $(OUTPUT)
 
 # Run the CUDA vector addition benchmark with NVBit instrumentation
 run_nvbit: nvbit_vec_add.so
-	CUDA_VISIBLE_DEVICES=0 LD_PRELOAD=./nvbit_vec_add.so ./vec_add 100
+	CUDA_VISIBLE_DEVICES=0 LD_PRELOAD=./nvbit_vec_add.so ./vec_add
 
 # Run with verbose output
 run_nvbit_verbose: nvbit_vec_add.so
-	CUDA_VISIBLE_DEVICES=0 LD_PRELOAD=./nvbit_vec_add.so TOOL_VERBOSE=1 ./vec_add 100
+	CUDA_VISIBLE_DEVICES=0 LD_PRELOAD=./nvbit_vec_add.so TOOL_VERBOSE=1 ./vec_add
+
+nvbit: nvbit_vec_add.so
 
 # delete failed targets
 .DELETE_ON_ERROR:
```
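The `run_nvbit` target above amounts to launching the benchmark binary with the NVBit tool library preloaded. A runner script can assemble the same environment programmatically; this is a sketch, and the helper name `nvbit_env` is hypothetical:

```python
import os

def nvbit_env(tool_so="./nvbit_vec_add.so", device="0", verbose=False):
    """Environment for running a CUDA binary under an NVBit tool:
    LD_PRELOAD makes the dynamic loader inject the tool at startup."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = device  # pin the benchmark to one GPU
    env["LD_PRELOAD"] = tool_so
    if verbose:
        env["TOOL_VERBOSE"] = "1"  # matches the run_nvbit_verbose target
    return env

env = nvbit_env(verbose=True)
print(env["LD_PRELOAD"], env["TOOL_VERBOSE"])  # → ./nvbit_vec_add.so 1
```

Passing this to `subprocess.run(["./vec_add"], env=nvbit_env())` mirrors what `make run_nvbit` does with `LD_PRELOAD=./nvbit_vec_add.so ./vec_add`.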
