
Commit bcf1d4c

feat: add GPU lowering to selene-hugr-qis-compiler (#1169)
This PR provides lowering for qsystem's GPU extension for selene-specific workflows. This is my first compilation PR, so if it's janky I apologise in advance. There were some battles getting to grips with inkwell, but I'm very happy with the result.

# Background

The qsystem GPU extension provides a general mechanism for declaring and using the API of a GPU library. For generality, this mechanism involves:

- the user defining functions with:
  - any (valid) function name,
  - any arrangement of integer, floating-point, or boolean arguments,
  - a return type of void, float, or integer;
- function names having some mapping to an integer tag;
- the ability to acquire and discard a GPU context;
- the ability to invoke the aforementioned functions given a GPU context, an integer tag, and the appropriate parameters;
- the ability to retrieve results after invoking said functions.

The intent is that a library appropriately fulfilling the user's specifications is linked with the user program in order to provide the required functionality. In Selene, this looks like adding the library as a `utility` at the build stage.

As this is the first such utility that represents a generic use case, breakages could get ugly. As such, I took some care to handle future breaking changes.

# API design

This lowering assumes the following function signatures in the library to be linked with the user program:

```c
char const* gpu_get_error();
```

This provides a string representing an error, to be made available when any of the following functions fails. It should return `nullptr` if there is no error to fetch.

Each of the following functions returns a bool representing success (`true`) or failure (`false`). Each result is validated and, upon `false`, an extended panic message is provided to the user containing the result of `gpu_get_error()` (or "No error message available" if `nullptr` is returned).
I use Selene's `panic_str` QIS extension to emit panics that are not constrained to 255 bytes, as a custom library may wish to provide far more detail than would fit in 255 bytes. Given that this lowering is specific to Selene, I don't believe deviating from standard QIS will be a controversial choice.

```c
bool gpu_validate_api(uint64_t major, uint64_t minor, uint64_t patch);
```

We do not currently know how the remainder of the API will change over time:

- It's feasible that array arguments, strings, etc. could be added at a later time. This will particularly impact the signature API (see the `gpu_call` description).
- Returning different kinds of data may be useful in future (and so fetching results may require additional approaches).

As such, this should be the first call to the GPU library, passing an API version (currently 0.1.0, the version I'm assigning to the API described here). It is invoked before any function that depends on breakable aspects of the API, with caching to avoid multiple calls. It should return `true` on success, `false` otherwise. Upon failure, `gpu_get_error()` is called, and therefore **these two functions must be kept compatible in future editions** if we opt to make breaking changes in future. The remaining functions may be broken in future editions, and as long as `gpu_validate_api` is managed appropriately on the library side, we should at least be able to fail early (at linking or early at runtime) to prevent undefined behaviour (e.g. by invoking a function with a modified signature).

```c
bool gpu_init(uint64_t _reserved, uint64_t* gpu_ref_out);
```

The `_reserved` parameter may be used in future if we wish to use multiple instances of GPU libraries, e.g. for maintaining distinct state between them.

```c
bool gpu_discard(uint64_t gpu_ref);
```

After this call, the library should free resources associated with the `gpu_ref` handle, and further invocations using this handle should error.
```c
bool gpu_get_function_id(char const* name, uint64_t* id_out);
```

Some implementations may wish to map functions to indices in an array, hash the incoming names, or something else. This is also an opportune moment to return `false` and thus fail early if the requested function name is not supported by the linked library.

```c
bool gpu_call(
    uint64_t handle,
    uint64_t function_id,
    uint64_t blob_size,
    char const* blob,
    char const* signature
);
```

The `handle` and `function_id` parameters should be apparent at this point. When a function is invoked, it requires parameters; this is where the blob and signature come in:

- Parameters are packed (as bytes) into an array, which is passed to `gpu_call`.
  - I chose this over varargs because varargs aren't particularly friendly to work with in a diagnostic setting. They can't be passed on to other functions, for example.
- The blob size is also passed to the library, which allows for quick validation and safer parsing.
- For completeness, a signature string (generated at compile time) is also passed as an argument.
  - Types are encoded as `i64 => i, u64 => i, f64 => f, bool => b`, with the inputs and return type separated by a colon.
  - The resulting signature string is in the form e.g.
    - `iifb:v` for `(int64, uint64, float, bool) -> void`
    - `b:f` for `(bool) -> float`
  - This is also an opportunity for validation. _Perhaps it should be moved to `gpu_get_function_id` so that validation takes place earlier._

```c
bool gpu_get_result(uint64_t gpu_ref, uint64_t out_len, char* out_result);
```

This extracts a result from a previous call, in FIFO order. Currently the only `out_len` supported by the underlying hugr is 64 bits, as functions are assumed to return double or uint64, but this can be changed at a later date without breaking the API. The lowering provided in this PR handles the casting and alignment.
# Controversial choices

- It might be more appropriate to return an integer error code rather than a boolean success flag. However, error codes are primarily useful where we have a defined system for handling different forms of error. We don't really have a way of catching those errors in user code, so we always end up either continuing or terminating; a bool felt satisfactory here. This could be broken in future - all we need to do is keep the bool return for `gpu_validate_api`, which is a clear success-or-fail case anyway.
- The use of `panic_str` may seem odd, but I found it very useful while implementing this. A library I am testing provides stack traces upon failure, and this helped identify issues with my calls - directly in panic messages on the other side, where a normal `panic()` would have truncated them. If it helped me, it will help users.
- I chose to force the validation of returned function statuses _not_ to be inlined. It's a two-line removal if we want it inline. The primary reason is that the error handling needlessly clutters otherwise-clean LLVM IR. If that isn't a good enough reason, I can remove it.
1 parent b82c8ad commit bcf1d4c

File tree

32 files changed: +3192 -0 lines changed

.gitignore

Lines changed: 3 additions & 0 deletions

```diff
@@ -38,3 +38,6 @@ jupyter_execute/
 
 # binaries
 *.so
+
+# a reserved working dir for local experimentation
+/scratch
```

Cargo.lock

Lines changed: 2 additions & 0 deletions
Some generated files are not rendered by default.

qis-compiler/Cargo.toml

Lines changed: 2 additions & 0 deletions

```diff
@@ -19,6 +19,7 @@ pyo3 = { workspace = true, features = ["abi3-py310", "anyhow"] }
 serde_json.workspace = true
 tracing.workspace = true
 itertools.workspace = true
+strum.workspace = true
 tket = { path = "../tket", version = "0.16.0" }
 tket-qsystem = { path = "../tket-qsystem", version = "0.22.0", features = [
     "llvm",
@@ -36,6 +37,7 @@ pyo3-build-config.workspace = true
 rstest.workspace = true
 serde.workspace = true
 typetag.workspace = true
+insta = "1.43"
 
 [package.metadata.cargo-machete]
 ignored = ["cbindgen", "pyo3-build-config"]
```
6.64 KB binary file not shown.
Lines changed: 204 additions & 0 deletions
```llvm
; ModuleID = 'hugr'
source_filename = "hugr"
target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-apple-darwin"

@gpu_cache_is_set_function_id_fn_returning_float = thread_local local_unnamed_addr global i8 0
@gpu_cache_function_id_fn_returning_float = thread_local local_unnamed_addr global i64 0
@gpu_validated = thread_local local_unnamed_addr global i8 0
@no_gpu_error = private unnamed_addr constant [27 x i8] c"No error message available\00", align 1
@function_name = private unnamed_addr constant [19 x i8] c"fn_returning_float\00", align 1
@gpu_cache_is_set_function_id_fn_returning_int = thread_local local_unnamed_addr global i8 0
@gpu_cache_function_id_fn_returning_int = thread_local local_unnamed_addr global i64 0
@function_name.1 = private unnamed_addr constant [17 x i8] c"fn_returning_int\00", align 1
@arg_types = private unnamed_addr constant [5 x i8] c"if:i\00", align 1
@res_a.19FB4E83.0 = private constant [11 x i8] c"\0AUSER:INT:a"
@arg_types.2 = private unnamed_addr constant [4 x i8] c"i:f\00", align 1
@res_b.0E048F9C.0 = private constant [13 x i8] c"\0CUSER:FLOAT:b"

; Function Attrs: noinline
define i64 @gpu_function_id_fn_returning_float() local_unnamed_addr #0 {
entry:
  %function_id = load i8, i8* @gpu_cache_is_set_function_id_fn_returning_float, align 1
  %needs_lookup = icmp eq i8 %function_id, 0
  br i1 %needs_lookup, label %lookup, label %read_cache

common.ret:                                       ; preds = %read_cache, %lookup
  %common.ret.op = phi i64 [ %function_id2, %lookup ], [ %function_id1, %read_cache ]
  ret i64 %common.ret.op

lookup:                                           ; preds = %entry
  tail call void @run_gpu_validation()
  %function_id_ptr = alloca i64, align 8
  %function_id_call = call i8 @gpu_get_function_id(i8* getelementptr inbounds ([19 x i8], [19 x i8]* @function_name, i64 0, i64 0), i64* nonnull %function_id_ptr)
  call void @validate_gpu_response(i8 %function_id_call)
  %function_id2 = load i64, i64* %function_id_ptr, align 8
  store i64 %function_id2, i64* @gpu_cache_function_id_fn_returning_float, align 8
  store i8 1, i8* @gpu_cache_is_set_function_id_fn_returning_float, align 1
  br label %common.ret

read_cache:                                       ; preds = %entry
  %function_id1 = load i64, i64* @gpu_cache_function_id_fn_returning_float, align 8
  br label %common.ret
}

; Function Attrs: noinline
define void @run_gpu_validation() local_unnamed_addr #0 {
entry:
  %validated = load i8, i8* @gpu_validated, align 1
  %already_validated.not = icmp eq i8 %validated, 0
  br i1 %already_validated.not, label %validate, label %common.ret

common.ret:                                       ; preds = %entry, %validate
  ret void

validate:                                         ; preds = %entry
  %validate_call = tail call i8 @gpu_validate_api(i64 0, i64 1, i64 0)
  tail call void @validate_gpu_response(i8 %validate_call)
  store i8 1, i8* @gpu_validated, align 1
  br label %common.ret
}

declare i8 @gpu_validate_api(i64, i64, i64) local_unnamed_addr

; Function Attrs: noinline
define void @validate_gpu_response(i8 %0) local_unnamed_addr #0 {
entry:
  %success.not = icmp eq i8 %0, 0
  br i1 %success.not, label %err, label %ok

ok:                                               ; preds = %entry
  ret void

err:                                              ; preds = %entry
  tail call void @gpu_error_handler()
  unreachable
}

; Function Attrs: noinline noreturn
define void @gpu_error_handler() local_unnamed_addr #1 {
entry:
  %error_message = tail call i8* @gpu_get_error()
  %is_null = icmp eq i8* %error_message, null
  %error_message_nonnull = select i1 %is_null, i8* getelementptr inbounds ([27 x i8], [27 x i8]* @no_gpu_error, i64 0, i64 0), i8* %error_message
  tail call void @panic_str(i32 70002, i8* %error_message_nonnull)
  unreachable
}

declare i8* @gpu_get_error() local_unnamed_addr

; Function Attrs: noreturn
declare void @panic_str(i32, i8*) local_unnamed_addr #2

declare i8 @gpu_get_function_id(i8*, i64*) local_unnamed_addr

; Function Attrs: noinline
define i64 @gpu_function_id_fn_returning_int() local_unnamed_addr #0 {
entry:
  %function_id = load i8, i8* @gpu_cache_is_set_function_id_fn_returning_int, align 1
  %needs_lookup = icmp eq i8 %function_id, 0
  br i1 %needs_lookup, label %lookup, label %read_cache

common.ret:                                       ; preds = %read_cache, %lookup
  %common.ret.op = phi i64 [ %function_id2, %lookup ], [ %function_id1, %read_cache ]
  ret i64 %common.ret.op

lookup:                                           ; preds = %entry
  tail call void @run_gpu_validation()
  %function_id_ptr = alloca i64, align 8
  %function_id_call = call i8 @gpu_get_function_id(i8* getelementptr inbounds ([17 x i8], [17 x i8]* @function_name.1, i64 0, i64 0), i64* nonnull %function_id_ptr)
  call void @validate_gpu_response(i8 %function_id_call)
  %function_id2 = load i64, i64* %function_id_ptr, align 8
  store i64 %function_id2, i64* @gpu_cache_function_id_fn_returning_int, align 8
  store i8 1, i8* @gpu_cache_is_set_function_id_fn_returning_int, align 1
  br label %common.ret

read_cache:                                       ; preds = %entry
  %function_id1 = load i64, i64* @gpu_cache_function_id_fn_returning_int, align 8
  br label %common.ret
}

declare i8 @gpu_init(i64, i64*) local_unnamed_addr

declare i8 @gpu_call(i64, i64, i64, i8*, i8*) local_unnamed_addr

declare i8 @gpu_get_result(i64, i64, i8*) local_unnamed_addr

declare void @print_int(i8*, i64, i64) local_unnamed_addr

declare i8 @gpu_discard(i64) local_unnamed_addr

declare void @print_float(i8*, i64, double) local_unnamed_addr

define i64 @qmain(i64 %0) local_unnamed_addr {
entry:
  %gpu_ref_ptr.i = alloca i64, align 8
  %gpu_input_blob.i = alloca [16 x i8], align 8
  %int_result.i = alloca i64, align 8
  %gpu_input_blob26.i = alloca i64, align 8
  %int_result32.i = alloca i64, align 8
  tail call void @setup(i64 %0)
  %1 = bitcast i64* %gpu_ref_ptr.i to i8*
  call void @llvm.lifetime.start.p0i8(i64 8, i8* nonnull %1)
  %2 = getelementptr inbounds [16 x i8], [16 x i8]* %gpu_input_blob.i, i64 0, i64 0
  call void @llvm.lifetime.start.p0i8(i64 16, i8* nonnull %2)
  %3 = bitcast i64* %int_result.i to i8*
  call void @llvm.lifetime.start.p0i8(i64 8, i8* nonnull %3)
  %4 = bitcast i64* %gpu_input_blob26.i to i8*
  call void @llvm.lifetime.start.p0i8(i64 8, i8* nonnull %4)
  %5 = bitcast i64* %int_result32.i to i8*
  call void @llvm.lifetime.start.p0i8(i64 8, i8* nonnull %5)
  %function_id_call.i = tail call i64 @gpu_function_id_fn_returning_float()
  %function_id_call3.i = tail call i64 @gpu_function_id_fn_returning_int()
  tail call void @run_gpu_validation()
  %gpu_ref_call.i = call i8 @gpu_init(i64 0, i64* nonnull %gpu_ref_ptr.i)
  call void @validate_gpu_response(i8 %gpu_ref_call.i)
  %gpu_ref.i = load i64, i64* %gpu_ref_ptr.i, align 8
  %6 = bitcast [16 x i8]* %gpu_input_blob.i to i64*
  store i64 42, i64* %6, align 8
  %dest_ptr17.i = getelementptr inbounds [16 x i8], [16 x i8]* %gpu_input_blob.i, i64 0, i64 8
  %7 = bitcast i8* %dest_ptr17.i to i64*
  store i64 4613303441197561744, i64* %7, align 8
  %8 = call i8 @gpu_call(i64 %gpu_ref.i, i64 %function_id_call3.i, i64 16, i8* nonnull %2, i8* getelementptr inbounds ([5 x i8], [5 x i8]* @arg_types, i64 0, i64 0))
  call void @validate_gpu_response(i8 %8)
  %read_status.i = call i8 @gpu_get_result(i64 %gpu_ref.i, i64 8, i8* nonnull %3)
  call void @validate_gpu_response(i8 %read_status.i)
  %int_result20.i = load i64, i64* %int_result.i, align 8
  call void @print_int(i8* getelementptr inbounds ([11 x i8], [11 x i8]* @res_a.19FB4E83.0, i64 0, i64 0), i64 10, i64 %int_result20.i)
  store i64 %int_result20.i, i64* %gpu_input_blob26.i, align 8
  %9 = call i8 @gpu_call(i64 %gpu_ref.i, i64 %function_id_call.i, i64 8, i8* nonnull %4, i8* getelementptr inbounds ([4 x i8], [4 x i8]* @arg_types.2, i64 0, i64 0))
  call void @validate_gpu_response(i8 %9)
  %read_status34.i = call i8 @gpu_get_result(i64 %gpu_ref.i, i64 8, i8* nonnull %5)
  call void @validate_gpu_response(i8 %read_status34.i)
  %float_result_ptr.i = bitcast i64* %int_result32.i to double*
  %float_result.i = load double, double* %float_result_ptr.i, align 8
  %10 = call i8 @gpu_discard(i64 %gpu_ref.i)
  call void @validate_gpu_response(i8 %10)
  call void @print_float(i8* getelementptr inbounds ([13 x i8], [13 x i8]* @res_b.0E048F9C.0, i64 0, i64 0), i64 12, double %float_result.i)
  call void @llvm.lifetime.end.p0i8(i64 8, i8* nonnull %1)
  call void @llvm.lifetime.end.p0i8(i64 16, i8* nonnull %2)
  call void @llvm.lifetime.end.p0i8(i64 8, i8* nonnull %3)
  call void @llvm.lifetime.end.p0i8(i64 8, i8* nonnull %4)
  call void @llvm.lifetime.end.p0i8(i64 8, i8* nonnull %5)
  %11 = call i64 @teardown()
  ret i64 %11
}

declare void @setup(i64) local_unnamed_addr

declare i64 @teardown() local_unnamed_addr

; Function Attrs: argmemonly nofree nosync nounwind willreturn
declare void @llvm.lifetime.start.p0i8(i64 immarg, i8* nocapture) #3

; Function Attrs: argmemonly nofree nosync nounwind willreturn
declare void @llvm.lifetime.end.p0i8(i64 immarg, i8* nocapture) #3

attributes #0 = { noinline }
attributes #1 = { noinline noreturn }
attributes #2 = { noreturn }
attributes #3 = { argmemonly nofree nosync nounwind willreturn }

!name = !{!0}

!0 = !{!"mainlib"}
```
