Skip to content

Bump hyperlight-host to new snapshot api #128

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

Conversation

ludfjig
Copy link
Contributor

@ludfjig ludfjig commented Jul 24, 2025

Bumps hyperlight dependency. Reworks uses of MultiuseSandboxContext to use snapshots instead.

Memory is no longer restored after a call to LoadedWasmSandbox::call_guest_function which revealed that memory is leaked during mashalling, and a missing post_call after calling a wasmtime func. In addition, a Store and wasm Instance are no longer created per guest function call. Rather, it's being created once and reused.

Copy link
Contributor

@dblnz dblnz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work updating hyperlight-wasm to use the new snapshoting API.
I left a couple of comments/questions. Let me know what you think.

marosset and others added 2 commits July 28, 2025 10:51
…ead of mshv2)

Signed-off-by: Mark Rossett <marosset@microsoft.com>
Update all hyperlight dependencies from various revisions to a unified
172fcfa69b0f9064c7a0e48e512f8a86ae1fdbe1 snapshot that includes
snapshot/restore functionality.

Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>
Replace transition-based sandbox evolution with direct snapshot/restore API:

- Remove EvolvableSandbox/DevolvableSandbox trait usage
- Replace transition callbacks with direct method calls
- Add snapshot/restore methods to LoadedWasmSandbox
- Store initial snapshots in WasmSandbox for efficient unloading
- Export Snapshot type in public API

This provides more direct control over sandbox state management
and enables features like checkpoint/restore.

Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>
metrics::gauge!(METRIC_ACTIVE_WASM_SANDBOXES).increment(1);
metrics::counter!(METRIC_TOTAL_WASM_SANDBOXES).increment(1);
WasmSandbox { inner: Some(inner) }
Ok(WasmSandbox {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like a slightly weird separation of concerns that we do the restore in LoadedWasmSandbox and then actually create this here. Does it make sense to either move this code into LoadedWasmSandbox or move the restore into here? In a sense, for modularity, it feels to me like if WasmSandbox is going to bother to expose a constructor that tries to enforce its internal invariants, one of which is that inner is in a state that corresponds to snapshot, we should probably make the constructor enforce that if it can, rather than relying on the code calling it to do so?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The restore is made in here. Does that alleviate your concerns?

}
None => Err(new_error!("WasmSandbox is None, cannot load module")),
/// Helper function to finalize module loading and create LoadedWasmSandbox
fn finalize_module_load(mut self) -> Result<LoadedWasmSandbox> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is taking self by value, so why do we need to move out of these fields? Is it just about the fact that we have drop glue on this type?

One option to get rid of the dynamic check here would be ManuallyDrop, but I don't know if we like the tradeoff of (closer to what we actually want, no unnecessary dynamic checks) to increased unsafe code surface, and I suppose we would still need to have a flag that let the Drop impl know whether or not the contents had been moved (or use ManuallyDrop<Self> to write an into_inner() function that removes the contents of the sandbox, manually updates the counters, and never runs the drop glue at all).

Relatedly, I think that scattered throughout our errors in general, we have a mix of "this means the user did something wrong, and they should be better" with "this means this is a fundamental invariant that we believed would always be true but was somehow violated". I would like us to separate these in an easily visible way (akin to rustc errors vs ICEs), since they have quite different consequences. I wonder if it is reasonable to panic in the latter case, since if we were wrong about one invariant, who knows what else we were wrong about...

Sorry, this comment has become a bit long and large in scope, so feel free to not consider it a blocker for this PR, but perhaps think a little bit about whether the Option semantics are indeed what we want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with everything but will leave for another time. #136

}

/// Restore the state of the sandbox to the state captured in the given snapshot.
pub fn restore(&mut self, snapshot: &Snapshot) -> Result<()> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to have a pub fn new_from_dirty_sandbox_and_snapshot(sandbox_to_be_erased: MultiUseSandbox, loaded_snapshot: Snapshot) (probably with less verbose names) or something like that?

I think that the common use case for e.g. function services is going to be "have a pool of sandboxes in various states, and whenever a request comes in, grab one, restore a sanpshot for the correct customer into it, and continue", which means that you would indeed want a function here which can just take any hyperlight sandbox in any state and restore the snapshot of runtime + image into it.

Copy link
Contributor Author

@ludfjig ludfjig Aug 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that would work quite yet since restoring only works for snapshots from the same sandbox right now. But that does sound good in general

)?;

// Track any allocations for later cleanup
if !allocated_addrs.is_empty() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way that we are deferring the cleanup here makes me feel like we are not being precise about the lifetime and ownership of parameter/return data (in the informal sense, not the Rust sense). I guess that we are saying here that whenever you call into a sandbox, the parameters and the return values from any host call, both live until you return from that call. That seems like a bit of a strange and arbitrary invariant to me, and if we really want to go down that path we should definitely document it.

Why doesn't the wasm inside the component free these values itself, whenever it is done with them? It seems like their lifetimes only affect wasm memory (since there aren't any e.g. resource handles which refer out of the sandbox in the module variant), and so it seems like it ought to be a guest concern to free them? I think that would make all of this more similar to the component model design as well.

Instead of enforcing this lifetime this way, we could perhaps think about finding a way to add something to audit the guest heap usage, or something like that? I'm not super sure what would make sense here, but I don't really like this approach to deciding how long the memory in wasm lives.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Fixed, please have another look at marshal.rs top of file comment.

ludfjig added 2 commits August 1, 2025 12:39
- Reuse wasmtime Store and Instance across guest function calls instead
  of creating new one per call.
- Establish memory contract between host and guest.
 - Guest functions takes ownership of input parameters
 - Guest transfer ownership of return values
 - Host functions parameters are borrowed from guest
 - Host function return values are owned by guest and guest must free them.
- Component: Add post_return calls for proper WASM function cleanup
- Fix ABI mismatch in parameter of guest_dispatch_function

Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>
Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants