-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Implement FFI task context and task context provider #18918
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement FFI task context and task context provider #18918
Conversation
|
@ntjohnson1 a good starter FFI issue for review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR introduces FFI support for TaskContext and a new TaskContextProvider trait to enable cross-boundary function encoding/decoding. The changes address the limitation where registered UDFs couldn't be properly encoded/decoded across FFI boundaries due to the use of default session contexts.
Key changes:
- Added
TaskContextProvidertrait with implementations forSessionStateandSessionContext - Created FFI wrapper structures (
FFI_TaskContextandFFI_TaskContextProvider) for safe cross-boundary passing - Refactored session config handling to use direct
SessionConfiginstead of the wrapper type
Reviewed changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| datafusion/execution/src/task.rs | Defines the new TaskContextProvider trait |
| datafusion/execution/src/lib.rs | Exports the new TaskContextProvider trait |
| datafusion/core/src/execution/session_state.rs | Implements TaskContextProvider for SessionState |
| datafusion/core/src/execution/context/mod.rs | Implements TaskContextProvider for SessionContext |
| datafusion/ffi/src/session_config.rs | Refactors to use SessionConfig directly and adds library marker support |
| datafusion/ffi/src/table_provider.rs | Updates to use SessionConfig directly instead of wrapper |
| datafusion/ffi/src/execution/task_ctx.rs | Implements FFI wrapper for TaskContext |
| datafusion/ffi/src/execution/task_ctx_provider.rs | Implements FFI wrapper for TaskContextProvider with weak references |
| datafusion/ffi/src/execution/mod.rs | Module definition for execution-related FFI types |
| datafusion/ffi/src/lib.rs | Adds execution module to public API |
| datafusion/ffi/Cargo.toml | Adds required dependencies for execution support |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
ntjohnson1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM a few small clarifying questions/clarifications
| use datafusion_execution::config::SessionConfig; | ||
|
|
||
| /// A stable struct for sharing [`SessionConfig`] across FFI boundaries. | ||
| /// Instead of attempting to expose the entire SessionConfig interface, we |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm this is no longer storing ConfigOptions as private data. So this doc string doesn't seem to be totally correct. It still uses config options to go from FFI_ to SessionConfig but doesn't convert to config options to get into FFI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On further review I think the documentation is still correct. While we are no longer storing it as ConfigOptions in the private data, when we do pass it across the boundary we convert the session config into config options and pass those as described in the docstring.
| /// interface.s | ||
| pub struct ForeignSessionConfig(pub SessionConfig); | ||
|
|
||
| impl TryFrom<&FFI_SessionConfig> for ForeignSessionConfig { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like maybe a note that the from_string approach allows directly converting rather than extracting with Foreign. This breaks a little bit from the pattern but definitely seems nicer here.
| #[repr(C)] | ||
| #[derive(Debug, StableAbi)] | ||
| #[allow(non_camel_case_types)] | ||
| pub struct FFI_TaskContextProvider { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mentioned this is used primarily for data coming across protobuf so should this be set more internally? Ex. pub(crate) or something. You mentioned usage in follow up so can review this point in context there if that makes more sense.
| unsafe impl Sync for FFI_TaskContextProvider {} | ||
|
|
||
| struct TaskContextProviderPrivateData { | ||
| ctx: Weak<dyn TaskContextProvider>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly ok if it makes more sense to look at in follow up PR note here. Do you have a concrete code snippet example for the circular reference this breaks? I feel like a small comment with that might make this clearer for the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added more text to the README and the follow on PRs should make it more abundantly clear.
| } | ||
|
|
||
| unsafe extern "C" fn release_fn_wrapper(ctx: &mut FFI_TaskContext) { | ||
| let private_data = Box::from_raw(ctx.private_data as *mut TaskContextPrivateData); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
like in other PR, any chances of double free here for concurrent execution?
I can see
unsafe impl Send/Sync
but it is not guaranteed afaik?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FFI_TaskContext is not cloneable, so I don't know how that would happen. Maybe I'm missing something?
More generally, those other structs that are Clone I don't think could possibly double free because each one owns its own private data.
comphead
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @timsaucer I think it is good and in the same consistent way like other PRs, my more curious questions on double free and if you planning to comment all pubs, it might be beneficial for auto generated docs
|
|
||
| /// Internal data. This is only to be accessed by the provider of the plan. | ||
| /// The foreign library should never attempt to access this data. | ||
| pub private_data: *mut c_void, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure, but could this lead to a double free? for example, if FFI users clone the FFI_TaskContext, we might end up holding multiple references to private_data .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FFI_TaskContext does not implement Clone. Do you see somewhere else this could happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @timsaucer , I was thinking on explicitly forbidding cloning / copying, somthing like
impl !Clone for FFI_TaskContext {}
impl !Copy for FFI_TaskContext {}
I was thinking that worst case scenario , FFI users could do a memcopy of the struct , so that it would be somehow "safer" to have something like
pub struct FFI_TaskContext {
// An opaque pointer to an Arc<TaskContextPrivateData>
arc_ptr: *const Arc<TaskContextPrivateData>,
}
and then returning a raw pointer to inner data , something along these lines
pub fn as_opaque(&self) -> *const c_void {
self.arc_ptr as *const c_void
}
| let udf = <Arc<dyn ScalarUDFImpl>>::try_from(&kv_pair.1); | ||
|
|
||
| if let Err(err) = &udf { | ||
| log::error!("Unable to create WindowUDF in FFI: {err}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a question on the expected behavior here, let's say that we fail while attempting try_From , could not failing loud cause confusion/bugs of functions not found/missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an excellent question! In reviewing I discovered that I had some intermediate work where my conversion of FFI UDFs was fallible but the final versions were infallible! I updated the traits to use From<> instead of TryFrom<> so this is no longer an issue in this code snippet. Thank you for bringing it up!
867398f to
053ef21
Compare
|
Thank you for the reviews @ntjohnson1 @renato2099 @comphead ! I've rebased now that the other PR went in and made some adjustments based on the feedback. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @timsaucer for driving this and appreciate @renato2099 @ntjohnson1 for the review
Which issue does this PR close?
Addresses part of #18671 but does not close it.
Rationale for this change
We have an issue in the current FFI code in that we will need access to the
TaskContextto do function encoding and decoding to pass logical expressions across the boundary. We use the proto crate for this operations.We cannot encode and decode functions that have been registered since our current code creates a default session context. This causes problems in crates where we are using UDFs as inputs to either aggregations, window functions, or table filters.
With this change we keep a weak reference to a new trait, a
TaskContextProvider. By keeping a weak trait we make sure we do not create a circular dependency between function that is internally holding on to the provider, which holds the task context, which holds the function.What changes are included in this PR?
TaskContextProvidertraitTaskContextandTaskContextProvider.This PR does not use these structures in the current code. That is coming as part of a later PR in an effort to keep the size of the PRs small for effective code review.
Are these changes tested?
Unit tests are added. Coverage report:
Are there any user-facing changes?
No.