Conversation

@BlobCodes (Contributor)
Related to #15088

This PR adds fiber-local storage (FLS) functionality, based on #15616 (comment).


I'm sure this PR still needs some work, especially regarding non-TLS platforms like OpenBSD and Android.

Also, there's probably some discussion needed about the public API.


Fibers created in Crystal use a stack-allocated FLS, while external threads use a fallback that allocates the FLS via the GC.


To improve FLS performance, I added an unsafe option to the ThreadLocal annotation.
Thread-local variables are normally slow because fibers can spontaneously switch threads in many situations, so LLVM can never be allowed to make any assumptions about TLS variables across those points.

Since the FLS section is restored whenever a thread resumes a fiber, this is not a problem here.

The unsafe option removes this safety measure while retaining backwards compatibility.
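
As a rough illustration of the difference (a sketch; the option spelling is an assumption, and note that @[ThreadLocal] is currently restricted to the standard library, so this only compiles there):

class Example
  # A plain thread-local: LLVM must re-derive the TLS address on every
  # access, because the fiber may have migrated to another thread in between.
  @[ThreadLocal]
  @@safe_counter = 0

  # With the proposed unsafe option, LLVM may cache the TLS address.
  # This is only sound for data that is restored whenever a thread
  # resumes a fiber, like the FLS section.
  @[ThreadLocal(unsafe: true)]
  @@fls_section = Pointer(Void).null
end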


Example Code
class A
  fiber_local my_fls_var : Int32 = 1

  def self.incr
    puts self.my_fls_var
    self.my_fls_var += 1
  end
end

spawn do
  sleep 3
  A.incr # => 1
  A.incr # => 2
  A.incr # => 3
end

spawn do
  sleep 2
  A.incr # => 1
  A.incr # => 2
end

A.incr # => 1
A.incr # => 2
sleep 5

@Qard

Very interesting!

Hi there, I'm the maintainer of AsyncLocalStorage, the Node.js equivalent of this. 👋🏻

I've put a lot of thought into context flow in the semantics of JS, but things are a bit different with Fibers. In JS we have a lot of branching execution via promises and callbacks, with merging execution via Promise.all(...). Crystal similarly has clear branching points with spawn calls, but merges are not really a semantically relevant thing, other than maybe fiber.resume, which could technically be used in a user-facing way to force a specific fiber context switch. Beyond that, there are not many clear merge points, though that could change with the possibility of TaskGroup semantics.

I see two additions which could be useful to consider.

The first is a way for the initializer to be able specify to inherit what the context was where the initial spawn call for that Fiber occurred. This would be very useful for passing spans for application tracing through the system. Observability vendors use this capability extensively with AsyncLocalStorage in Node.js to get a parent span to link to when constructing a new span.
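
As a minimal sketch of that flow with the API proposed here (spawn_with_context is a hypothetical helper, not part of this PR):

class Trace
  fiber_local span_id : Int32 = 0
end

# Capture the spawner's fiber-local and seed it into the child fiber,
# so the child inherits the context instead of starting from the default.
def spawn_with_context(&block)
  parent_span = Trace.span_id # read in the spawning fiber
  spawn do
    Trace.span_id = parent_span
    block.call
  end
end

Trace.span_id = 42
spawn_with_context { puts Trace.span_id } # => 42 (inherited)
Fiber.yield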

The other interesting flow I see, though more nebulous presently, is a way to capture the merging flow. If we get something like TaskGroup/WaitGroup semantics where a group of fibers run to completion before continuing then it could be valuable to have a way to join the related contexts and communicate them forward in some way whenever one of these merge points occurs.

Technically ExecutionContext already basically provides TaskGroup semantics, though it's intended more as a singleton, living until the end of the process. It might not be super valuable there to perform some sort of context join, but if more discrete TaskGroup constructs were available, it may be useful to model such joining, since the semantics of that join would extend out to user-visible behaviour, which is essentially what context flow is intended to model.

In any case, this is very cool to see! I'm excited to see this capability land. 🙂

@straight-shoota (Member)

Thanks for pushing this forward.

I'm not sure if this solution is the best path forward, due to my concerns in #15616 (comment). As far as I understand, this mechanism only works for class variables.
Perhaps that's fine if there are no good use cases for fiber-local storage on other kinds of variables, though.

The unsafe TLS implementation looks interesting and could be worth a consideration on its own, independently of the FLS.
Do you think we could use unsafe mode for Thread.@@current_thread as well?

@straight-shoota (Member)

@Qard Your insights are very welcome. Especially the practical experiences.

WaitGroup is already a thing. More advanced methods of structuring concurrency are discussed in #6468.

I'm wondering what would happen at a merge point. Does that mean the values set in different strands of execution would be merged into the parent context? How would collisions be handled, though?

@Qard commented Jun 10, 2025

Oh, I missed that. I just recalled discussions of it but never saw it had landed. 🙂

As for merge points, I don't think merging should replace the context value where a merge point is reached; having some separate way to acquire the list of context values at the point where a merge occurs would be enough.
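
For illustration, a sketch of surfacing the branch contexts at a merge point with today's primitives (using stdlib's WaitGroup and the fiber_local macro proposed in this PR):

require "wait_group"

class Ctx
  fiber_local label : String = "unset"
end

contexts = Channel(String).new(2)
wg = WaitGroup.new(2)

2.times do |i|
  spawn do
    Ctx.label = "fiber-#{i}"
    contexts.send(Ctx.label) # publish this branch's context value
    wg.done
  end
end

wg.wait # the merge point
merged = Array.new(2) { contexts.receive }
p merged # e.g. ["fiber-0", "fiber-1"]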

Merges are less important in my opinion, though. For the purpose of enabling observability products, enabling context flow through branching points is the feature being used right now to support application tracing. Merge points are not presently handled, as no context manager yet has a solution for that flow.

Presently, tracing tools don't really visualize merge points correctly. They generally pick a winner from the list of contexts, or don't link back to the merge points at all and instead flatten everything up to the root span of the trace, which loses a lot of causality data. This is a problem that would be good to solve, but one that no language has solved yet.

@Qard commented Jun 11, 2025

> I'm not sure if this solution is the best path forward, due to my concerns in #15616 (comment). As far as I understand, this mechanism only works for class variables.
>
> Perhaps that's fine if there are no good cases for using fiber local storage on other kinds of variables, though.

I think that's probably fine. You can always make a container class purely for context in other scenarios. That's essentially what Node.js does: you make AsyncLocalStorage instances which function as map keys to retrieve the corresponding context data from the singular internal AsyncContextFrame map.

@ysbaddaden (Contributor) commented Jun 12, 2025

> The unsafe TLS implementation looks interesting (...) Do you think we could be able to use unsafe mode for Thread.@@current_thread as well?

No. It only works for fiber locals. Even after a fiber or thread context switch, a pointer to a fiber local from before the switch still points to the same data after the fiber is resumed, even if it is resumed by another thread.

That can't work for other thread locals: when a fiber is resumed by another thread, LLVM would reuse the TLS pointers from before the switch, which could yield the wrong Thread instance, for example.
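
A sketch of the hazard (illustrative, not code from this PR):

def wrong_if_tls_were_cached
  before = Thread.current # TLS read while running on thread A
  Fiber.yield             # the fiber may be resumed by thread B here
  after = Thread.current  # must be a fresh TLS read: reusing a cached
                          # TLS pointer would return thread A's instance
  {before, after}
end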

# If crystal code is called from an external thread
# created from an external library, we may need to
# create a FLS dynamically using the GC.
include Crystal::PointerLinkedList::Node
Contributor

I understand that you allocate the FLS for main fibers on the heap on demand, but I don't see why you need the linked list. You already assign the pointer to Fiber.current.fls, so the GC will see the live reference and thus won't collect the FLS, but it will collect it after the thread has terminated (once we've lost the reference to the Fiber) 🤔

Contributor Author (BlobCodes)

I'm not familiar with the current design of Fiber/Thread, but I assumed they may have allocation optimizations, like placing the Fiber class at the fiber's stack bottom.

This could maybe be simplified.

container = GC.malloc(sizeof(self))

# There's no pointer-based unsafe_construct for struct
container.as(self*).value = self.new
Contributor

It would indeed be nice to have #pre_initialize for structs too, not just references. But we can call ptr.as(self).value.initialize, and we can redefine the #initialize method for each fiber local with a default value, then call previous_def to chain the previous initializers. For example:

struct Foo
  def self.create
    foo = GC.malloc(sizeof(self)).as(self*)
    foo.value.initialize
    foo
  end
end

struct Foo
  # must be 'uninitialized' otherwise the compiler complains (not initialized in all initializers)
  @one = uninitialized Int32

  def initialize
    @one = 123
  end
end

struct Foo
  @two = uninitialized Int32

  def initialize
    previous_def
    @two = 456
  end
end

foo = Foo.create
p foo.value # => Foo(@one=123, @two=456)

Member

If you rename the first def to pre_initialize and call it explicitly from create, you get the same behaviour, but without the need for previous_def.
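
Roughly like this (a sketch of that variant, showing a single definition):

struct Foo
  @one = uninitialized Int32

  def pre_initialize
    @one = 123
  end

  def self.create
    foo = GC.malloc(sizeof(self)).as(self*)
    foo.value.pre_initialize # explicit call; no previous_def chain
    foo
  end
end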

Contributor

Or we can just use a class with ReferenceStorage and .unsafe_construct, and it would also work:

class FLS
  # ...
end

fls = uninitialized ReferenceStorage(FLS)
@fls = FLS.unsafe_construct(pointerof(fls))

Bonus: a mere FLS.new will allocate an instance in the GC heap.

Contributor

Bonus: we can simply use FLS references instead of pointers, and use a property %var with direct accessors in the generated getter/setter methods.

Member

Yes, this sounds like a nice simplification. Since we have ReferenceStorage, there should be little reason to keep a struct type that you only ever pass around via pointers.

Contributor Author (BlobCodes)

> But, we can call ptr.as(self).value.initialize

The struct currently has an empty initialize method, and AFAIK the default ivar values are assigned in the .allocate method.

> Or we can just use a class

class has a constant overhead of 4 (?) bytes because of its type id field.
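
For reference, the difference shows up in sizeof/instance_sizeof (a quick illustrative check, not from the PR):

struct SPayload
  @value = 0
end

class CPayload
  @value = 0
end

p sizeof(SPayload)          # => 4 (just the Int32)
p instance_sizeof(CPayload) # => 8 (4-byte type id + the Int32)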

I wanted to optimize the FLS away in case it's not used, so I used struct.

I wish we had "naked" classes or something like that.

Also, ReferenceStorage needs a modern Crystal version, and we want to be able to build a new Crystal compiler using an old version of the compiler.

Member

> class has a constant overhead of 4 (?) bytes because of its type id field.
>
> I wanted to optimize the FLS away in case it's not used, so I used struct.

True, that's a noticeable difference between the two. But I'm not sure those extra 4 bytes are that relevant.

Type IDs are only used in some specific situations. If we can be sure that never happens for this type, it should be technically possible, though unsafe of course, to reduce the size. We could allocate 4 bytes less and decrement the pointer address by 4 before casting to the class type 🤷

ReferenceStorage is a new feature, but uninitialized UInt8[instance_sizeof(T)] should work as a polyfill. Missing pre_initialize is a bigger problem and would require something like the initialize chaining mentioned in the first comment of this thread.

The difference between struct or class is probably not that important. It's fine either way, as long as it's somewhat reasonable. We can reconsider the details later, once we have merged an MVP.

@ysbaddaden (Contributor) commented Jun 13, 2025

Not exactly: uninitialized UInt8[instance_sizeof(T)] is missing alignment, so it must reserve alignof(T) more bytes and manually align the pointer; hence why we have ReferenceStorage.

IMO it would be just fine to always allocate in the heap. In addition to all fibers behaving the same, it's a single malloc per fiber, and reserving/initializing the space can be delayed until it's needed, i.e. when we first access an FLS value, if the fiber ever does.
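
A sketch of that approach (FLS here is a stand-in class, not the PR's actual type):

class FLS
  property counter = 0
end

class Fiber
  @fls : FLS?

  # one malloc per fiber, deferred until the first FLS access
  def fls : FLS
    @fls ||= FLS.new
  end
end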

@straight-shoota (Member)

After familiarizing myself with the concept a bit more, I understand. Yeah, it makes sense that we cannot use unsafe for Thread.current. 🤦

But I suppose we could use it to optimize Fiber.current. It is currently implemented as Thread.current.current_fiber, so it has the safe TLS overhead. But it should be able to use unsafe TLS just like the FLS struct in this patch.
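
Something along these lines, presumably (a sketch; the variable name and option spelling are assumptions):

class Fiber
  # set right before every swapcontext, so even a cached TLS access
  # observes a value that was restored for the resuming fiber
  @[ThreadLocal(unsafe: true)]
  @@current_fiber : Fiber?

  def self.current : Fiber
    @@current_fiber.not_nil!
  end
end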

@ysbaddaden (Contributor) commented Jun 12, 2025

In general, we'd benefit from having direct thread locals instead of going through instance variables on Thread.current.

For Fiber.current, I guess you're right. It's actually a fiber local, and as long as we set it before swapcontext (like we set the fls pointer in this PR), it should be fine 🤔

@BlobCodes (Contributor Author)

> After familiarizing with the concept a bit more, I understand. Yeah, makes sense that we cannot use unsafe for Thread.current. 🤦

We could use it for Thread.current by using the same trick codegen currently uses: @[NoInline].

Since Thread.current accesses TLS multiple times (it lazily initializes the Thread instance and may need to write to TLS), we could replace a method making multiple noinline calls with a single noinline call that reuses the TLS pointer.
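
Illustratively (not the stdlib's actual code; compute_initial is a hypothetical placeholder):

class Example
  @[ThreadLocal]
  @@value : Int32?

  # The whole lazy-init sequence (read, maybe write) sits behind one
  # non-inlined call, so the TLS address is derived once inside it and
  # cannot be cached across a fiber switch by the caller.
  @[NoInline]
  def self.value : Int32
    @@value ||= compute_initial
  end

  private def self.compute_initial : Int32
    42 # hypothetical default
  end
end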

@straight-shoota (Member)

I've done some exploration of existing usage patterns for fiber local storage. Discussion in https://forum.crystal-lang.org/t/field-study-of-fiber-local-storage/8325/1
