Skip to content

implement floor and ceil in assembly on i586 #976

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 27, 2025

Conversation

folkertdev
Copy link
Contributor

@folkertdev folkertdev commented Jul 13, 2025

fixes #837

The assembly is based on

Which both state

/*
 * Written by J.T. Conklin <jtc@NetBSD.org>.
 * Public domain.
 */

Which I believe means we're good in terms of licensing.

@tgross35
Copy link
Contributor

This is awesome, thank you for implementing it!

There is one problem, our libm MSRV unfortunately doesn't support naked functions. I'll be bumping it in the near future but I don't think it will be that high; how hard is this to do with inline asm? Probably adds a few instructions because the loads stack access becomes opaque, but that would still be faster than the slow existing implementation.

@tgross35
Copy link
Contributor

Fix for the red CI in #979 btw

@folkertdev
Copy link
Contributor Author

how hard is this to do with inline asm?

Well, I don't know this target and its assembly that well. So using global_asm! would be much easier (for me, anyway) and more straightforward to upgrade once the MSRV does support naked functions.

I think the only downside is that the symbols might be visible from the outside? Is that a problem?

@folkertdev
Copy link
Contributor Author

actually we can prevent even that with a trick that Björn showed me recently using sym

@folkertdev
Copy link
Contributor Author

folkertdev commented Jul 17, 2025

from what I can find this is a 2024 edition thing


error: extern blocks must be unsafe
  --> builtins-shim/../compiler-builtins/src/math/../../../libm/src/math/arch/i586.rs:23:1
   |
23 | / extern "cdecl" {
24 | |     fn ceil_helper(_: f64) -> f64;
25 | |     fn floor_helper(_: f64) -> f64;
26 | | }
   | |_^

but rust 1.63 won't compile unsafe extern right?

@tgross35
Copy link
Contributor

tgross35 commented Jul 17, 2025

I think the only downside is that the symbols might be visible from the outside? Is that a problem?

I think it may be, also conflicts (though we could make it weak). But this should be pretty easy to move to inline asm; I think something like this would work https://rust.godbolt.org/z/P98rKEW7P, which is reasonably close to the original (not sure if there's a way to turn pointers to locals into stack offsets, to avoid the leal s)

@folkertdev
Copy link
Contributor Author

not sure if there's a way to turn pointers to locals into stack offsets, to avoid the leal s

That's probably fine, we can fix that properly once naked functions are available


we now get errors like this?

  stderr ───
    Spaced Musl sinf arg 1/1: 800 iterations (800 total)

    thread 'musl_quickspace_sinf' panicked at libm-test/tests/compare_built_musl.rs:26:50:
    called `Result::unwrap()` on an `Err` value: 
        input:    (-239880100.0,)
        as hex:   (-0x1.c988f4p+27,)
        as bits:  (0xcd64c47a,)
        expected: -0.17352822            -0x1.6362c4p-3 0xbe31b162
        actual:   -0.17353132            -0x1.636464p-3 0xbe31b232

@quaternic
Copy link
Contributor

You're clobbering the saved control word on the stack with the modified one, so it isn't restored correctly.

I think there's an argument for using a fixed control word for the frndint, but I'll have to come back to this later.

Comment on lines 18 to 25
"movw %dx, ({cw_ptr})", // Apply cw
"fldcw ({cw_ptr})", // ...
Copy link
Contributor

@tgross35 tgross35 Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, yeah, as @quaternic mentioned this needs a second stack slot. As a microopt this could be let mut cw_stash = [0u16; 2]; and then these accesses can be -4({stash_ptr}) so we save a register vs. two different u16 locals

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually even better, let mut cw_stash = MaybeUninit::<[u16; 2]>::uninit(); to save the zero init instruction

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got this to work

https://rust.godbolt.org/z/44Tjcx3bb

the -4 stuff didn't work (also, for u16, why -4?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No idea, took that from the original asm without checking :)

@folkertdev
Copy link
Contributor Author

Somehow these changes introduce a panic somewhere??

error: linking with `cc` failed: exit status: 1
  |
  = note:  "cc" "-m32" "<1 object files omitted>" "-Wl,--as-needed" "-Wl,-Bstatic" "<sysroot>/lib/rustlib/i586-unknown-linux-gnu/lib/libcompiler_builtins-*.rlib" "-Wl,-Bdynamic" "-lgcc_s" "-lutil" "-lrt" "-lpthread" "-lm" "-ldl" "-lc" "-L" "/tmp/rustcBJvx8Y/raw-dylibs" "-Wl,--eh-frame-hdr" "-Wl,-z,noexecstack" "-L" "<sysroot>/lib/rustlib/i586-unknown-linux-gnu/lib" "-o" "/home/folkertdev/rust/compiler-builtins/target/i586-unknown-linux-gnu/release-opt/deps/libm-7098e2e2ad176b85" "-Wl,--gc-sections" "-pie" "-Wl,-z,relro,-z,now" "-Wl,-O1" "-Wl,--strip-debug" "-nodefaultlibs"
  = note: some arguments are omitted. use `--verbose` to show all linker arguments
  = note: ld: error: undefined symbol: 
          
          ERROR[no-panic]: detected panic in function `rem_pio2`
          
          >>> referenced by libm.ddf30f7f94fa857e-cgu.0
          >>>               /home/folkertdev/rust/compiler-builtins/target/i586-unknown-linux-gnu/release-opt/deps/libm-7098e2e2ad176b85.libm.ddf30f7f94fa857e-cgu.0.rcgu.o:(core::ops::function::FnOnce::call_once::h239ee4d4be0e829c)
          >>> referenced by libm.ddf30f7f94fa857e-cgu.0
          >>>               /home/folkertdev/rust/compiler-builtins/target/i586-unknown-linux-gnu/release-opt/deps/libm-7098e2e2ad176b85.libm.ddf30f7f94fa857e-cgu.0.rcgu.o:(core::ops::function::FnOnce::call_once::hf87b8ecfbb337da8)
          >>> referenced by libm.ddf30f7f94fa857e-cgu.0
          >>>               /home/folkertdev/rust/compiler-builtins/target/i586-unknown-linux-gnu/release-opt/deps/libm-7098e2e2ad176b85.libm.ddf30f7f94fa857e-cgu.0.rcgu.o:(libm::math::rem_pio2::rem_pio2::h2b1fed0a2d9377e5)
          
          ld: error: undefined symbol: 
          
          ERROR[no-panic]: detected panic in function `j1f`

The culprit is floor, but eh, wut? Maybe inlining no longer happens (sufficiently) with the inline assembly?

@tgross35
Copy link
Contributor

That's bizarre. The failure didn't show up with naked_asm right? I have no idea what could be different but that seems like an interesting thread to pull...

I'll try to look on Monday but it would be good to trace back where that panic is coming from, maybe an assert_unchecked can get rid of it (I'll figure out how to do the MSRV dance around that).

@beetrees
Copy link
Contributor

After doing a bit of testing, it appears that any inline ASM block that doesn't have the pure option causes no_panic to fail.

@tgross35
Copy link
Contributor

After doing a bit of testing, it appears that any inline ASM block that doesn't have the pure option causes no_panic to fail.

Is this bug in no_panic or is does codegen for non-pure asm somehow prevent eliminating an actual panic?

If it's the first case, I think it might be okay to work around this by updating use_arch_required at https://github.com/folkertdev/compiler-builtins/blob/8f43b0b5f568c688c3504f89097221c2d16749b4/libm/src/math/floor.rs and ceil to include not(assert_no_panic).

@folkertdev
Copy link
Contributor Author

I think it's a codegen/optimization issue. When I move the floor and ceil implementations into their own binary, and add #[no_panic], it fails in debug mode (unsurprising), but succeeds with --release. So it is not the assembly inherently.

https://godbolt.org/z/aad7TYaWf (with no_panic turned on locally, godbolt does not have that sadly)

On the other hand in the compiler-builtins codebase even a asm!("nop") is enough to trigger a panic apparently. Neither inline(always) nor inline(never) changes that.

As far as I can tell there is no actual panic there (see e.g. the godbolt)

@quaternic
Copy link
Contributor

I think there's an argument for using a fixed control word for the frndint, but I'll have to come back to this later.

So, what I had in mind there was that while the implementation from NetBSD is surely more correct in supporting unmasked exceptions, we really don't have that support anywhere in the language (except maybe naked functions) so a chunk of it is unnecessary.

I believe this should work just as well:
(using intel syntax)

extern "C" fn floor(mut x: f64) -> f64 {
    unsafe {
        core::arch::asm!(
            "fld qword ptr [{x}]",
            "fstcw [{x}]",
            "mov word ptr [{x} + 2], 0x077f",
            "fldcw [{x} + 2]",
            "frndint",
            "fldcw [{x}]",
            "fstp qword ptr [{x}]",
            x = in(reg) &mut x,
            out("st(0)") _, out("st(1)") _,
            out("st(2)") _, out("st(3)") _,
            out("st(4)") _, out("st(5)") _,
            out("st(6)") _, out("st(7)") _,
            options(nostack),
        );
    }
    x
}

https://godbolt.org/z/8b6a8Yxn1
It's a bit of a hack to reuse the memory of x for the control words, but since the memory operand is needed regardless, might as well.

@folkertdev
Copy link
Contributor Author

That implementation has the right behavior, but still runs into the panic issue.

@quaternic
Copy link
Contributor

quaternic commented Jul 20, 2025

Yeah, I didn't mean to imply it would do anything for that, and there's currently no way to make x87 inline assembly pure since that would require support for either memory operands or x87 register operands.


As for the panic, it seems that all the errors trace back to the same source: rem_pio2_large contains one call to floor, and inserting asm!("") next to that introduces the same no-panic errors.

Since that call looks to be replaceable with trunc, that would be a possible workaround. (Testing now: #982 )

@tgross35
Copy link
Contributor

Does making the functions extern "C" change things? I wonder if that could be the difference with the naked implementation, because of assumptions around unwinding with extern ABIs.

If that doesn't work, I think the change in #982 is fine if it gets us out of this issue.

I think it's a codegen/optimization issue. When I move the floor and ceil implementations into their own binary, and add #[no_panic], it fails in debug mode (unsurprising), but succeeds with --release. So it is not the assembly inherently.

Do you mean putting the definitions in a separate object file but still checking the panic in rem_pio2? If so, that could be falling into no_panic needing LTO

# Release with maximum optimizations, which is very slow to build. This is also
# what is needed to check `no-panic`.
[profile.release-opt]
inherits = "release"
codegen-units = 1
lto = "fat"

@folkertdev
Copy link
Contributor Author

Locally at least, for floor the combination of extern "C" with inline(never) gets rid of the panics. I'm assuming that is fine (e.g. the naked function version also would not get inlined).

Copy link
Contributor

@tgross35 tgross35 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few small nits but it looks good to me, @quaternic could you also review?

I do think it is okay to use #982 as well so we can allow floor to be inlined

Comment on lines 14 to 15
let mut cw_stash = core::mem::MaybeUninit::<u16>::uninit();
let mut cw_tmp = core::mem::MaybeUninit::<u16>::uninit();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be imported since it is used a few times?

} else {
return x;
// NOTE: this `inline(never)` is load-bearing for making functions that use it panic-free.
// Without it `rem_pio2_large` (and any function that uses it) will contain panics.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do they actually contain panics? Or is it a false positive that the pattern used by no_panic doesn't handle

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't been able to actually pinpoint where the reported panic really is. e.g. https://godbolt.org/z/TcEGeYos9 doesn't show any panics. So there is a chance that it's a limitation in no_panic.

@folkertdev folkertdev force-pushed the i586-floor-ceil branch 2 times, most recently from 515623f to 37b4f27 Compare July 22, 2025 19:28
@quaternic
Copy link
Contributor

Does making the functions extern "C" change things? I wonder if that could be the difference with the naked implementation, because of assumptions around unwinding with extern ABIs.

I did some more testing, and I think this might be an issue in rustc (well, more of a feature request). Functions containing inline asm with an ABI that allows unwinding are considered potentially unwinding (so don't get the nounwind attribute applied). Calls to such functions then generate landing pads and cause no_panic to fail. This is even if the asm is marked pure (as long as the result is used).

https://godbolt.org/z/x9cYPovMz

So my understanding is that in this case:

  • floor passes no_panic with either ABI, since it doesn't contain any calls.
  • rem_pio2_large passes no_panic if floor is extern "C" or gets inlined
  • However, e.g. rem_pio2f fails no_panic since the call to rem_pio2_large isn't inlined (since it's sufficiently complex and also called elsewhere), it is extern "Rust" (may unwind), and it contains the asm from the inlined floor.

So, to fix this specific case while still allowing inlining in the general case, the simplest fix that I could think of is to wrap the call with#[inline(never)] only in rem_pio2_large specifically.


I've had some doubts that substituting trunc there might not be correct in some edge case, so it's probably safer to not touch the original logic without more thought. I've repurposed #982 into the above idea. That's using #[cfg(x86_no_sse)] as the toggle, but it should probably match whatever is used to choose this implementation.


Other that that, this looks all good by me. Thanks for writing proper comments for the asm!

@tgross35
Copy link
Contributor

Thank you for looking into this more. Would you mind writing up a rust-lang/rust issue with your findings? That would be good to link from the fixme in #982.

@tgross35
Copy link
Contributor

Functions containing inline asm with an ABI that allows unwinding are considered potentially unwinding

Shouldn't extern "C" disallow unwinding?

I'm wondering what a solution would look like here. Are you thinking something like a nounwind asm option?

@quaternic
Copy link
Contributor

Created issue at rust-lang/rust#144518, there's some more context there.

extern "C" does disallow unwinding, and it's also currently just UB to unwind out of asm!. The problem here happens when the extern "C"-function is inlined into something with an ABI that does allow unwinding, and then that is called (and not inlined).

tgross35 pushed a commit that referenced this pull request Jul 27, 2025
Possible workaround for
#976 (comment)

Inline assembly in the body of a function currently causes the compiler
to consider that function possibly unwinding, even if said asm
originated from inlining an `extern "C"` function. This patch wraps the
problematic callsite with `#[inline(never)]`.

use super::super::fabs;
// FIXME: when the MSRV allows, use naked functions instead.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there actually any reason to do this now? Being able to inline this assembly should be advantageous, and from a quick look at godbolt it looks like this function is only 13 instructions (compared to 14 from the NetBSD implementations).

@tgross35
Copy link
Contributor

Okay! I think this just needs to rebase to drop the first commit, and drop the extern "C" (assuming no_panic works without it now) then should be all set.

@folkertdev
Copy link
Contributor Author

Cool, I've made those changes and CI is happy.

@tgross35 tgross35 merged commit a4f24dc into rust-lang:master Jul 27, 2025
38 checks passed
@tgross35
Copy link
Contributor

Thanks for making this happen!

Cc @programmerjake since this was your suggestion a long time ago

@tgross35
Copy link
Contributor

The output, for reference:

.section .text.ceil,"ax",@progbits
        .hidden ceil
        .weak   ceil
        .p2align        4
.type   ceil,@function
ceil:
        .cfi_startproc
        sub esp, 12
        .cfi_def_cfa_offset 16
        fld qword ptr [esp + 16]
        fstp qword ptr [esp]
        mov eax, esp
        #APP

        fld qword ptr [eax]
        wait
        fnstcw word ptr [eax]
        mov word ptr [eax + 2], 2943
        fldcw word ptr [eax + 2]
        frndint
        fldcw word ptr [eax]
        fstp qword ptr [eax]

        #NO_APP
        fld qword ptr [esp]
        add esp, 12
        .cfi_def_cfa_offset 4
        ret

.section .text.floor,"ax",@progbits
        .hidden floor
        .weak   floor
        .p2align        4
.type   floor,@function
floor:
        .cfi_startproc
        sub esp, 12
        .cfi_def_cfa_offset 16
        fld qword ptr [esp + 16]
        fstp qword ptr [esp]
        mov eax, esp
        #APP

        fld qword ptr [eax]
        wait
        fnstcw word ptr [eax]
        mov word ptr [eax + 2], 1919
        fldcw word ptr [eax + 2]
        frndint
        fldcw word ptr [eax]
        fstp qword ptr [eax]

        #NO_APP
        fld qword ptr [esp]
        add esp, 12
        .cfi_def_cfa_offset 4
        ret

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Replace i586 ceil and floor implementations with assembly to fix +/-0
4 participants