Dr-Emann
Collaborator

@Dr-Emann Dr-Emann commented May 21, 2025

Builds on #57

Closes #61

Currently still working, but opening as a draft for early looks

Dr-Emann and others added 21 commits October 17, 2023 21:33
Remove unneeded `extern crate`s
Also, add some benchmarks against memchr
Add a test that looks for the first item in a long haystack
The memmap crate is unmaintained; use the maintained memmap2 crate instead
Structs don't need the bounds, only the implementations
Mostly just adding #[must_use]
This speeds up the criterion benchmarks by almost 2x

I believe this is needed because e.g. Bytes::find is inlined, and calls `find`
generically, which will call PackedCompareControl methods. So the code calling
the methods will be inlined into the calling crate, but the implementations of
the PackedCompareControl methods are not accessible to the code in the calling
crate, so they will end up as actual function calls. However, these functions
are _super_ simple, and inlining them helps a LOT, so adding `#[inline]` to
these functions, and making their implementation available to calling crates,
has a huge effect.

This was only seen when moving to criterion because previously, nightly
benchmarks were implemented in the library crate itself, and so these functions
were already eligible for inlining. Criterion results were actually more
accurate to what callers of the crate would actually see!
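
A minimal sketch of the situation described above. The trait and type names (`Control`, `SingleByte`) are hypothetical stand-ins, not the crate's actual `PackedCompareControl` API; the point is only where `#[inline]` goes.

```rust
// Hypothetical stand-in for a control trait used by a generic search routine.
pub trait Control {
    fn matches(&self, byte: u8) -> bool;
}

pub struct SingleByte(pub u8);

impl Control for SingleByte {
    // This method is not generic, so without `#[inline]` (or LTO) its body is
    // typically not offered to downstream crates for inlining, and the call
    // below can end up as a real function call per byte.
    #[inline]
    fn matches(&self, byte: u8) -> bool {
        byte == self.0
    }
}

// Generic over `C`, so this function is instantiated in the calling crate.
// The calls to `control.matches` can only be inlined there if the body of
// `matches` is available across the crate boundary.
#[inline]
pub fn find<C: Control>(control: &C, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| control.matches(b))
}
```

A caller in another crate would write something like `find(&SingleByte(b'c'), b"abc")`. Benchmarks living in a separate crate (as with criterion) exercise exactly this cross-crate path.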
Per suggestion from @BurntSushi [here](tafia/quick-xml#664 (comment))

On my M1, it appears to be slower but competitive with memchr up to memchr3,
then starts being the faster option from 5-16
We may not want to be stuck with const-constructable implementations
Move the simd-only tests to the top level

This allows testing even when sse4.2 isn't enabled: when it is
available, it will still test the simd implementation, but will test the
fallback otherwise.
This moves mentions of "simd" to be x86 specific. Also, do everything
with #[cfg], rather than requiring custom cfgs populated in the build.rs
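
A rough sketch of selecting a backend purely with built-in `#[cfg]` predicates, with no custom cfgs set from build.rs. The module layout and the names `backend`, `find_any`, and `backend_name` are assumptions for illustration, and the scalar body stands in for the real SIMD code.

```rust
// Shared scalar search; in this sketch every backend forwards to it.
fn scalar_find_any(haystack: &[u8], needles: &[u8]) -> Option<usize> {
    haystack.iter().position(|b| needles.contains(b))
}

#[cfg(all(any(target_arch = "x86", target_arch = "x86_64"), target_feature = "sse4.2"))]
mod backend {
    pub const NAME: &str = "x86 sse4.2";
    // Real code would use the SSE4.2 string instructions here.
    pub fn find_any(h: &[u8], n: &[u8]) -> Option<usize> {
        super::scalar_find_any(h, n)
    }
}

#[cfg(all(target_arch = "aarch64", target_feature = "neon"))]
mod backend {
    pub const NAME: &str = "aarch64 neon";
    // Real code would use NEON intrinsics here.
    pub fn find_any(h: &[u8], n: &[u8]) -> Option<usize> {
        super::scalar_find_any(h, n)
    }
}

#[cfg(not(any(
    all(any(target_arch = "x86", target_arch = "x86_64"), target_feature = "sse4.2"),
    all(target_arch = "aarch64", target_feature = "neon")
)))]
mod backend {
    pub const NAME: &str = "fallback";
    pub fn find_any(h: &[u8], n: &[u8]) -> Option<usize> {
        super::scalar_find_any(h, n)
    }
}

// Top-level entry points: tests written against these run on every target and
// exercise whichever backend was compiled in.
pub fn find_any(haystack: &[u8], needles: &[u8]) -> Option<usize> {
    backend::find_any(haystack, needles)
}

pub fn backend_name() -> &'static str {
    backend::NAME
}
```

Because the tests only call the top-level functions, the same test suite covers the SIMD path when the target feature is enabled and the fallback otherwise.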
This includes pretty frequent instances
For aarch64, we can do quite a bit better than just calling the `find`
function repeatedly: we build a 64-bit bitset recording, for each of 64
haystack bytes, whether it matches the set of bytes we're looking for. We can
then efficiently iterate over the set bits.
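
The bitset idea can be sketched roughly as follows. This is not the PR's NEON code: `match_mask` is a scalar stand-in for the vector comparison, and all names here are hypothetical. What it illustrates is the iteration over set bits with `trailing_zeros`.

```rust
/// Builds a 64-bit mask where bit `i` is set if `chunk[i]` is one of `needles`.
/// Scalar stand-in for what a SIMD comparison would produce for 64 bytes.
fn match_mask(chunk: &[u8], needles: &[u8]) -> u64 {
    assert!(chunk.len() <= 64);
    let mut mask = 0u64;
    for (i, b) in chunk.iter().enumerate() {
        if needles.contains(b) {
            mask |= 1u64 << i;
        }
    }
    mask
}

/// Iterates over the indices of set bits, lowest first.
struct SetBits(u64);

impl Iterator for SetBits {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        if self.0 == 0 {
            return None;
        }
        let idx = self.0.trailing_zeros();
        // Clear the lowest set bit so the next call finds the following match.
        self.0 &= self.0 - 1;
        Some(idx)
    }
}

/// Collects the positions of all matching bytes, 64 haystack bytes at a time.
fn find_all(haystack: &[u8], needles: &[u8]) -> Vec<usize> {
    let mut out = Vec::new();
    for (chunk_idx, chunk) in haystack.chunks(64).enumerate() {
        let base = chunk_idx * 64;
        for bit in SetBits(match_mask(chunk, needles)) {
            out.push(base + bit as usize);
        }
    }
    out
}
```

Each 64-byte chunk is scanned once to build the mask, and every further match in that chunk costs only a `trailing_zeros` and a bit clear rather than a fresh search.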

It may be possible to do something similar in the x86 simd
implementation.
@Dr-Emann
Collaborator Author

rust-lang/rust#127481

Looks like the unstable Pattern api changed

@kivikakk
Fwiw, I gave this a go in a hot loop in Comrak, and it works really well!

Development

Successfully merging this pull request may close these issues.

aarch64 simd implementation