gh-143658: importlib.metadata: Use str.translate to improve performance of importlib.metadata.Prepared.normalized
#143660
base: main
Changes from all commits: 76e2272, 00c76fc, c1bb3cc, 0eae552, 1657880, ee0e6aa, 7b7f9a8
```diff
@@ -890,6 +890,14 @@ def search(self, prepared: Prepared):
         return itertools.chain(infos, eggs)
 
 
+# Translation table for Prepared.normalize: lowercase and
+# replace "-" (hyphen) and "." (dot) with "_" (underscore).
+_normalize_table = str.maketrans(
+    "ABCDEFGHIJKLMNOPQRSTUVWXYZ-.",
+    "abcdefghijklmnopqrstuvwxyz__",
+)
+
+
 class Prepared:
     """
     A prepared search query for metadata on a possibly-named package.

@@ -925,7 +933,13 @@ def normalize(name):
         """
         PEP 503 normalization plus dashes as underscores.
         """
-        return re.sub(r"[-_.]+", "-", name).lower().replace('-', '_')
+        # Emulates ``re.sub(r"[-_.]+", "-", name).lower()`` from PEP 503
```
|
Contributor
@hugovk I did a quick scan of the 8.34M package names, and 3.17M are purely lowercase with no separators. Given that, I tried adding a fast-path check here before applying the translation table and found strong improvements in the benchmark. I think the most readable version of the fast path would be:
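The contributor's actual snippet was not captured in this page, but a plausible reconstruction of the fast path being described (the function name and the exact guard condition here are assumptions, not the PR's code) would skip the translation entirely when the name is already lowercase and separator-free:

```python
# Hypothetical sketch of the proposed fast path; the real snippet is not
# shown above, so the structure here is an assumption for illustration.
_normalize_table = str.maketrans(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ-.",
    "abcdefghijklmnopqrstuvwxyz__",
)

def normalize_with_fast_path(name):
    # Roughly 38% of PyPI names are already lowercase alphanumeric,
    # so return them untouched and skip translate() and the loop.
    if name.isalnum() and name.islower():
        return name
    value = name.translate(_normalize_table)
    while "__" in value:
        value = value.replace("__", "_")
    return value

print(normalize_with_fast_path("pillow"))          # fast path taken
print(normalize_with_fast_path("zope.interface"))  # falls through to translate
```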
Member
I don't think it's worth it. What Hugo suggested is readable enough.
Contributor
That's not an unreasonable position. My reasoning was that a significant portion of packages (roughly 38%) are already alphanumeric and lowercase. The fast path allows skipping the translation and loop overhead for the most common case. I felt the performance gain for those users justified the small increase in complexity, but I'm happy to defer to your preference on the balance between speed and code footprint.
Member
How much of a performance gain are we talking about, though?
Member
Author
They're very close. If anything, the "fast path" seems to be a bit slower :)

```
❯ # main
❯ ./python.exe -m timeit -s "from importlib.metadata import Prepared" "Prepared.normalize('pillow')"
1000000 loops, best of 5: 390 nsec per loop
❯ ./python.exe -m timeit -s "from importlib.metadata import Prepared" "Prepared.normalize('pillow')"
1000000 loops, best of 5: 393 nsec per loop

❯ # PR
❯ ./python.exe -m timeit -s "from importlib.metadata import Prepared" "Prepared.normalize('pillow')"
5000000 loops, best of 5: 95.8 nsec per loop
❯ ./python.exe -m timeit -s "from importlib.metadata import Prepared" "Prepared.normalize('pillow')"
5000000 loops, best of 5: 96 nsec per loop

❯ # fast path
❯ ./python.exe -m timeit -s "from importlib.metadata import Prepared" "Prepared.normalize('pillow')"
5000000 loops, best of 5: 94.3 nsec per loop
❯ ./python.exe -m timeit -s "from importlib.metadata import Prepared" "Prepared.normalize('pillow')"
5000000 loops, best of 5: 97.5 nsec per loop
```
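The same comparison can be reproduced without a patched interpreter by benchmarking both implementations side by side in one process. This is a sketch: `normalize_regex` and `normalize_translate` are standalone reconstructions of the old and new code from the diff, and absolute timings will vary by machine.

```python
import re
import timeit

_normalize_table = str.maketrans(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ-.",
    "abcdefghijklmnopqrstuvwxyz__",
)

def normalize_regex(name):
    # Pre-PR behaviour: PEP 503 normalization plus dashes as underscores
    return re.sub(r"[-_.]+", "-", name).lower().replace("-", "_")

def normalize_translate(name):
    # Post-PR behaviour: translation table plus a collapse loop
    value = name.translate(_normalize_table)
    while "__" in value:
        value = value.replace("__", "_")
    return value

n = 100_000
t_regex = timeit.timeit(lambda: normalize_regex("pillow"), number=n)
t_translate = timeit.timeit(lambda: normalize_translate("pillow"), number=n)
print(f"regex:     {t_regex / n * 1e9:.1f} ns/loop")
print(f"translate: {t_translate / n * 1e9:.1f} ns/loop")
```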
Contributor
Running a slightly modified version of the benchmark above (timeit, best of 3) on my MacBook (Apple Silicon), main branch with a debug build of CPython:

So for me, about an 18.4% reduction in total time (or a +22% speedup) on the full PyPI benchmark.
Contributor
Ah, we posted at about the same time. Interesting results on my end compared to yours, but I'd consider yours canonical (especially since I'm on a debug build with the extra overhead), so please disregard my comments, @hugovk :)
```diff
+        # About 3x faster, safe since packages only support alphanumeric characters
+        value = name.translate(_normalize_table)
+        # Condense repeats (faster than regex)
+        while "__" in value:
+            value = value.replace("__", "_")
+        return value
 
 
     @staticmethod
     def legacy_normalize(name):
```
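Since the change swaps a regex substitution for a translation table plus a collapse loop, a quick property check (a sketch for illustration, not part of the PR) can confirm the two formulations agree on representative package names:

```python
import re

_normalize_table = str.maketrans(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ-.",
    "abcdefghijklmnopqrstuvwxyz__",
)

def normalize_old(name):
    # Original implementation, per the removed line in the diff above
    return re.sub(r"[-_.]+", "-", name).lower().replace("-", "_")

def normalize_new(name):
    # New implementation, per the added lines in the diff above
    value = name.translate(_normalize_table)
    while "__" in value:
        value = value.replace("__", "_")
    return value

samples = [
    "Pillow", "zope.interface", "ruamel.yaml", "typing-extensions",
    "A--B..C__d", "jaraco.functools", "pytest_cov",
]
for name in samples:
    assert normalize_old(name) == normalize_new(name), name
print("all equal")
```

The mixed-separator case `"A--B..C__d"` exercises the collapse loop: both versions reduce it to `a_b_c_d`.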
```diff
@@ -0,0 +1,3 @@
+:mod:`importlib.metadata`: Use :meth:`str.translate` to improve performance of
+:meth:`!importlib.metadata.Prepared.normalize`. Patch by Hugo van Kemenade and
+Henry Schreiner.
```