Small micro optimisations in w3lib.http #246


Merged (7 commits) on Jul 27, 2025

Conversation


@abebus abebus commented Jul 26, 2025

This PR includes some small micro-optimizations in w3lib.http. They're not huge changes, but they help improve performance a bit here and there.

I'm really thankful for all the amazing work done by the Scrapy team and contributors — I’ve learned a lot from the ecosystem. This PR is my small way of saying thanks.

I'm currently very into micro-optimizations, but I hope to keep contributing more (and maybe more useful things!) over time.

Also, please let me know: should I add a pytest-benchmark suite, or is a simple synthetic benchmark per PR enough?

Thanks for your time


abebus commented Jul 26, 2025

I used the following script for benchmarking:

import timeit

from w3lib.http import headers_dict_to_raw, headers_raw_to_dict

# Sample headers for testing
headers_raw = (
    b"Content-Type: text/html\r\n"
    b"Accept: gzip\r\n"
    b"Cache-Control: no-cache\r\n"
    b"X-Test: value1\r\n"
    b"X-Test: value2\r\n"
)

headers_dict = {
    b"Content-Type": [b"text/html"],
    b"Accept": [b"gzip"],
    b"Cache-Control": [b"no-cache"],
    b"X-Test": [b"value1", b"value2"],
}

# Benchmark wrappers
def benchmark_headers_raw_to_dict():
    headers_raw_to_dict(headers_raw)

def benchmark_headers_dict_to_raw():
    headers_dict_to_raw(headers_dict)

# Run benchmarks
if __name__ == "__main__":
    raw_to_dict_time = timeit.timeit(benchmark_headers_raw_to_dict, number=1_000_000)
    dict_to_raw_time = timeit.timeit(benchmark_headers_dict_to_raw, number=1_000_000)

    print("Benchmark results (1,000,000 runs):")
    print(f"headers_raw_to_dict: {raw_to_dict_time:.4f} seconds")
    print(f"headers_dict_to_raw: {dict_to_raw_time:.4f} seconds")

On my machine, here are the best results (out of 10 runs):

Before changes:

headers_raw_to_dict: 2.4984 seconds
headers_dict_to_raw: 2.9800 seconds

After changes:

headers_raw_to_dict: 2.2313 seconds
headers_dict_to_raw: 2.1083 seconds


@Gallaecio Gallaecio left a comment


Nice

@Gallaecio Gallaecio requested a review from wRAR July 26, 2025 18:24

abebus commented Jul 27, 2025

In my latest commit, headers_raw_to_dict now completes in approximately 2.004 seconds in the benchmark script I provided. I believe the speedup comes from reducing Python function calls by using BytesIO directly, which delegates iteration to the underlying C implementation.
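The BytesIO approach can be sketched roughly like this (a simplified illustration under my own assumptions, not the exact PR code; the function name and edge-case handling here are hypothetical):

```python
from io import BytesIO


def headers_raw_to_dict_sketch(headers_raw: bytes) -> dict[bytes, list[bytes]]:
    # Iterating a BytesIO splits on b"\n" inside the C implementation,
    # avoiding a Python-level splitlines() call and its list allocation.
    headers: dict[bytes, list[bytes]] = {}
    for line in BytesIO(headers_raw):
        key, sep, value = line.partition(b":")
        if sep:  # skip lines without a colon
            headers.setdefault(key.strip(), []).append(value.strip())
    return headers
```

The `.strip()` calls also take care of the trailing `\r` left on each line, since BytesIO iteration only splits on `\n`.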


codecov bot commented Jul 27, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.96%. Comparing base (5423e0a) to head (a5d30c0).
⚠️ Report is 8 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #246      +/-   ##
==========================================
+ Coverage   97.92%   97.96%   +0.03%     
==========================================
  Files           9        9              
  Lines         483      491       +8     
  Branches       78       83       +5     
==========================================
+ Hits          473      481       +8     
  Misses          6        6              
  Partials        4        4              
Files with missing lines Coverage Δ
w3lib/http.py 100.00% <100.00%> (ø)


wRAR commented Jul 27, 2025

I don't have experience with pytest-benchmark, but if you have any specific ideas you are welcome to implement them; it's fine to do that as a separate PR.


abebus commented Jul 27, 2025

Okay, will do, but I think this might require a major refactor of the tests or lead to quite a bit of code duplication, unfortunately... As a reference I'll use the aiohttp benchmarking suite; they also use CodSpeed for generating nice benchmarking reports.

…uce time of `headers_dict_to_raw` to 1.7883 seconds in benchmark
w3lib/http.py Outdated
if not headers_dict:
return b""

parts = b""

@abebus abebus Jul 27, 2025


Surprisingly, concatenating plain bytes is noticeably faster than using bytearray. I believe this is thanks to Python’s "new" adaptive interpreter.


Well, it's only faster with relatively small headers (see #247). I had incorrectly assumed that bytes would be faster due to adaptive-interpreter optimizations, but now I believe the real reason is that str and bytes objects aren't always reallocated on each concatenation, as explained here.
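The comparison can be reproduced with a small timeit experiment (an illustrative sketch; absolute numbers will vary by machine, payload size, and Python version):

```python
import timeit

PARTS = [b"Name: value\r\n"] * 10  # small, header-like payload


def concat_bytes() -> bytes:
    # For small payloads CPython can often resize the bytes object in
    # place when its refcount is 1, so repeated += avoids full copies.
    out = b""
    for part in PARTS:
        out += part
    return out


def concat_bytearray() -> bytes:
    buf = bytearray()
    for part in PARTS:
        buf += part
    return bytes(buf)


if __name__ == "__main__":
    for fn in (concat_bytes, concat_bytearray):
        t = timeit.timeit(fn, number=100_000)
        print(f"{fn.__name__}: {t:.4f}s")
```

With a much larger `PARTS` list the in-place resize is no longer guaranteed, which is consistent with the bytearray approach winning on bigger headers.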


abebus commented Jul 27, 2025

So, summing up, the previous benchmark times were:

headers_raw_to_dict: 2.4984 seconds
headers_dict_to_raw: 2.9800 seconds

And now:

headers_raw_to_dict: 2.0040 seconds
headers_dict_to_raw: 1.7675 seconds

So, headers_raw_to_dict is approximately 20% faster, and headers_dict_to_raw is about 40% faster.
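The quoted percentages can be checked directly from the raw timings:

```python
before = {"headers_raw_to_dict": 2.4984, "headers_dict_to_raw": 2.9800}
after = {"headers_raw_to_dict": 2.0040, "headers_dict_to_raw": 1.7675}

for name in before:
    # Relative improvement: (old - new) / old
    speedup = (before[name] - after[name]) / before[name] * 100
    print(f"{name}: {speedup:.1f}% faster")
# headers_raw_to_dict: 19.8% faster
# headers_dict_to_raw: 40.7% faster
```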


abebus commented Jul 27, 2025

Do I need to fix the codecov report with missing base or will the problem be automatically resolved after the merge?


wRAR commented Jul 27, 2025

codecov has had some problems recently; I don't think we can do anything about that.


@wRAR wRAR left a comment


Thanks! Please let us know when you think it's ready for merging.


abebus commented Jul 27, 2025

I believe it’s ready to merge. I’ve done my best on these functions and don’t plan any further changes :)

@wRAR wRAR merged commit f45e3ff into scrapy:master Jul 27, 2025
26 checks passed