`webpages.fetch` uses `requests` to fetch the web content and returns `response.text` as a Python `str`. `requests` automatically decodes the binary data it fetches from the target site, turning `bytes` into a Unicode `str`:
https://stackoverflow.com/questions/17011357/what-is-the-difference-between-content-and-text
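
For context, a minimal sketch of the distinction (the URL is just a placeholder):

```python
import requests

resp = requests.get("https://example.com")

raw = resp.content   # bytes, exactly as received over the wire
text = resp.text     # str, decoded by requests using its encoding guess
                     # (from HTTP headers, falling back to charset detection)

assert isinstance(raw, bytes)
assert isinstance(text, str)
```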
Then you pass that `str` as input to all the other functions. But later, many of those functions need to convert the `str` back to `bytes`. For instance, that happens in `languages._from_text`:
https://github.com/mediacloud/metadata-lib/blob/main/mcmetadata/languages.py#L58
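
To make the round trip concrete, here is a hypothetical sketch; `_from_text_sketch` stands in for the linked code and is not its actual body:

```python
def _from_text_sketch(text: str) -> bytes:
    # The fetch already did bytes -> str; this undoes it.
    return text.encode("utf-8")

html_bytes = b"<html><body>hola mundo</body></html>"  # what the server sent
html_str = html_bytes.decode("utf-8")                 # requests: bytes -> str
round_tripped = _from_text_sketch(html_str)           # here: str -> bytes again
assert round_tripped == html_bytes
```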
My suggestion is:

1. Use `requests` -> `response.content`, which is the fetched content in `bytes`.
2. The `extract()` method should take `html_bytes` as input.
3. Make the `bytes` -> `str` conversion a separate function and run it inside `extract()`.
4. Pass `bytes` to the extractors that work better with `bytes`, and `str` to the extractors that work better with `str` (see the sketch after this list).
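
Here is a sketch of the proposed flow. `html_to_str()`, the `extract()` signature, and both extractor names are assumptions for illustration, not the library's current API:

```python
import requests

def html_to_str(html_bytes: bytes, encoding: str = "utf-8") -> str:
    # Single, explicit bytes -> str conversion point; a real version might
    # sniff the encoding instead (e.g. via trafilatura.utils.decode_file).
    return html_bytes.decode(encoding, errors="replace")

def bytes_based_extractor(data: bytes) -> str:
    return ""  # placeholder for an extractor that prefers bytes

def str_based_extractor(text: str) -> str:
    return ""  # placeholder for an extractor that prefers str

def extract(html_bytes: bytes) -> dict:
    html_str = html_to_str(html_bytes)  # conversion happens exactly once
    return {
        "text": bytes_based_extractor(html_bytes),  # bytes stay bytes
        "title": str_based_extractor(html_str),
    }

resp = requests.get("https://example.com")  # placeholder URL
result = extract(resp.content)              # pass bytes, not resp.text
```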
This way, you'll have:

a) correct operation (I wonder whether language detection currently works 100% correctly, since the content has already been auto-decoded and re-encoded along the way by `requests` and `trafilatura.utils.decode_file`);
b) slightly faster operation, since you'll avoid redundant `bytes` -> `str` -> `bytes` conversions.
Thanks
BTW, many of the external libraries you use support both `str` and `bytes` as input. E.g. readability-lxml does this:
https://github.com/buriy/python-readability/blob/master/readability/htmls.py
But it's not efficient to let this conversion happen inside the extractors, because then it is done multiple times (once internally in each extractor). It's better to do it just once.
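
For illustration, a minimal sketch of that, assuming readability-lxml is installed; per the linked `htmls.py`, `Document` accepts either type:

```python
from readability import Document  # pip install readability-lxml

html_str = "<html><body><p>" + "Readable text. " * 10 + "</p></body></html>"
html_bytes = html_str.encode("utf-8")

# Both calls work, but with bytes readability runs its own encoding
# detection internally, i.e. the per-extractor conversion described above.
print(Document(html_str).summary())
print(Document(html_bytes).summary())
```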