
Improvement regarding content decoding/encoding #32

@vbanos

Description


webpages.fetch uses requests to fetch the web content and returns response.text as a Python str.
requests automatically decodes the binary data it fetches from the target site, converting bytes to a unicode str:
https://stackoverflow.com/questions/17011357/what-is-the-difference-between-content-and-text

Then you pass that str as input to all the other functions.

But later, many functions need to convert the str back to bytes. For instance, languages._from_text does exactly that:
https://github.com/mediacloud/metadata-lib/blob/main/mcmetadata/languages.py#L58
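
To make the redundant round-trip concrete, here is a minimal sketch (the variable names are illustrative, and the actual encode call in languages._from_text may differ in detail):

```python
import requests

resp = requests.get("https://example.com")

# requests decodes the raw payload to str using the charset from the
# Content-Type header (or a guess), so response.text has already gone
# through one bytes -> str conversion.
html: str = resp.text

# Later, a consumer that needs bytes (e.g. a language detector) has to
# undo that work and encode the str back to bytes:
html_bytes: bytes = html.encode("utf-8")  # str -> bytes, again
```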

My suggestion (see the sketch after this list):

- Use requests' response.content, which is the fetched content as bytes.
- The extract() method should take html_bytes as input.
- Make the bytes-to-str conversion a separate function and run it inside extract().
- Pass bytes to the extractors that work better with bytes, and str to the extractors that work better with str.
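
A rough sketch of what this could look like. The function and parameter names (decode_html, detect_language, extract_text) are just suggestions, not the library's current API, and the UTF-8 fallback stands in for whatever charset detection you prefer:

```python
import requests


def fetch(url: str) -> bytes:
    resp = requests.get(url)
    resp.raise_for_status()
    # Return the raw payload untouched; no decoding happens here.
    return resp.content


def decode_html(html_bytes: bytes) -> str:
    # The single place where bytes -> str happens. A real implementation
    # would honor a detected charset; UTF-8 with replacement is a stand-in.
    return html_bytes.decode("utf-8", errors="replace")


def detect_language(html_bytes: bytes) -> str:
    # Placeholder for an extractor that works better with bytes.
    ...


def extract_text(html_str: str) -> str:
    # Placeholder for an extractor that works better with str.
    ...


def extract(html_bytes: bytes) -> dict:
    html_str = decode_html(html_bytes)  # conversion done exactly once
    return {
        # bytes go to the extractors that prefer bytes ...
        "language": detect_language(html_bytes),
        # ... and the already-decoded str goes to the rest.
        "text": extract_text(html_str),
    }
```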

This way you get:
a) correct operation (I wonder whether language detection currently works 100% correctly, since the content has already been auto-decoded by requests before trafilatura.utils.decode_file runs on it), and
b) slightly faster operation, because you avoid the redundant bytes -> str -> bytes conversions.

Thanks

BTW, many of the external libraries you use support both str and bytes as input. E.g., readability-lxml does this:
https://github.com/buriy/python-readability/blob/master/readability/htmls.py
But it's not efficient to leave this conversion to the extractors, because it would then happen multiple times (once inside each extractor). It's better to do it just once.
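
For example, a minimal sketch using readability-lxml's public Document class (the inline HTML is just a toy input): it accepts both bytes and str, but handing it the already-decoded str means its internal charset handling has nothing left to do.

```python
from readability import Document

html_bytes = b"<html><head><title>Hi</title></head><body><p>Hello</p></body></html>"
html_str = html_bytes.decode("utf-8")  # decoded once, up front

# Document() would also accept the raw bytes, but then its own
# conversion logic runs again inside the extractor.
doc = Document(html_str)
print(doc.title())
print(doc.summary())
```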

Metadata

Assignees

No one assigned

Labels

enhancement (New feature or request)
