
Improvement regarding content decoding/encoding #32

@vbanos

Description


webpages.fetch uses requests to fetch the web content and returns response.text as a Python str.
requests automatically decodes the binary data it fetches from the target site, converting bytes to a unicode str:
https://stackoverflow.com/questions/17011357/what-is-the-difference-between-content-and-text

Then you pass that str as input to all the other functions.

But later, many functions need to convert the str back to bytes. For instance, languages._from_text does exactly that:
https://github.com/mediacloud/metadata-lib/blob/main/mcmetadata/languages.py#L58
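
To make the redundant round-trip concrete, here is a minimal sketch (the variable names are illustrative, and the actual encode call in languages._from_text may differ in detail):

```python
import requests

resp = requests.get("https://example.com")

# requests decodes the raw payload to str using the charset from the
# Content-Type header (or a guess), so response.text has already gone
# through one bytes -> str conversion.
html: str = resp.text

# Later, a consumer that needs bytes (e.g. a language detector) has to
# undo that work and encode the str back to bytes:
html_bytes: bytes = html.encode("utf-8")  # str -> bytes, again
```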

My suggestion (see the sketch after this list):

- Use requests' response.content, which is the fetched content as bytes.
- The extract() method should take html_bytes as input.
- Make the bytes-to-str conversion a separate function and run it inside extract().
- Pass bytes to the extractors that work better with bytes, and str to the extractors that work better with str.
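
A rough sketch of what this could look like. The function and parameter names (decode_html, detect_language, extract_text) are just suggestions, not the library's current API, and the UTF-8 fallback stands in for whatever charset detection you prefer:

```python
import requests


def fetch(url: str) -> bytes:
    resp = requests.get(url)
    resp.raise_for_status()
    # Return the raw payload untouched; no decoding happens here.
    return resp.content


def decode_html(html_bytes: bytes) -> str:
    # The single place where bytes -> str happens. A real implementation
    # would honor a detected charset; UTF-8 with replacement is a stand-in.
    return html_bytes.decode("utf-8", errors="replace")


def detect_language(html_bytes: bytes) -> str:
    # Placeholder for an extractor that works better with bytes.
    ...


def extract_text(html_str: str) -> str:
    # Placeholder for an extractor that works better with str.
    ...


def extract(html_bytes: bytes) -> dict:
    html_str = decode_html(html_bytes)  # conversion done exactly once
    return {
        # bytes go to the extractors that prefer bytes ...
        "language": detect_language(html_bytes),
        # ... and the already-decoded str goes to the rest.
        "text": extract_text(html_str),
    }
```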

This way you get:
a) correct operation (I wonder whether language detection currently works 100% correctly, since the content has already been auto-decoded by requests before trafilatura.utils.decode_file runs on it), and
b) slightly faster operation, because you avoid the redundant bytes -> str -> bytes conversions.

Thanks

BTW, many of the external libraries you use support both str and bytes as input. E.g., readability-lxml does this:
https://github.com/buriy/python-readability/blob/master/readability/htmls.py
But it's not efficient to leave this conversion to the extractors, because it would then happen multiple times (once inside each extractor). It's better to do it just once.
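
For example, a minimal sketch using readability-lxml's public Document class (the inline HTML is just a toy input): it accepts both bytes and str, but handing it the already-decoded str means its internal charset handling has nothing left to do.

```python
from readability import Document

html_bytes = b"<html><head><title>Hi</title></head><body><p>Hello</p></body></html>"
html_str = html_bytes.decode("utf-8")  # decoded once, up front

# Document() would also accept the raw bytes, but then its own
# conversion logic runs again inside the extractor.
doc = Document(html_str)
print(doc.title())
print(doc.summary())
```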

Metadata

Assignees

No one assigned

Labels

enhancement (New feature or request)
