
Detection of (almost) hidden text in HTML #72

@arpitest


Hi! I'm developing spam filters and need to convert HTML emails to plain text for analysis. I used html2text, and later my own simplified implementation, but inscriptis looks even better!

Would it be possible to add optional filtering (ignoring) of hidden text? By that I mean text rendered in a very small font size, or in a font color equal (or close) to the background color. Sometimes this is defined in CSS/`<style>` tags, sometimes in a `<span>` tag's attributes.
This technique is often used on web pages and in spam emails to fool search engines and spam filters with fake content that is not visible to human readers.
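To make the request concrete, here is a minimal sketch of the kind of filtering meant above. It uses only Python's standard-library `html.parser` and does not touch or reflect inscriptis's API. The names `VisibleTextExtractor` and `visible_text`, and the thresholds `min_font_px` and `min_color_delta`, are made up for illustration. It only looks at inline `style` attributes with hex colors and `px`/`pt` sizes, and assumes black text on a white page by default; rules defined in `<style>` blocks or external CSS are not evaluated.

```python
import re
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def parse_style(style):
    """Turn 'a: b; c: d' into {'a': 'b', 'c': 'd'} (names and values lowercased)."""
    props = {}
    for decl in style.split(";"):
        name, sep, value = decl.partition(":")
        if sep:
            props[name.strip().lower()] = value.strip().lower()
    return props


def parse_hex_color(value):
    """Return (r, g, b) for '#rgb' or '#rrggbb'; None for anything else."""
    m = re.fullmatch(r"#([0-9a-f]{3}|[0-9a-f]{6})", value)
    if not m:
        return None
    h = m.group(1)
    if len(h) == 3:
        h = "".join(c * 2 for c in h)
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))


def font_too_small(value, min_px):
    """True if a 'px'/'pt' (or unitless) font size is below min_px pixels."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)(px|pt)?", value)
    if not m:
        return False  # keywords, em, % etc. are not evaluated in this sketch
    size = float(m.group(1))
    if m.group(2) == "pt":
        size *= 96 / 72  # CSS: 1pt = 1/72in, 1px = 1/96in
    return size < min_px


class VisibleTextExtractor(HTMLParser):
    """Collects text, skipping elements whose inline style makes them (nearly) invisible."""

    def __init__(self, min_font_px=4.0, min_color_delta=30):
        super().__init__()
        self.min_font_px = min_font_px          # hypothetical threshold
        self.min_color_delta = min_color_delta  # hypothetical threshold
        # Stack of (tag, fg, bg, small, gone); assumed default: black on white.
        self.stack = [("", (0, 0, 0), (255, 255, 255), False, False)]
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        _, fg, bg, small, gone = self.stack[-1]  # inherit from parent
        props = parse_style(dict(attrs).get("style") or "")
        new_fg = parse_hex_color(props.get("color", ""))
        new_bg = parse_hex_color(props.get("background-color", ""))
        if new_fg is not None:
            fg = new_fg
        if new_bg is not None:
            bg = new_bg
        if "font-size" in props:
            small = font_too_small(props["font-size"], self.min_font_px)
        if (tag in ("script", "style") or props.get("display") == "none"
                or props.get("visibility") == "hidden"):
            gone = True
        self.stack.append((tag, fg, bg, small, gone))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        _, fg, bg, small, gone = self.stack[-1]
        # Largest per-channel difference between text and background color.
        delta = max(abs(a - b) for a, b in zip(fg, bg))
        if not (gone or small or delta < self.min_color_delta):
            self.chunks.append(data)


def visible_text(html, **thresholds):
    """Return whitespace-normalized text with (almost) hidden parts removed."""
    parser = VisibleTextExtractor(**thresholds)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```

For example, `visible_text('<p>Buy now <span style="font-size:1px">lorem poison</span> today</p>')` drops the 1px span and keeps the rest. A real implementation would also need to resolve `<style>` rules, named and `rgb()` colors, and legacy attributes such as `bgcolor`, which this sketch deliberately leaves out.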

Here is a sample: http://thot.banki.hu/deepspam/poison.html
