Assess tweaks to content extraction to remove headlines at end of article #86

@rahulbot

After some digging on mediacloud/story-indexer#278, it looks like tweaking our Trafilatura integration to pass favor_precision=True could help. In the sample code I provided, run against a few test cases from our researchers, it removed the unwanted trailing content in 3 of 4 cases. This needs more vetting to gauge the impact before we consider rolling out the change.

Test case (change the favor_precision argument to see results; each call prints True if the unwanted text is still in the extracted content):

import trafilatura
import requests
MEDIA_CLOUD_USER_AGENT = 'Mozilla/5.0 (compatible; mediacloud academic archive; mediacloud.org)'

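def is_text_in_webpage_content(txt: str, url: str) -> bool:
    """Return True if txt appears in the text Trafilatura extracts from url."""
    req = requests.get(url, headers={'User-Agent': MEDIA_CLOUD_USER_AGENT}, timeout=30)
    # Toggle favor_precision between True and False to compare extraction results
    parsed = trafilatura.bare_extraction(req.text, only_with_metadata=False, url=url,
                                         include_images=False, include_comments=False,
                                         favor_precision=True)
    if parsed is None:  # bare_extraction returns None when it cannot extract anything
        return False
    content_text = parsed['text']
    return txt in content_text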

print(is_text_in_webpage_content(
    'Thai Official',  # item on bottom of page in "Latest News" section
    'https://www.ibtimes.co.uk/falling-inflation-shifts-focus-when-ecb-could-cut-rates-1722106'))
print(is_text_in_webpage_content(
    'HIV from Terrence Higgins to Today',  # <li> under the "listen on sounds" banner after article
    'https://www.bbc.co.uk/sport/football/67640638'))
print(is_text_in_webpage_content(
    'Madhuri Dixit',  # title of an item in the featured movie below the main content area
    'https://timesofindia.indiatimes.com/videos/lifestyle/fashion/10-indian-saris-every-woman-should-have-in-her-wardrobe/videoshow/105809845.cms'))
print(is_text_in_webpage_content(
    'Immigration, Ukraine',  # title of an item in the "most popular" sidebar content
    'https://www.bfmtv.com/cote-d-azur/nice-25-personnes-expulsees-lors-d-operations-anti-squat-menees-dans-le-quartier-des-liserons_AN-202312150639.html'))
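
For the vetting step, it may help to run each page through both settings side by side and also record how much text comes back, since favor_precision=True could plausibly drop real article text along with the trailing headlines. The following is only a rough sketch: compare_precision, trafilatura_extract, and the Extractor type are hypothetical names made up for illustration, and it uses trafilatura.extract rather than the bare_extraction call above, so results may differ slightly from the indexer's actual integration.

```python
from typing import Callable, Dict, Optional

# Hypothetical extractor signature: (html, favor_precision) -> extracted text or None
Extractor = Callable[[str, bool], Optional[str]]


def trafilatura_extract(html: str, favor_precision: bool) -> Optional[str]:
    """Extract main text with Trafilatura (imported lazily so the
    comparison logic below can be exercised with a stub extractor)."""
    import trafilatura
    return trafilatura.extract(html, include_comments=False,
                               favor_precision=favor_precision)


def compare_precision(html: str, needle: str,
                      extract: Extractor = trafilatura_extract) -> Dict[bool, dict]:
    """Run extraction with favor_precision off and on, and report whether
    the unwanted `needle` text survives and how long the extracted text is."""
    results = {}
    for flag in (False, True):
        text = extract(html, flag) or ''  # treat a failed extraction as empty text
        results[flag] = {'needle_present': needle in text, 'length': len(text)}
    return results
```

A case counts as "helped" when results[False]['needle_present'] is True and results[True]['needle_present'] is False. A large drop in length between the two runs would be a signal to check by hand whether real article text was lost.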

Labels: enhancement
