Open
Description
After some digging on mediacloud/story-indexer#278, it looks like tweaking our Trafilatura integration to use favor_precision=True could help. In the sample code below, run on a few test cases from our researchers, it helped in 3 of 4 cases. This needs more vetting to gauge the impact before we consider rolling out the change.
Test case (change the favor_precision value to see results):
import requests
import trafilatura

MEDIA_CLOUD_USER_AGENT = 'Mozilla/5.0 (compatible; mediacloud academic archive; mediacloud.org)'
FAVOR_PRECISION = True  # toggle this to compare results


def is_text_in_webpage_content(txt: str, url: str) -> bool:
    # Fetch the page and run the extraction with the chosen precision setting
    req = requests.get(url, headers={'User-Agent': MEDIA_CLOUD_USER_AGENT}, timeout=30)
    parsed = trafilatura.bare_extraction(req.text, only_with_metadata=False, url=url,
                                         include_images=False, include_comments=False,
                                         favor_precision=FAVOR_PRECISION)
    if parsed is None:  # nothing could be extracted from this page
        return False
    content_text = parsed['text'] or ''
    return txt in content_text


# Each call prints True if the boilerplate snippet leaked into the extracted text
print(is_text_in_webpage_content(
    'Thai Official',  # item on bottom of page in "Latest News" section
    'https://www.ibtimes.co.uk/falling-inflation-shifts-focus-when-ecb-could-cut-rates-1722106'))
print(is_text_in_webpage_content(
    'HIV from Terrence Higgins to Today',  # <li> under the "listen on sounds" banner after article
    'https://www.bbc.co.uk/sport/football/67640638'))
print(is_text_in_webpage_content(
    'Madhuri Dixit',  # title of an item in the featured movie below the main content area
    'https://timesofindia.indiatimes.com/videos/lifestyle/fashion/10-indian-saris-every-woman-should-have-in-her-wardrobe/videoshow/105809845.cms'))
print(is_text_in_webpage_content(
    'Immigration, Ukraine',  # title of an item in the "most popular" sidebar content
    'https://www.bfmtv.com/cote-d-azur/nice-25-personnes-expulsees-lors-d-operations-anti-squat-menees-dans-le-quartier-des-liserons_AN-202312150639.html'))
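For the broader vetting, one possible approach is to tally boilerplate leaks under both settings over a larger set of saved pages. The sketch below is only an illustration: `Case`, `count_leaks`, and `trafilatura_extract` are hypothetical names that are not part of story-indexer, and it assumes the page HTML has already been fetched so that runs are repeatable.

```python
from typing import Callable, Dict, Iterable, NamedTuple


class Case(NamedTuple):
    label: str        # short note on where the boilerplate sits on the page
    html: str         # page HTML, fetched ahead of time
    boilerplate: str  # text that should NOT appear in the extracted content


def count_leaks(cases: Iterable[Case],
                extract: Callable[[str, bool], str]) -> Dict[bool, int]:
    """Count, per favor_precision setting, how many cases leak boilerplate."""
    leaks = {False: 0, True: 0}
    for case in cases:
        for setting in (False, True):
            if case.boilerplate in extract(case.html, setting):
                leaks[setting] += 1
    return leaks


def trafilatura_extract(html: str, favor_precision: bool) -> str:
    """Hypothetical adapter around trafilatura.extract; '' if nothing is extracted."""
    import trafilatura  # imported lazily so the tally itself has no dependency
    return trafilatura.extract(html, include_comments=False,
                               favor_precision=favor_precision) or ''
```

Calling `count_leaks(cases, trafilatura_extract)` would return a dict keyed by the `favor_precision` value. A fuller check would presumably also confirm that genuine article text still survives extraction, since a stricter setting may drop real content too.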
Metadata
Labels: enhancement (New feature or request)
Status: Todo