Optional Language Detect #932
Merged
141 commits
1e86099
test
gavishpoddar 2c6e29d
test
gavishpoddar 176565b
creating basic structure
gavishpoddar d1c8678
commit
gavishpoddar c8e22d2
updates
gavishpoddar 39a2491
implimenting language library
gavishpoddar 32321cf
custom language detect workable model
gavishpoddar 499a29d
custom language parser updates
gavishpoddar 9a41213
lang_detect implimentation
gavishpoddar 20c5b0e
fixing error handling
gavishpoddar f6a2098
template update
gavishpoddar a3de5d5
fixes
gavishpoddar 6531fd8
optional language detect in search_dates
gavishpoddar 28fbd94
fixing language detection loader
gavishpoddar edfc760
fixes
gavishpoddar 51f67f0
fiexs
gavishpoddar 584f40b
fixing language detection
gavishpoddar b2d1ad6
fixing code and PEP8
gavishpoddar 2b6f195
removing models
gavishpoddar b0c1ad0
fixes on search_data
gavishpoddar 82ce3f1
restruction functions
gavishpoddar 8a4268a
creating tox tests
gavishpoddar 36a88d7
Update dateparser/date.py
gavishpoddar e04faf8
Update tests/test_language_detect.py
gavishpoddar 270faf6
minor fixes
gavishpoddar adb8d3f
Update tests/test_language_detect.py
gavishpoddar 6976a65
Update tests/test_language_detect.py
gavishpoddar ad04bcf
fixes
gavishpoddar 4790d97
fixes
gavishpoddar 7bc43af
updates
gavishpoddar a313d33
Update dateparser/search/search.py
gavishpoddar 7fda13e
exception handling
gavishpoddar 20aba0a
minor fixes
gavishpoddar 57f2cca
fixes
gavishpoddar e6e3ed2
Update dateparser/search/search.py
gavishpoddar d4e68c7
Update dateparser/search/search.py
gavishpoddar 205e29f
fixing tests and search_date default langauge
gavishpoddar 345c18e
WIP: map_languages and fixes for USE_STRICT
gavishpoddar d25f3b2
fixing language_maps structure and WIP docs
gavishpoddar 86df926
updating mapping
gavishpoddar d5764bf
passing settings as param to language detect.
gavishpoddar 63262cf
WIP : Documentation
gavishpoddar 2fd53c4
Fixing : DEFAULT_LANGUAGES
gavishpoddar adb3fcc
Updating : Docs
gavishpoddar 2aa4146
Updating Docs
gavishpoddar 6f32301
Fixing Docs
gavishpoddar 2ce7e07
Fixing langdetect global state issue
gavishpoddar 172361f
WIP:Language Map
gavishpoddar 23c959d
Updating language_info with language_map
gavishpoddar 339f7f6
WIP:Download Manager
gavishpoddar 580a859
Updating setup.py
gavishpoddar 2a81f26
complete : datearser-download
gavishpoddar 0b7607a
download_manager HTTP error handling
gavishpoddar bbfebc9
Updating docs custom_lang_detect
gavishpoddar f6e8c7b
Updating date.py `text` param
gavishpoddar 4f134d3
Update docs
gavishpoddar 50b7224
dateparser-download setting default dir
gavishpoddar 11b55d1
Merge branch 'language' of https://github.com/gavishpoddar/dateparser…
gavishpoddar 2971134
updating params position
gavishpoddar 3836548
Updating docs
gavishpoddar 19ca2ff
Fixning docs
gavishpoddar c715e58
Implimenting clear_cache and remaning detect_lang_function
gavishpoddar 03cd5b5
Updating Docs
gavishpoddar d6448d0
Updating Docs: Apply suggestions from code review
gavishpoddar 639fb0a
Updating: Docs
gavishpoddar 25780f8
Commenting test
gavishpoddar a5fe589
Minor fixes
gavishpoddar 82cad00
print -> logging
gavishpoddar a684fdf
DEFAULT_LANGUAGE works without optional langauge detection
gavishpoddar 03dd0be
fixning check_data_model_home_existance()
gavishpoddar df0d54a
implimenting argparse in dateparser-download
gavishpoddar 54ccdf8
Updating docs
gavishpoddar 2c8c007
Updating tests
gavishpoddar b5e0a30
Commenting test
gavishpoddar 985dad6
Updating tests
gavishpoddar 6745790
Removing fasttext default confidence_threshold and removing results l…
gavishpoddar 69ea22f
caching fasettext model
gavishpoddar e184092
Fixing tests
gavishpoddar 239f4a3
fixing texts and removinf confidence_threshold
gavishpoddar abab857
improving coverage
gavishpoddar ffea246
updating settings
gavishpoddar e199252
updating tests codecov
gavishpoddar 9266b06
Creating new codecov tests in test_language_detect
gavishpoddar 7073f39
removing unnecessary files
gavishpoddar fdc5d17
Minor improvement : map_languages.py += to .extend()
gavishpoddar 4cabce8
fixing _load_fasttext_model()
gavishpoddar 08e416e
updating dateparser-cli
gavishpoddar 68f7e10
adding support for windows
gavishpoddar 4601424
Adding exception for file not found
gavishpoddar d0d2cc2
Fixing: fasttext working in windows
gavishpoddar 3cd316b
Fixing tests for python 3.5
gavishpoddar c68b2f3
Improving dateparser-download
gavishpoddar 7139c79
Updating langdetect.py `_get_language_probablities`
gavishpoddar 3ac2fd9
Update += to .extend() in date.py
gavishpoddar 8896467
updating settings.rst
gavishpoddar 1833f1f
improvements: dateparser-download
gavishpoddar 1318afd
creating setting
gavishpoddar 5401b88
Updating docs
gavishpoddar 0507b48
adding comments in langdetect.py
gavishpoddar 166ac33
Making Factory in langdetect.py locally accessible
gavishpoddar 6e6eb14
minor fixes
gavishpoddar 1fb9768
LANGUAGE_DETECTION_CONFIDENCE_THRESHOLD aditionat checks and test
gavishpoddar a95cf0c
updating tests
gavishpoddar dd76db3
minor fixes and improvements
gavishpoddar ea5be90
fixes
gavishpoddar 5de7f98
improving language_mapping
gavishpoddar 67cfad2
improvising dateparser
gavishpoddar 3cb5257
improving dateparser-cli
gavishpoddar 4b8f850
renamming create_language_maps to generate_language_map
gavishpoddar 8de4d3d
DEFAULT_WIXDOWS_CACHE_DIR python 3.5 compitability
gavishpoddar 05f43bb
improving:detect_languages_function
gavishpoddar b3db7d7
Updating docs
gavishpoddar 6c1475a
trying to resolve git conflicting files
gavishpoddar b8a25dc
Trying to even this base with master
gavishpoddar 57d3386
Merge branch 'master' into language
gavishpoddar 51d1b80
Creating default_languages extra_check
gavishpoddar 2602650
Improvements
gavishpoddar 1a035ea
Apply suggestions from code review
gavishpoddar 10bd877
micro improvements from review
gavishpoddar ab15a3b
Merge branch 'language' of https://github.com/gavishpoddar/dateparser…
gavishpoddar ebe64f9
fixing typo
gavishpoddar 1ba3dc2
removing env variable
gavishpoddar a2c999f
return type checking in language_mapping
gavishpoddar 88d080e
Updates from code review
gavishpoddar 909c5f3
commit of checks
gavishpoddar 658e11f
Apply suggestions from code review
gavishpoddar 5a17537
Apply suggestions from code review
gavishpoddar 3b296db
updating tests and docs
gavishpoddar cbb2564
Merge branch 'language' of https://github.com/gavishpoddar/dateparser…
gavishpoddar b6cbded
updating docs
gavishpoddar 5b19b86
Update dateparser/data/languages_info.py
gavishpoddar 03a689c
removing __init__.py
gavishpoddar 6a78400
Merge branch 'scrapinghub:master' into language
gavishpoddar dd59484
Merge branch 'language' of https://github.com/gavishpoddar/dateparser…
gavishpoddar 0555e38
adding tests
gavishpoddar 5df60c2
adding __init__.py file
gavishpoddar e68260a
updating dateparser-downloads and docs
gavishpoddar 141d2ca
Merge branch 'scrapinghub:master' into language
gavishpoddar 001a9d7
updating dateparser-download
gavishpoddar ae36bc6
Merge branch 'language' of https://github.com/gavishpoddar/dateparser…
gavishpoddar b8dcf7b
PIP8 : new line
gavishpoddar
@@ -0,0 +1,45 @@

    import os

    import fasttext

    from dateparser_cli.fasttext_manager import fasttext_downloader
    from dateparser_cli.utils import dateparser_model_home, create_data_model_home
    from dateparser_cli.exceptions import FastTextModelNotFoundException


    _supported_models = ["large.bin", "small.bin"]
    _DEFAULT_MODEL = "small"


    class _FastTextCache:
        model = None


    def _load_fasttext_model():
        if _FastTextCache.model:
            return _FastTextCache.model
        create_data_model_home()
        downloaded_models = [
            file for file in os.listdir(dateparser_model_home)
            if file in _supported_models
        ]
        if not downloaded_models:
            fasttext_downloader(_DEFAULT_MODEL)
            return _load_fasttext_model()
        model_path = os.path.join(dateparser_model_home, downloaded_models[0])
        if not os.path.isfile(model_path):
            raise FastTextModelNotFoundException('Fasttext model file not found')
        _FastTextCache.model = fasttext.load_model(model_path)
        return _FastTextCache.model


    def detect_languages(text, confidence_threshold):
        _language_parser = _load_fasttext_model()
        text = text.replace('\n', ' ').replace('\r', '')
        language_codes = []
        parser_data = _language_parser.predict(text)
        for idx, language_probability in enumerate(parser_data[1]):
            if language_probability > confidence_threshold:
                language_code = parser_data[0][idx].replace("__label__", "")
                language_codes.append(language_code)
        return language_codes
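
The filtering loop in `detect_languages` above depends on the shape of the value returned by `predict`: a pair of parallel sequences, one holding labels prefixed with `__label__` and one holding their probabilities. The sketch below reproduces only that post-processing step on a hand-written result, so it runs without fastText or a downloaded model. The helper name `filter_predictions` and the sample values are illustrative assumptions, not part of this PR.

```python
def filter_predictions(parser_data, confidence_threshold):
    """Keep the language codes whose probability exceeds the threshold.

    `parser_data` mimics the (labels, probabilities) pair that the
    diff reads from `_language_parser.predict(text)`.
    """
    labels, probabilities = parser_data
    return [
        label.replace("__label__", "")
        for label, probability in zip(labels, probabilities)
        if probability > confidence_threshold
    ]


# Hand-written stand-in for a prediction result (the values are made up).
sample = (("__label__en", "__label__es"), (0.91, 0.05))

print(filter_predictions(sample, 0.5))   # ['en']
print(filter_predictions(sample, 0.95))  # []
```

Note that the diff calls `predict(text)` without further arguments, so, assuming fastText's default of returning only the top label, at most one language code can come back from this detector.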
@@ -0,0 +1,37 @@

    import langdetect


    # The below _Factory is set to prevent setting global state of the library
    # but still get consistent results.
    # Refer : https://github.com/Mimino666/langdetect

    class _Factory:
        data = None


    def _init_factory():
        if _Factory.data is None:
            _Factory.data = langdetect.detector_factory.DetectorFactory()
            _Factory.data.load_profile(langdetect.detector_factory.PROFILES_DIRECTORY)
            _Factory.data.seed = 0


    def _get_language_probablities(text):
        _init_factory()
        detector = _Factory.data.create()
        detector.append(text)
        return detector.get_probabilities()


    def detect_languages(text, confidence_threshold):
        language_codes = []
        try:
            parser_data = _get_language_probablities(text)
            for language_candidate in parser_data:
                if language_candidate.prob > confidence_threshold:
                    language_codes.append(language_candidate.lang)
        except langdetect.lang_detect_exception.LangDetectException:
            # This exception can be produced with empty strings or inputs without letters like `10-10-2021`.
            # As this could be really common, we ignore them.
            pass
        return language_codes
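
Both detectors in this diff expose the same contract: `detect_languages(text, confidence_threshold)` returns the list of language codes whose probability exceeds the threshold, and an empty list when the text offers nothing to detect. The following is a minimal sketch of that contract with a stand-in detector, so it runs without langdetect installed. `DetectionError` and `_fake_probabilities` are hypothetical names invented for this illustration.

```python
class DetectionError(Exception):
    """Stand-in for a detector-specific error such as LangDetectException."""


def _fake_probabilities(text):
    # Hypothetical detector: pretends that any text containing letters is
    # English with probability 0.9 and Spanish with probability 0.1.
    if not any(ch.isalpha() for ch in text):
        raise DetectionError("no features in text")
    return [("en", 0.9), ("es", 0.1)]


def detect_languages(text, confidence_threshold):
    language_codes = []
    try:
        for lang, prob in _fake_probabilities(text):
            if prob > confidence_threshold:
                language_codes.append(lang)
    except DetectionError:
        # Inputs such as "10-10-2021" carry no letters; report "unknown"
        # as an empty list instead of propagating the error.
        pass
    return language_codes


print(detect_languages("10 October 2021", 0.5))  # ['en']
print(detect_languages("10-10-2021", 0.5))       # []
```

Swallowing the exception keeps purely numeric dates from failing at the detection step; the caller simply receives no language hints for them. The commit messages suggest a function of this shape is what the `detect_languages_function` option is meant to receive, though the exact parameter wiring is not visible in this excerpt.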
@@ -0,0 +1,18 @@

    from dateparser.data.languages_info import language_map


    def map_languages(language_codes):
        """
        Returns the candidates from the supported languages codes.
        :param language_codes:
            A list of language codes, e.g. ['en', 'es'] in ISO 639 Standard.
        :type language_codes: list
        :return: Returns list[str] representing supported languages
        :rtype: list[str]
        """
        return [
            language_code
            for language in language_codes
            if language in language_map
            for language_code in language_map[language]
        ]
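
To make the nested comprehension in `map_languages` concrete, here is a small sketch that runs it against a hand-written mapping. The `language_map` excerpt below is an assumption for illustration; the real mapping is imported from `dateparser.data.languages_info` and its contents are not shown in this diff.

```python
# Hypothetical excerpt of the mapping: detected code -> supported locale codes.
language_map = {
    "en": ["en"],
    "sr": ["sr", "sr-Cyrl", "sr-Latn"],
}


def map_languages(language_codes):
    # Same comprehension as in the diff: skip unknown codes and flatten
    # the per-language lists into a single list, preserving input order.
    return [
        language_code
        for language in language_codes
        if language in language_map
        for language_code in language_map[language]
    ]


print(map_languages(["sr", "xx", "en"]))  # ['sr', 'sr-Cyrl', 'sr-Latn', 'en']
```

Codes missing from the mapping (here `"xx"`) are dropped silently, so a detector may return languages the parser does not support without causing an error.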