Commit a50fc79

🔨 **test(test_real_detection): Raise error for non-memory model load failures**
Update test to ensure non-memory errors raise `DetectError` instead of falling back to a smaller model.

📝 **docs(README): Clarify fallback behavior and add language code mapping guide**
Explain conditions for model fallback and provide guidance on mapping language codes to English names using `langcodes`.
1 parent d7d255c commit a50fc79

2 files changed: +53 -5 lines changed


README.md

Lines changed: 49 additions & 0 deletions
@@ -166,6 +166,55 @@ print(detector.detect("Some very long text..."))
- When truncation happens, a WARNING is logged because it may reduce accuracy.
- `max_input_length=80` truncates overly long inputs; set `None` to disable if you prefer no truncation.

### Fallback Behavior

- As of the latest change, the library only falls back to the bundled small model when a MemoryError occurs while loading the large model.
- For other errors (e.g., I/O/permission errors, corrupted files, invalid paths), the error is raised as `DetectError` so you can diagnose the root cause quickly.
- This avoids silently masking real issues and prevents unnecessary re-downloads that can slow execution.

### Language Codes → English Names
176+
177+
The detector returns fastText language codes (e.g., `en`, `zh`, `ja`, `pt-br`). To present user-friendly names, you can map codes to English names using a third-party library. Example using `langcodes`:
178+
179+
```python
# pip install langcodes
from langcodes import Language

OVERRIDES = {
    # fastText-specific or variant tags commonly used
    "yue": "Cantonese",
    "wuu": "Wu Chinese",
    "arz": "Egyptian Arabic",
    "ckb": "Central Kurdish",
    "kab": "Kabyle",
    "zh-cn": "Chinese (China)",
    "zh-tw": "Chinese (Taiwan)",
    "pt-br": "Portuguese (Brazil)",
}

def code_to_english_name(code: str) -> str:
    code = code.replace("_", "-").lower()
    if code in OVERRIDES:
        return OVERRIDES[code]
    try:
        # Display name in English; e.g. 'Portuguese (Brazil)'
        return Language.get(code).display_name("en")
    except Exception:
        # Try the base language (e.g., 'pt' from 'pt-br')
        base = code.split("-")[0]
        try:
            return Language.get(base).display_name("en")
        except Exception:
            return code

# Usage
from fast_langdetect import detect
result = detect("Olá mundo", low_memory=False)
print(code_to_english_name(result["lang"]))  # Portuguese (Brazil) or Portuguese
```
215+
216+
Alternatively, `pycountry` can be used for ISO 639 lookups (install with `pip install pycountry`), combined with a small override dict for non-standard tags like `pt-br`, `zh-cn`, `yue`, etc.
217+
## Benchmark 📊

For detailed benchmark results, refer

tests/test_real_detection.py

Lines changed: 4 additions & 5 deletions
@@ -93,16 +93,15 @@ def test_not_found_model(self):
         detector = LangDetector(config)
         detector.detect("Hello world", low_memory=False)
 
-    def test_not_found_model_with_fallback(self):
-        """Test fallback to small model when large model fails to load."""
+    def test_not_found_model_without_fallback_on_io_error(self):
+        """Non-memory errors should not fallback; they should raise."""
         config = LangDetectConfig(
             cache_dir="/nonexistent/path",
             allow_fallback=True,
         )
         detector = LangDetector(config)
-        result = detector.detect("Hello world", low_memory=False)
-        assert result["lang"] == "en"
-        assert 0.1 <= result["score"] <= 1.0
+        with pytest.raises(DetectError):
+            detector.detect("Hello world", low_memory=False)
 
     @pytest.mark.real
     @pytest.mark.slow
