
Commit b775f8c

♻️ refactor(tests/infer): unify detect functions, remove detect_multilingual usage
1 parent 44c52a5 commit b775f8c

7 files changed: +180 -292 lines

README.md

Lines changed: 65 additions & 54 deletions
@@ -9,17 +9,17 @@
 **`fast-langdetect`** is an ultra-fast and highly accurate language detection library based on FastText, a library developed by Facebook. Its incredible speed and accuracy make it 80x faster than conventional methods and deliver up to 95% accuracy.
 
 - Supported Python `3.9` to `3.13`.
-- Works offline in low memory mode
+- Works offline with the lite model
 - No `numpy` required (thanks to @dalf).
 
 > ### Background
 >
 > This project builds upon [zafercavdar/fasttext-langdetect](https://github.com/zafercavdar/fasttext-langdetect#benchmark) with enhancements in packaging.
 > For more information about the underlying model, see the official FastText documentation: [Language Identification](https://fasttext.cc/docs/en/language-identification.html).
 
-> ### Possible memory usage
+> ### Memory note
 >
-> *This library requires at least **200MB memory** in low-memory mode.*
+> The lite model runs offline and is memory-friendly; the full model is larger and offers higher accuracy. Choose the model that best fits your constraints.
 
 ## Installation 💻
 
@@ -39,7 +39,7 @@ pdm add fast-langdetect
 
 ## Usage 🖥️
 
-In scenarios **where accuracy is important**, you should not rely on the detection results of small models, use `low_memory=False` to download larger models!
+For higher accuracy, prefer the full model via `detect(text, model='full')`. For robust behavior under memory pressure, use `detect(text, model='auto')`, which falls back to the lite model only on `MemoryError`.
 
 ### Prerequisites
 
@@ -48,42 +48,68 @@ In scenarios **where accuracy is important**, you should not rely on the detecti
 - Setting `FTLANG_CACHE` environment variable
 - Using `LangDetectConfig(cache_dir="your/path")`
 
+### Simple Usage (Recommended)
+
+Call by model explicitly — clear and predictable, and use `k` to get multiple candidates. The function always returns a list of results:
+
+```python
+from fast_langdetect import detect
+
+# Lite model (offline, smaller, faster) — never falls back
+print(detect("Hello", model='lite', k=1))  # -> [{'lang': 'en', 'score': ...}]
+
+# Full model (downloaded to cache, higher accuracy) — never falls back
+print(detect("Hello", model='full', k=1))  # -> [{'lang': 'en', 'score': ...}]
+
+# Auto mode: try full, fall back to lite only on MemoryError
+print(detect("Hello", model='auto', k=1))  # -> [{'lang': 'en', 'score': ...}]
+
+# Multilingual: top 3 candidates (always a list)
+print(detect("Hello 世界 こんにちは", model='auto', k=3))
+```
+
+If you need a custom cache directory, pass `LangDetectConfig`:
+
+```python
+from fast_langdetect import LangDetectConfig, detect
+
+cfg = LangDetectConfig(cache_dir="/custom/cache/path")
+print(detect("Hello", model='full', config=cfg))
+```
+
 ### Native API (Recommended)
 
 ```python
-from fast_langdetect import detect, detect_multilingual, LangDetector, LangDetectConfig, DetectError
+from fast_langdetect import detect, LangDetector, LangDetectConfig, DetectError
 
-# Simple detection
-print(detect("Hello, world!"))
-# Output: {'lang': 'en', 'score': 0.12450417876243591}
+# Simple detection (auto behavior)
+print(detect("Hello, world!", model='auto', k=1))
+# Output: [{'lang': 'en', 'score': 0.98}]
 
-# Using large model for better accuracy
-print(detect("Hello, world!", low_memory=False))
-# Output: {'lang': 'en', 'score': 0.98765432109876}
+# Using full model for better accuracy
+print(detect("Hello, world!", model='full', k=1))
+# Output: [{'lang': 'en', 'score': 0.99}]
 
-# Custom configuration with fallback mechanism
-config = LangDetectConfig(
-    cache_dir="/custom/cache/path",  # Custom model cache directory
-    allow_fallback=True  # Enable fallback to small model if large model fails
-)
+# Custom configuration
+config = LangDetectConfig(cache_dir="/custom/cache/path")  # Custom model cache directory
 detector = LangDetector(config)
 
 try:
-    result = detector.detect("Hello world", low_memory=False)
-    print(result)  # {'lang': 'en', 'score': 0.98}
+    result = detector.detect("Hello world", model='full', k=1)
+    print(result)  # [{'lang': 'en', 'score': 0.98}]
 except DetectError as e:
     print(f"Detection failed: {e}")
 
 # Multiline text is handled automatically (newlines are replaced)
 multiline_text = "Hello, world!\nThis is a multiline text."
-print(detect(multiline_text))
-# Output: {'lang': 'en', 'score': 0.85}
+print(detect(multiline_text, k=1))
+# Output: [{'lang': 'en', 'score': 0.85}]
 
 # Multi-language detection
-results = detect_multilingual(
-    "Hello 世界 こんにちは",
-    low_memory=False,  # Use large model for better accuracy
-    k=3  # Return top 3 languages
+results = detect(
+    "Hello 世界 こんにちは",
+    model='auto',
+    k=3  # Return top 3 languages (auto model loading)
 )
 print(results)
 # Output: [
@@ -93,26 +119,11 @@ print(results)
 # ]
 ```
 
-#### Fallbacks
-
-We provide a fallback mechanism: when `allow_fallback=True`, if the program fails to load the **large model** (`low_memory=False`), it will fall back to the offline **small model** to complete the prediction task.
-
-```python
-# Disable fallback - will raise error if large model fails to load
-# But fallback disabled when custom_model_path is not None, because its a custom model, we will directly use it.
-import tempfile
-config = LangDetectConfig(
-    allow_fallback=False,
-    custom_model_path=None,
-    cache_dir=tempfile.gettempdir(),
-)
-detector = LangDetector(config)
+#### Fallback Policy (Keep It Simple)
 
-try:
-    result = detector.detect("Hello world", low_memory=False)
-except DetectError as e:
-    print("Model loading failed and fallback is disabled")
-```
+- Only `MemoryError` triggers fallback (with `model='auto'`): when loading the full model runs out of memory, detection falls back to the lite model.
+- I/O, network, permission, path, and integrity errors raise `DetectError` (with the original exception attached) — no silent fallback.
+- `model='lite'` and `model='full'` never fall back, by design.
 
 ### Convenient `detect_language` Function
 
@@ -134,12 +145,9 @@ print(detect_language("你好,世界!"))
 
 ```python
 # Load model from local file
-config = LangDetectConfig(
-    custom_model_path="/path/to/your/model.bin",  # Use local model file
-    disable_verify=True  # Skip MD5 verification
-)
+config = LangDetectConfig(custom_model_path="/path/to/your/model.bin")
 detector = LangDetector(config)
-result = detector.detect("Hello world")
+result = detector.detect("Hello world", model='auto', k=1)
 ```
 
 ### Splitting Text by Language 🌐
@@ -166,11 +174,14 @@ print(detector.detect("Some very long text..."))
 - When truncation happens, a WARNING is logged because it may reduce accuracy.
 - `max_input_length=80` truncates overly long inputs; set `None` to disable if you prefer no truncation.
 
-### Fallback Behavior
+### Cache Directory Behavior
+
+- Default cache: if `cache_dir` is not set, models are stored under a system temp-based directory specified by `FTLANG_CACHE` or an internal default. This directory is created automatically when needed.
+- User-provided `cache_dir`: if you set `LangDetectConfig(cache_dir=...)` to a path that does not exist, the library raises `DetectError` instead of silently creating it or using another location. Create the directory yourself if that's intended.
+
+### Advanced Options (Optional)
 
-- As of the latest change, the library only falls back to the bundled small model when a MemoryError occurs while loading the large model.
-- For other errors (e.g., I/O/permission errors, corrupted files, invalid paths), the error is raised as `DetectError` so you can diagnose the root cause quickly.
-- This avoids silently masking real issues and prevents unnecessary re-downloads that can slow execution.
+The constructor exposes a few advanced knobs (`proxy`, `normalize_input`, `max_input_length`). These are rarely needed for typical usage and can be ignored. Prefer `detect(..., model=...)` unless you know you need them.
 
 ### Language Codes → English Names
 
@@ -209,8 +220,8 @@ def code_to_english_name(code: str) -> str:
 
 # Usage
 from fast_langdetect import detect
-result = detect("Olá mundo", low_memory=False)
-print(code_to_english_name(result["lang"]))  # Portuguese (Brazil) or Portuguese
+result = detect("Olá mundo", model='full', k=1)
+print(code_to_english_name(result[0]["lang"]))  # Portuguese (Brazil) or Portuguese
 ```
 
 Alternatively, `pycountry` can be used for ISO 639 lookups (install with `pip install pycountry`), combined with a small override dict for non-standard tags like `pt-br`, `zh-cn`, `yue`, etc.
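The override-dict idea mentioned above can be sketched without `pycountry` at all. The table below is a hypothetical minimal mapping, not the library's:

```python
# Hypothetical minimal override table for non-standard tags; extend as needed.
LANG_OVERRIDES = {
    "pt-br": "Portuguese (Brazil)",
    "zh-cn": "Chinese (Simplified)",
    "yue": "Cantonese",
}

def name_for(code: str) -> str:
    # Check overrides first; fall back to the raw code when unknown.
    return LANG_OVERRIDES.get(code.lower(), code)

print(name_for("pt-br"))  # Portuguese (Brazil)
print(name_for("yue"))    # Cantonese
```

With `pycountry` installed, `name_for` could consult `pycountry.languages` before falling back to the raw code, keeping the overrides only for tags outside ISO 639.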

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [project]
 name = "fast-langdetect"
-version = "0.4.0"
+version = "1.0.0"
 description = "Quickly detect text language and segment language"
 authors = [
     { name = "sudoskys", email = "coldlando@hotmail.com" },

src/fast_langdetect/__init__.py

Lines changed: 3 additions & 2 deletions
@@ -3,7 +3,6 @@
 
 from .infer import LangDetector, LangDetectConfig, DetectError  # noqa: F401
 from .infer import detect
-from .infer import detect_multilingual  # noqa: F401
 
 
 def is_japanese(string):
@@ -20,7 +19,9 @@ def detect_language(sentence: str, *, low_memory: bool = True):
     :param low_memory: bool (default: True) whether to use low memory mode
     :return: ZH, EN, JA, KO, FR, DE, ES, .... (two uppercase letters)
     """
-    lang_code = detect(sentence, low_memory=low_memory).get("lang").upper()
+    model = "lite" if low_memory else "full"
+    res_list = detect(sentence, model=model, k=1)
+    lang_code = res_list[0].get("lang").upper() if res_list else "EN"
     if lang_code == "JA" and not is_japanese(sentence):
         lang_code = "ZH"
     return lang_code
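`detect_language` leans on an `is_japanese` helper that this diff does not show. One plausible implementation (a guess at the heuristic, not the repository's code) checks for kana, since kanji-only text is indistinguishable from Chinese ideographs:

```python
def is_japanese(text: str) -> bool:
    # Kana check: hiragana (U+3040-U+309F) or katakana (U+30A0-U+30FF).
    # Text containing only CJK ideographs is not treated as Japanese,
    # which is why the caller remaps a kana-free "JA" result to "ZH".
    return any("\u3040" <= ch <= "\u30ff" for ch in text)

print(is_japanese("こんにちは"))  # True (hiragana)
print(is_japanese("你好"))        # False (CJK ideographs only)
```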
