**`fast-langdetect`** is an ultra-fast and highly accurate language detection library based on FastText, a library developed by Facebook. It is up to 80x faster than conventional methods and delivers up to 95% accuracy.

- Supports Python `3.9` to `3.13`.
- Works offline with the lite model.
- No `numpy` required (thanks to @dalf).

> ### Background
>
> This project builds upon [zafercavdar/fasttext-langdetect](https://github.com/zafercavdar/fasttext-langdetect#benchmark) with enhancements in packaging.
> For more information about the underlying model, see the official FastText documentation: [Language Identification](https://fasttext.cc/docs/en/language-identification.html).
>
> ### Memory note
>
> The lite model runs offline and is memory-friendly; the full model is larger and offers higher accuracy. Choose the model that best fits your constraints.
## Installation 💻
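
Install from PyPI with your preferred tool; `pdm add fast-langdetect` appears in the original instructions, and the pip package name is assumed to match:

```bash
pip install fast-langdetect
# or
pdm add fast-langdetect
```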
## Usage 🖥️
For higher accuracy, prefer the full model via `detect(text, model='full')`. For robust behavior under memory pressure, use `detect(text, model='auto')`, which falls back to the lite model only on `MemoryError`.
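
For instance, a minimal sketch of the two options (outputs are illustrative):

```python
from fast_langdetect import detect

print(detect("Hello, world!", model='full', k=1))  # highest accuracy; downloads the full model on first use
print(detect("Hello, world!", model='auto', k=1))  # full model, falling back to lite only on MemoryError
```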
### Prerequisites
Models are downloaded on first use. You can customize where they are cached (see the sketch after this list) by:

- Setting the `FTLANG_CACHE` environment variable
- Using `LangDetectConfig(cache_dir="your/path")`
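
For example, a minimal sketch using the environment variable (the path is illustrative; set it before the first model load):

```python
import os

os.environ["FTLANG_CACHE"] = "/data/ftlang-cache"  # illustrative path

from fast_langdetect import detect

print(detect("Hello world", model='lite', k=1))
```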
### Simple Usage (Recommended)
Select a model explicitly for clear, predictable behavior, and use `k` to get multiple candidates. The function always returns a list of results:
```python
from fast_langdetect import DetectError, LangDetectConfig, LangDetector, detect

# Lite model (offline, smaller, faster); never falls back
print(detect("Hello world", model='lite', k=1))

# Custom configuration
config = LangDetectConfig(cache_dir="/custom/cache/path")  # Custom model cache directory
detector = LangDetector(config)

try:
    result = detector.detect("Hello world", model='full', k=1)
    print(result)  # [{'lang': 'en', 'score': 0.98}]
except DetectError as e:
    print(f"Detection failed: {e}")

# Multiline text is handled automatically (newlines are replaced)
multiline_text = "Hello, world!\nThis is a multiline text."
print(detect(multiline_text, k=1))
# Output: [{'lang': 'en', 'score': 0.85}]

# Multi-language detection
results = detect(
    "Hello 世界 こんにちは",
    model='auto',
    k=3,  # Return top 3 languages
)
print(results)
# Output: [
#     ...  # one {'lang': ..., 'score': ...} entry per candidate
# ]
```
#### Fallback Policy (Keep It Simple)
- Only a `MemoryError` triggers fallback (with `model='auto'`): if loading the full model runs out of memory, detection falls back to the lite model.
- I/O, network, permission, path, and integrity errors raise `DetectError` (with the original exception attached); there is no silent fallback.
- `model='lite'` and `model='full'` never fall back, by design.

For example:

```python
# 'auto' prefers the full model and falls back to the lite model only on MemoryError
result = detector.detect("Hello world", model='auto', k=1)
```
### Splitting Text by Language 🌐
Overly long inputs are truncated before detection by default:

- When truncation happens, a WARNING is logged, because truncation may reduce accuracy.
- `max_input_length=80` truncates overly long inputs; set it to `None` to disable truncation entirely.
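
A sketch of disabling truncation via the `max_input_length` option described above:

```python
from fast_langdetect import LangDetectConfig, LangDetector

# max_input_length=None passes long inputs through untruncated
detector = LangDetector(LangDetectConfig(max_input_length=None))
print(detector.detect("Some very long text..." * 100, model='lite', k=1))
```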
### Cache Directory Behavior
- Default cache: if `cache_dir` is not set, models are stored under a system temp-based directory specified by `FTLANG_CACHE` or an internal default. This directory is created automatically when needed.
- User-provided `cache_dir`: if you set `LangDetectConfig(cache_dir=...)` to a path that does not exist, the library raises `DetectError` instead of silently creating or using another location. Create the directory yourself if that's intended, as in the sketch below.
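
For example (paths illustrative), creating the directory up front avoids the `DetectError` described above:

```python
import os

from fast_langdetect import LangDetectConfig, LangDetector

cache_dir = "/data/fasttext-models"
os.makedirs(cache_dir, exist_ok=True)  # the library will not create a user-provided cache_dir

detector = LangDetector(LangDetectConfig(cache_dir=cache_dir))
print(detector.detect("Bonjour le monde", model='lite', k=1))
```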
### Advanced Options (Optional)
The constructor exposes a few advanced knobs (`proxy`, `normalize_input`, `max_input_length`). These are rarely needed for typical usage and can be ignored. Prefer `detect(..., model=...)` unless you know you need them.
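
A sketch assuming these are keyword arguments on `LangDetectConfig` (values are illustrative):

```python
from fast_langdetect import LangDetectConfig, LangDetector

config = LangDetectConfig(
    proxy="http://127.0.0.1:7890",  # proxy used when downloading the full model
    normalize_input=True,           # normalize input text before detection
    max_input_length=80,            # truncate longer inputs; None disables truncation
)
detector = LangDetector(config)
print(detector.detect("Hello world", model='auto', k=1))
```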
Detected language tags can be mapped to English names:

```python
result = detect("Olá mundo", model='full', k=1)
print(code_to_english_name(result[0]["lang"]))  # Portuguese (Brazil) or Portuguese
```
215
226
216
227
Alternatively, `pycountry` can be used for ISO 639 lookups (install with `pip install pycountry`), combined with a small override dict for non-standard tags like `pt-br`, `zh-cn`, `yue`, etc.
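
A minimal sketch of that approach; the override table is illustrative, not exhaustive:

```python
import pycountry  # pip install pycountry

# Overrides for regional/non-standard tags that plain ISO 639 lookups miss
OVERRIDES = {"pt-br": "Portuguese (Brazil)", "zh-cn": "Chinese (Simplified)"}

def lang_name(tag: str) -> str:
    tag = tag.lower()
    if tag in OVERRIDES:
        return OVERRIDES[tag]
    match = pycountry.languages.get(alpha_2=tag) or pycountry.languages.get(alpha_3=tag)
    return match.name if match else tag

print(lang_name("pt-br"))  # Portuguese (Brazil)
print(lang_name("yue"))    # Yue Chinese (via ISO 639-3)
```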