# Parallel URLs Classifier

`parallel-urls-classifier` (PUC) is a tool implemented in Python that infers whether a pair of URLs links to parallel documents (i.e., documents with the same content but written in different languages). It can output either a textual label, `positive`/`negative`, or the probability that the URL pair links to parallel documents.

The code provided in this repo allows you to train new models. If you want to use the released models instead, see the HuggingFace page (which includes a usage example): https://huggingface.co/Transducens/xlm-roberta-base-parallel-urls-classifier. The released models on HuggingFace are not directly compatible with this code, since this repo contains code ported from HuggingFace to implement multitasking; however, if multitasking was not used, models can be manually converted to/from the HuggingFace version. Use this code if you plan to train new models.
## Installation

To install PUC, first clone the repository:

```bash
git clone https://github.com/transducens/parallel-urls-classifier.git
```

Optionally, create a conda environment to isolate the Python dependencies:

```bash
conda create -n PUC -c conda-forge python==3.8.5
conda activate PUC
```
Install PUC:

```bash
cd parallel-urls-classifier
pip3 install .
```

Check the installation:

```bash
parallel-urls-classifier --help
```
## Usage

```
usage: parallel-urls-classifier [-h] [--batch-size BATCH_SIZE]
                                [--block-size BLOCK_SIZE]
                                [--max-tokens MAX_TOKENS] [--epochs EPOCHS]
                                [--do-not-fine-tune] [--freeze-whole-model]
                                [--dataset-workers DATASET_WORKERS]
                                [--pretrained-model PRETRAINED_MODEL]
                                [--max-length-tokens MAX_LENGTH_TOKENS]
                                [--model-input MODEL_INPUT]
                                [--model-output MODEL_OUTPUT] [--inference]
                                [--inference-from-stdin]
                                [--inference-lang-using-url2lang]
                                [--parallel-likelihood]
                                [--threshold THRESHOLD]
                                [--imbalanced-strategy {none,over-sampling,weighted-loss}]
                                [--patience PATIENCE] [--train-until-patience]
                                [--do-not-load-best-model]
                                [--overwrite-output-model]
                                [--remove-authority]
                                [--remove-positional-data-from-resource]
                                [--add-symmetric-samples] [--force-cpu]
                                [--log-directory LOG_DIRECTORY] [--regression]
                                [--url-separator URL_SEPARATOR]
                                [--url-separator-new-token]
                                [--learning-rate LEARNING_RATE]
                                [--optimizer {none,adam,adamw,sgd}]
                                [--optimizer-args beta1 beta2 eps weight_decay]
                                [--lr-scheduler {none,linear,CLR,inverse_sqrt}]
                                [--lr-scheduler-args warmup_steps]
                                [--re-initialize-last-n-layers RE_INITIALIZE_LAST_N_LAYERS]
                                [--cuda-amp] [--llrd]
                                [--stringify-instead-of-tokenization]
                                [--lowercase]
                                [--auxiliary-tasks [{mlm,language-identification,langid-and-urls_classification} [{mlm,language-identification,langid-and-urls_classification} ...]]]
                                [--auxiliary-tasks-weights [AUXILIARY_TASKS_WEIGHTS [AUXILIARY_TASKS_WEIGHTS ...]]]
                                [--freeze-embeddings-layer]
                                [--remove-instead-of-truncate]
                                [--best-dev-metric {loss,Macro-F1,MCC}]
                                [--task-dev-metric {urls_classification,language-identification,langid-and-urls_classification}]
                                [--auxiliary-tasks-flags [{language-identification_add-solo-urls-too,language-identification_target-applies-only-to-trg-side,langid-and-urls_classification_reward-if-only-langid-is-correct-too} [{language-identification_add-solo-urls-too,language-identification_target-applies-only-to-trg-side,langid-and-urls_classification_reward-if-only-langid-is-correct-too} ...]]]
                                [--do-not-train-main-task] [--pre-load-shards]
                                [--seed SEED] [--plot] [--plot-path PLOT_PATH]
                                [--lock-file LOCK_FILE]
                                [--waiting-time WAITING_TIME] [-v]
                                dataset_train_filename dataset_dev_filename
                                dataset_test_filename

Parallel URLs classifier

positional arguments:
  dataset_train_filename
                        Filename with train data (TSV format). You can
                        provide multiple files separated using ':', and each
                        of them will be used for one epoch, following a
                        round-robin strategy
  dataset_dev_filename  Filename with dev data (TSV format)
  dataset_test_filename
                        Filename with test data (TSV format)

optional arguments:
  -h, --help            show this help message and exit
  --batch-size BATCH_SIZE
                        Batch size: elements which will be processed before
                        proceeding to train, but the whole batch will be
                        processed in blocks in order to avoid OOM errors
                        (default: 16)
  --block-size BLOCK_SIZE
                        Block size: elements which will be provided to the
                        model at once (default: None)
  --max-tokens MAX_TOKENS
                        Process batches in groups of a given token size
                        (fairseq style). Batch size is still relevant since
                        the value is used when batches are needed (e.g.
                        sampler from dataset) (default: -1)
  --epochs EPOCHS       Epochs (default: 3)
  --do-not-fine-tune    Do not apply fine-tuning to the base model (default
                        weights) (default: False)
  --freeze-whole-model  Do not apply fine-tuning to the whole model, not
                        only the base model (default: False)
  --dataset-workers DATASET_WORKERS
                        No. of workers when loading the data in the dataset.
                        When negative, all available CPUs will be used
                        (default: -1)
  --pretrained-model PRETRAINED_MODEL
                        Pretrained model (default: xlm-roberta-base)
  --max-length-tokens MAX_LENGTH_TOKENS
                        Max. length for the generated tokens (default: 256)
  --model-input MODEL_INPUT
                        Model input path which will be loaded (default: None)
  --model-output MODEL_OUTPUT
                        Model output path where the model will be stored
                        (default: None)
  --inference           Do not train, just apply inference (flag
                        --model-input is recommended). If this option is
                        set, it will not be necessary to provide the input
                        dataset (default: False)
  --inference-from-stdin
                        Read inference from stdin (default: False)
  --inference-lang-using-url2lang
                        When --inference is provided, if the language is
                        necessary, url2lang will be used. The langs will be
                        provided anyway to the model if needed, but the
                        result will be ignored. The results of the language
                        tasks will be either 1 or 0 if the lang matches
                        (default: False)
  --parallel-likelihood
                        Print parallel likelihood instead of classification
                        string (inference) (default: False)
  --threshold THRESHOLD
                        Only print URLs which have a parallel likelihood
                        greater than the provided threshold (inference)
                        (default: -inf)
  --imbalanced-strategy {none,over-sampling,weighted-loss}
                        Strategy for dealing with imbalanced data (default:
                        none)
  --patience PATIENCE   Patience before stopping the training (default: 0)
  --train-until-patience
                        Train until patience value is reached (--epochs will
                        be ignored in order to stop, but will still be used
                        for other actions like the LR scheduler) (default:
                        False)
  --do-not-load-best-model
                        Do not load the best model for final dev and test
                        evaluation (--model-output is necessary) (default:
                        False)
  --overwrite-output-model
                        Overwrite output model if it exists (initial
                        loading) (default: False)
  --remove-authority    Remove protocol and authority from provided URLs
                        (default: False)
  --remove-positional-data-from-resource
                        Remove content after '#' in the resource (e.g.
                        https://www.example.com/resource#position ->
                        https://www.example.com/resource) (default: False)
  --add-symmetric-samples
                        Add symmetric samples for training (if the (src,
                        trg) URL pair is provided, the (trg, src) URL pair
                        will be provided as well) (default: False)
  --force-cpu           Run on CPU (i.e. do not check if GPU is possible)
                        (default: False)
  --log-directory LOG_DIRECTORY
                        Directory where different log files will be stored
                        (default: None)
  --regression          Apply regression instead of binary classification
                        (default: False)
  --url-separator URL_SEPARATOR
                        Separator to use when URLs are stringified (default:
                        /)
  --url-separator-new-token
                        Add a special token for the URL separator (default:
                        False)
  --learning-rate LEARNING_RATE
                        Learning rate (default: 1e-05)
  --optimizer {none,adam,adamw,sgd}
                        Optimizer (default: adamw)
  --optimizer-args beta1 beta2 eps weight_decay
                        Args. for the optimizer (in order to see the
                        specific configuration for an optimizer, use -h and
                        set --optimizer) (default: (0.9, 0.999, 1e-08, 0.01))
  --lr-scheduler {none,linear,CLR,inverse_sqrt}
                        LR scheduler (default: inverse_sqrt)
  --lr-scheduler-args warmup_steps
                        Args. for the LR scheduler (in order to see the
                        specific configuration for a LR scheduler, use -h
                        and set --lr-scheduler) (default: ('10%',))
  --re-initialize-last-n-layers RE_INITIALIZE_LAST_N_LAYERS
                        Re-initialize the last N layers from the pretrained
                        model (will be applied only when fine-tuning the
                        model) (default: 1)
  --cuda-amp            Use CUDA AMP (Automatic Mixed Precision) (default:
                        False)
  --llrd                Apply LLRD (Layer-wise Learning Rate Decay)
                        (default: False)
  --stringify-instead-of-tokenization
                        Preprocess URLs applying custom stringify instead of
                        tokenization (default: False)
  --lowercase           Lowercase URLs while preprocessing (default: False)
  --auxiliary-tasks [{mlm,language-identification,langid-and-urls_classification} [{mlm,language-identification,langid-and-urls_classification} ...]]
                        Tasks which will try to help the main task
                        (multitasking) (default: None)
  --auxiliary-tasks-weights [AUXILIARY_TASKS_WEIGHTS [AUXILIARY_TASKS_WEIGHTS ...]]
                        Weights for the loss of the auxiliary tasks. If none
                        is provided, the weights will be 1, but if any is
                        provided, as many weights as auxiliary tasks will
                        have to be provided (default: None)
  --freeze-embeddings-layer
                        Freeze the embeddings layer (default: False)
  --remove-instead-of-truncate
                        Remove pairs of URLs which would need to be
                        truncated (if not enabled, truncation will be
                        applied). This option will only be applied to the
                        training set (default: False)
  --best-dev-metric {loss,Macro-F1,MCC}
                        Which metric should be maximized or minimized when
                        dev is being evaluated in order to save the best
                        model (default: Macro-F1)
  --task-dev-metric {urls_classification,language-identification,langid-and-urls_classification}
                        Task which will be used in order to save the best
                        model. It will also be used in order to replace the
                        main task if --do-not-train-main-task is set
                        (default: urls_classification)
  --auxiliary-tasks-flags [{language-identification_add-solo-urls-too,language-identification_target-applies-only-to-trg-side,langid-and-urls_classification_reward-if-only-langid-is-correct-too} [{language-identification_add-solo-urls-too,language-identification_target-applies-only-to-trg-side,langid-and-urls_classification_reward-if-only-langid-is-correct-too} ...]]
                        Set of options which will set up some aspects of the
                        auxiliary tasks (default: None)
  --do-not-train-main-task
                        The main task (URLs classification) will not be
                        trained. An auxiliary task will be needed (default:
                        False)
  --pre-load-shards     Load all shards at the beginning, one by one, in
                        order to get some statistics needed for some
                        features. This option is optional, but if not set,
                        some features might not work as expected (e.g.
                        linear LR scheduler) (default: False)
  --seed SEED           Seed in order to have deterministic results (not
                        fully guaranteed). Set a negative number in order to
                        disable this feature (default: 71213)
  --plot                Plot statistics (matplotlib pyplot) in real time
                        (default: False)
  --plot-path PLOT_PATH
                        If set, the plot will be stored instead of displayed
                        (default: None)
  --lock-file LOCK_FILE
                        If set, and the file does not exist, it will be
                        created once the training finishes. If it does
                        exist, the training will not be executed (default:
                        None)
  --waiting-time WAITING_TIME
                        Waiting time, if needed for letting the user react
                        (default: 20)
```
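Some of the preprocessing flags above correspond to simple URL transformations. As an illustration only (this is not the repo's actual implementation), a minimal Python sketch of what `--remove-positional-data-from-resource` and `--remove-authority` do, using the example from the help text:

```python
from urllib.parse import urlsplit, urlunsplit

def remove_positional_data(url: str) -> str:
    """Drop the '#fragment' part, as --remove-positional-data-from-resource does."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))

def remove_authority(url: str) -> str:
    """Drop protocol and authority, as --remove-authority does (illustration)."""
    parts = urlsplit(url)
    return urlunsplit(("", "", parts.path, parts.query, parts.fragment))

print(remove_positional_data("https://www.example.com/resource#position"))
# -> https://www.example.com/resource
print(remove_authority("https://www.example.com/resource#position"))
# -> /resource#position
```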
## CLI examples

Train a new model:

```bash
parallel-urls-classifier /path/to/datasets/{train,dev,test}.tsv \
  --regression --epochs 10 --patience 5 --batch-size 140 --train-until-patience \
  --model-output /tmp/PUC_model
```

Interactive inference:

```bash
parallel-urls-classifier --regression --parallel-likelihood --inference \
  --model-input /tmp/PUC_model
```

Inference using data from a file:

```bash
cat /path/to/datasets/test.tsv \
  | parallel-urls-classifier --regression --parallel-likelihood --inference --inference-from-stdin \
      --model-input /tmp/PUC_model
```
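The stdin-based invocation can also be scripted from Python. A minimal sketch; the helper names are ours, and the assumption that each stdin line is a tab-separated src/trg URL pair (as in the TSV file piped above) should be checked against your data:

```python
import subprocess

def build_inference_command(model_dir: str) -> list:
    """Command line matching the stdin-based CLI example above."""
    return [
        "parallel-urls-classifier",
        "--regression", "--parallel-likelihood",
        "--inference", "--inference-from-stdin",
        "--model-input", model_dir,
    ]

def classify_pairs(pairs, model_dir="/tmp/PUC_model"):
    """Feed tab-separated URL pairs to the classifier and return its stdout lines."""
    stdin_data = "\n".join(f"{src}\t{trg}" for src, trg in pairs) + "\n"
    result = subprocess.run(build_inference_command(model_dir),
                            input=stdin_data, capture_output=True,
                            text=True, check=True)
    return result.stdout.splitlines()
```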
## Inference using Gunicorn

You can use different nodes to perform inference and to run the model. For more information on running the model server, see:

```bash
parallel-urls-classifier-server --help
```

You may also want to look at the script `scripts/init_flask_server_with_gunicorn.sh` for a concrete example of how to start the server.

The node performing inference sends its data to the node running the model through HTTP requests. For example, if we run the model on `127.0.0.1`:

```bash
curl http://127.0.0.1:5000/inference -X POST \
  -d "src_urls=https://domain/resource1&trg_urls=https://domain/resource2"
```
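The same request can be made from Python using only the standard library. A sketch mirroring the curl example; the helper names are ours, and we make no assumption about the response format beyond it being the server's raw body:

```python
from urllib import parse, request

def build_inference_request(src_url: str, trg_url: str,
                            endpoint: str = "http://127.0.0.1:5000/inference"):
    """Build the same form-encoded POST request as the curl example above."""
    payload = parse.urlencode({"src_urls": src_url, "trg_urls": trg_url}).encode()
    return request.Request(endpoint, data=payload, method="POST")

def run_inference(src_url: str, trg_url: str) -> str:
    """Send the request; requires the model server to be running."""
    with request.urlopen(build_inference_request(src_url, trg_url)) as resp:
        return resp.read().decode()
```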
