# Parallel URLs Classifier

`parallel-urls-classifier` (PUC) is a tool implemented in Python that infers whether a pair of URLs links to parallel documents (i.e., documents with the same content but written in different languages). The output is either a textual label (`positive`/`negative`) or the probability that the URL pair links to parallel documents.

The code provided in this repo allows you to train new models. If you want to use the released models, see the HuggingFace page (it includes a usage example): https://huggingface.co/Transducens/xlm-roberta-base-parallel-urls-classifier. The released models on HuggingFace are not directly compatible with this code, since this repo contains code ported from HuggingFace to implement multitasking; however, if multitasking was not used, models can be manually converted to/from the HuggingFace version. Use this code if you plan to train new models.
## Installation

To install PUC, first clone the code from the repository:

```bash
git clone https://github.com/transducens/parallel-urls-classifier.git
```

Optionally, create a conda environment to isolate the Python dependencies:

```bash
conda create -n PUC -c conda-forge python==3.8.5
conda activate PUC
```

Install PUC:

```bash
cd parallel-urls-classifier

pip3 install .
```

Check the installation:

```bash
parallel-urls-classifier --help
```
## Usage

```
usage: parallel-urls-classifier [-h] [--batch-size BATCH_SIZE]
                                [--block-size BLOCK_SIZE]
                                [--max-tokens MAX_TOKENS] [--epochs EPOCHS]
                                [--do-not-fine-tune] [--freeze-whole-model]
                                [--dataset-workers DATASET_WORKERS]
                                [--pretrained-model PRETRAINED_MODEL]
                                [--max-length-tokens MAX_LENGTH_TOKENS]
                                [--model-input MODEL_INPUT]
                                [--model-output MODEL_OUTPUT] [--inference]
                                [--inference-from-stdin]
                                [--inference-lang-using-url2lang]
                                [--parallel-likelihood]
                                [--threshold THRESHOLD]
                                [--imbalanced-strategy {none,over-sampling,weighted-loss}]
                                [--patience PATIENCE] [--train-until-patience]
                                [--do-not-load-best-model]
                                [--overwrite-output-model]
                                [--remove-authority]
                                [--remove-positional-data-from-resource]
                                [--add-symmetric-samples] [--force-cpu]
                                [--log-directory LOG_DIRECTORY] [--regression]
                                [--url-separator URL_SEPARATOR]
                                [--url-separator-new-token]
                                [--learning-rate LEARNING_RATE]
                                [--optimizer {none,adam,adamw,sgd}]
                                [--optimizer-args beta1 beta2 eps weight_decay]
                                [--lr-scheduler {none,linear,CLR,inverse_sqrt}]
                                [--lr-scheduler-args warmup_steps]
                                [--re-initialize-last-n-layers RE_INITIALIZE_LAST_N_LAYERS]
                                [--cuda-amp] [--llrd]
                                [--stringify-instead-of-tokenization]
                                [--lowercase]
                                [--auxiliary-tasks [{mlm,language-identification,langid-and-urls_classification} [{mlm,language-identification,langid-and-urls_classification} ...]]]
                                [--auxiliary-tasks-weights [AUXILIARY_TASKS_WEIGHTS [AUXILIARY_TASKS_WEIGHTS ...]]]
                                [--freeze-embeddings-layer]
                                [--remove-instead-of-truncate]
                                [--best-dev-metric {loss,Macro-F1,MCC}]
                                [--task-dev-metric {urls_classification,language-identification,langid-and-urls_classification}]
                                [--auxiliary-tasks-flags [{language-identification_add-solo-urls-too,language-identification_target-applies-only-to-trg-side,langid-and-urls_classification_reward-if-only-langid-is-correct-too} [{language-identification_add-solo-urls-too,language-identification_target-applies-only-to-trg-side,langid-and-urls_classification_reward-if-only-langid-is-correct-too} ...]]]
                                [--do-not-train-main-task] [--pre-load-shards]
                                [--seed SEED] [--plot] [--plot-path PLOT_PATH]
                                [--lock-file LOCK_FILE]
                                [--waiting-time WAITING_TIME] [-v]
                                dataset_train_filename dataset_dev_filename
                                dataset_test_filename

Parallel URLs classifier

positional arguments:
  dataset_train_filename
                        Filename with train data (TSV format). You can
                        provide multiple files separated using ':' and each
                        of them will be used one for each epoch using a
                        round-robin strategy
  dataset_dev_filename  Filename with dev data (TSV format)
  dataset_test_filename
                        Filename with test data (TSV format)

optional arguments:
  -h, --help            show this help message and exit
  --batch-size BATCH_SIZE
                        Batch size. Elements which will be processed before
                        proceeding to train, but the whole batch will be
                        processed in blocks in order to avoid OOM errors
                        (default: 16)
  --block-size BLOCK_SIZE
                        Block size. Elements which will be provided to the
                        model at once (default: None)
  --max-tokens MAX_TOKENS
                        Process batches in groups of a given token size
                        (fairseq style). Batch size is still relevant since
                        the value is used when batches are needed (e.g.
                        sampler from dataset) (default: -1)
  --epochs EPOCHS       Epochs (default: 3)
  --do-not-fine-tune    Do not apply fine-tuning to the base model (default
                        weights) (default: False)
  --freeze-whole-model  Do not apply fine-tuning to the whole model, not only
                        the base model (default: False)
  --dataset-workers DATASET_WORKERS
                        No. of workers when loading the data in the dataset.
                        When negative, all available CPUs will be used
                        (default: -1)
  --pretrained-model PRETRAINED_MODEL
                        Pretrained model (default: xlm-roberta-base)
  --max-length-tokens MAX_LENGTH_TOKENS
                        Max. length for the generated tokens (default: 256)
  --model-input MODEL_INPUT
                        Model input path which will be loaded (default: None)
  --model-output MODEL_OUTPUT
                        Model output path where the model will be stored
                        (default: None)
  --inference           Do not train, just apply inference (flag --model-input
                        is recommended). If this option is set, it will not be
                        necessary to provide the input dataset (default:
                        False)
  --inference-from-stdin
                        Read inference from stdin (default: False)
  --inference-lang-using-url2lang
                        When --inference is provided, if the language is
                        necessary, url2lang will be used. The langs will be
                        provided anyway to the model if needed, but the result
                        will be ignored. The results of the language tasks
                        will be either 1 or 0, depending on whether the langs
                        match (default: False)
  --parallel-likelihood
                        Print parallel likelihood instead of classification
                        string (inference) (default: False)
  --threshold THRESHOLD
                        Only print URLs which have a parallel likelihood
                        greater than the provided threshold (inference)
                        (default: -inf)
  --imbalanced-strategy {none,over-sampling,weighted-loss}
                        Strategy for dealing with imbalanced data (default:
                        none)
  --patience PATIENCE   Patience before stopping the training (default: 0)
  --train-until-patience
                        Train until patience value is reached (--epochs will
                        be ignored in order to stop, but will still be used
                        for other actions like LR scheduler) (default: False)
  --do-not-load-best-model
                        Do not load best model for final dev and test
                        evaluation (--model-output is necessary) (default:
                        False)
  --overwrite-output-model
                        Overwrite output model if it exists (initial loading)
                        (default: False)
  --remove-authority    Remove protocol and authority from provided URLs
                        (default: False)
  --remove-positional-data-from-resource
                        Remove content after '#' in the resource (e.g.
                        https://www.example.com/resource#position ->
                        https://www.example.com/resource) (default: False)
  --add-symmetric-samples
                        Add symmetric samples for training (if (src, trg) URL
                        pair is provided, (trg, src) URL pair will be provided
                        as well) (default: False)
  --force-cpu           Run on CPU (i.e. do not check if GPU is possible)
                        (default: False)
  --log-directory LOG_DIRECTORY
                        Directory where different log files will be stored
                        (default: None)
  --regression          Apply regression instead of binary classification
                        (default: False)
  --url-separator URL_SEPARATOR
                        Separator to use when URLs are stringified (default:
                        /)
  --url-separator-new-token
                        Add special token for URL separator (default: False)
  --learning-rate LEARNING_RATE
                        Learning rate (default: 1e-05)
  --optimizer {none,adam,adamw,sgd}
                        Optimizer (default: adamw)
  --optimizer-args beta1 beta2 eps weight_decay
                        Args. for the optimizer (in order to see the specific
                        configuration for an optimizer, use -h and set
                        --optimizer) (default: (0.9, 0.999, 1e-08, 0.01))
  --lr-scheduler {none,linear,CLR,inverse_sqrt}
                        LR scheduler (default: inverse_sqrt)
  --lr-scheduler-args warmup_steps
                        Args. for LR scheduler (in order to see the specific
                        configuration for a LR scheduler, use -h and set --lr-
                        scheduler) (default: ('10%',))
  --re-initialize-last-n-layers RE_INITIALIZE_LAST_N_LAYERS
                        Re-initialize last N layers from pretrained model
                        (will be applied only when fine-tuning the model)
                        (default: 1)
  --cuda-amp            Use CUDA AMP (Automatic Mixed Precision) (default:
                        False)
  --llrd                Apply LLRD (Layer-wise Learning Rate Decay) (default:
                        False)
  --stringify-instead-of-tokenization
                        Preprocess URLs applying custom stringify instead of
                        tokenization (default: False)
  --lowercase           Lowercase URLs while preprocessing (default: False)
  --auxiliary-tasks [{mlm,language-identification,langid-and-urls_classification} [{mlm,language-identification,langid-and-urls_classification} ...]]
                        Tasks which will try to help the main task
                        (multitasking) (default: None)
  --auxiliary-tasks-weights [AUXILIARY_TASKS_WEIGHTS [AUXILIARY_TASKS_WEIGHTS ...]]
                        Weights for the loss of the auxiliary tasks. If none
                        is provided, the weights will be 1, but if any is
                        provided, as many weights as auxiliary tasks will have
                        to be provided (default: None)
  --freeze-embeddings-layer
                        Freeze embeddings layer (default: False)
  --remove-instead-of-truncate
                        Remove pairs of URLs which would need to be truncated
                        (if not enabled, truncation will be applied). This
                        option will only be applied to the training set
                        (default: False)
  --best-dev-metric {loss,Macro-F1,MCC}
                        Which metric should be maximized or minimized when dev
                        is being evaluated in order to save the best model
                        (default: Macro-F1)
  --task-dev-metric {urls_classification,language-identification,langid-and-urls_classification}
                        Task which will be used in order to save the best
                        model. It will also be used in order to replace the
                        main task if --do-not-train-main-task is set (default:
                        urls_classification)
  --auxiliary-tasks-flags [{language-identification_add-solo-urls-too,language-identification_target-applies-only-to-trg-side,langid-and-urls_classification_reward-if-only-langid-is-correct-too} [{language-identification_add-solo-urls-too,language-identification_target-applies-only-to-trg-side,langid-and-urls_classification_reward-if-only-langid-is-correct-too} ...]]
                        Set of options which will set up some aspects of the
                        auxiliary tasks (default: None)
  --do-not-train-main-task
                        Main task (URLs classification) will not be trained.
                        Auxiliary task will be needed (default: False)
  --pre-load-shards     Load all shards at the beginning, one by one, in
                        order to get some statistics needed for some
                        features. This option is optional, but if not set,
                        some features might not work as expected (e.g. linear
                        LR scheduler) (default: False)
  --seed SEED           Seed in order to have deterministic results (not
                        fully guaranteed). Set a negative number in order to
                        disable this feature (default: 71213)
  --plot                Plot statistics (matplotlib pyplot) in real time
                        (default: False)
  --plot-path PLOT_PATH
                        If set, the plot will be stored instead of displayed
                        (default: None)
  --lock-file LOCK_FILE
                        If set, and the file does not exist, it will be
                        created once the training finishes. If it does exist,
                        the training will not be executed (default: None)
  --waiting-time WAITING_TIME
                        Waiting time, if needed for letting the user react
                        (default: 20)
```
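
The three positional datasets are TSV files of URL pairs. The help output does not spell out the exact column layout, so the following is only a minimal sketch under the assumption of one pair per line, with the source and target URL separated by a tab (`/tmp/PUC_sample.tsv` is a made-up path):

```shell
# Hypothetical dataset file: one URL pair per line, source URL and target URL
# separated by a tab. The exact column layout is an assumption; check it
# against the TSV format your training data actually uses.
printf '%s\t%s\n' \
    'https://example.com/en/about'   'https://example.com/fr/a-propos' \
    'https://example.com/en/contact' 'https://example.com/fr/contact' \
    > /tmp/PUC_sample.tsv

# Sanity check: every line must contain exactly two tab-separated fields
awk -F'\t' 'NF != 2 { exit 1 }' /tmp/PUC_sample.tsv && echo "TSV looks well-formed"
```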

## CLI examples

Train a new model:

```bash
parallel-urls-classifier /path/to/datasets/{train,dev,test}.tsv \
    --regression --epochs 10 --patience 5 --batch-size 140 --train-until-patience \
    --model-output /tmp/PUC_model
```

Interactive inference:

```bash
parallel-urls-classifier --regression --parallel-likelihood --inference \
    --model-input /tmp/PUC_model
```

Inference using data from a file:

```bash
cat /path/to/datasets/test.tsv \
  | parallel-urls-classifier --regression --parallel-likelihood --inference --inference-from-stdin \
      --model-input /tmp/PUC_model
```
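
During inference, the documented `--threshold` flag filters the likelihood output so that only pairs likely to be parallel are printed. A sketch, reusing the placeholder model path from the examples in this section (it requires a trained model, so it is not runnable as-is):

```shell
# Print only URL pairs whose parallel likelihood exceeds 0.9
# (/tmp/PUC_model is the placeholder model path used throughout this README)
cat /path/to/datasets/test.tsv \
  | parallel-urls-classifier --regression --parallel-likelihood --inference --inference-from-stdin \
      --threshold 0.9 \
      --model-input /tmp/PUC_model
```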

## Inference using Gunicorn

You can use different nodes to perform inference and to run the model. For more information about running the model server, see:

```bash
parallel-urls-classifier-server --help
```

You may also want to look at the script `scripts/init_flask_server_with_gunicorn.sh` for a specific example of how to start the server.

The node performing the inference sends the URL pairs to the node running the model through HTTP requests. For example, if the model runs on `127.0.0.1`:

```bash
curl http://127.0.0.1:5000/inference -X POST \
    -d "src_urls=https://domain/resource1&trg_urls=https://domain/resource2"
```
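
Real URLs often contain characters such as `&` or `=` that would break a hand-built `-d` body. curl can escape each field value itself via `--data-urlencode`; a sketch of the same request (the server is assumed to be listening on `127.0.0.1:5000`, as above, so this is not runnable without it):

```shell
# Same request as above, but curl URL-encodes each field value itself,
# which is safer when the URLs contain '&', '=', or spaces
curl http://127.0.0.1:5000/inference -X POST \
    --data-urlencode "src_urls=https://domain/resource1" \
    --data-urlencode "trg_urls=https://domain/resource2"
```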