Description
When using the tune function to tune hyperparameters, I often end up with OOM errors, even though training models individually works fine. Perhaps related to #433.
When trying to tune hyperparameters for a span_classifier model based on eds-camembert, I get OOM errors after a few trials. I have tried adding cleanup callbacks to free the memory (using gc.collect(), torch.cuda.empty_cache() and torch.cuda.ipc_collect(); see the sketch below), but this only partially works. When monitoring the GPU memory with nvidia-smi during tuning, the memory rises to around 10 GB during a training run, then drops to about 500 MB between trials, except that sometimes (seemingly at random) the memory exceeds the 32 GB limit.
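For reference, the cleanup logic I run between trials looks roughly like the sketch below. The function body is exactly the calls mentioned above; how it gets hooked into tune is my own setup, so the callback signature (accepting arbitrary arguments) is an assumption rather than anything provided by EDS-NLP.

```python
import gc

import torch


def free_gpu_memory(*args, **kwargs):
    """Best-effort GPU memory cleanup run between trials.

    *args/**kwargs are accepted only so this can be passed as a callback
    whatever arguments the caller provides (assumption about the
    integration, not part of the report itself).
    """
    gc.collect()                  # drop unreachable Python objects still holding tensors
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached CUDA blocks to the driver
        torch.cuda.ipc_collect()  # release CUDA IPC memory left by dead processes
```

The memory figures above were read with something like `nvidia-smi --query-gpu=memory.used --format=csv -l 1` running alongside the tuning job.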
It has never happened on the first trial, and when I train a model using the train API with the parameters that were supposed to be used during the trial that crashed, everything works fine.
Maybe this is related to #433: sometimes a trial is not terminated correctly and does not free its memory, leading to the OOM.
Your Environment
- Operating System:
- Python Version Used: 3.7.12
- spaCy Version Used: 2.2.4
- EDS-NLP Version Used: 0.17.2