Out of Memory Error when tuning hyperparameters #434

@theoimbert-aphp

Description

When using the tune function to tune hyperparameters, I often end up with OOM errors, even though training models individually works fine. Perhaps related to #433.

When trying to tune hyperparameters for a span_classifier model based on eds-camembert, I get OOM errors after a few trials. I have tried adding callbacks to free the memory (using gc.collect(), torch.cuda.empty_cache() and torch.cuda.ipc_collect()), but it only seems to work partially. When monitoring GPU memory with nvidia-smi during tuning, usage climbs to around 10 GB during training and drops back to about 500 MB between trials, except that sometimes (seemingly at random) it exceeds the 32 GB limit.
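For reference, this is roughly the cleanup callback I run between trials. It is only a minimal sketch: the `(study, trial)` signature assumes an Optuna-style study callback, which may not match exactly how tune exposes per-trial hooks.

```python
import gc

import torch


def free_gpu_memory(study=None, trial=None):
    # Signature assumes an Optuna-style study callback; adapt to
    # however the tune function actually exposes per-trial hooks.
    gc.collect()              # drop Python references to dead tensors
    torch.cuda.empty_cache()  # return cached allocator blocks to the driver
    torch.cuda.ipc_collect()  # release CUDA IPC memory left by dead workers
```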

It has never happened on the first trial, and when I train a model with the train API using the same parameters as the trial that crashed, everything works fine.

Maybe this is related to #433: sometimes a trial isn't killed correctly and does not free its memory, leading to the OOM.
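One way to check this hypothesis would be to log PyTorch's own view of GPU memory at trial boundaries: if memory_reserved is still high when the crashing trial starts, the previous trial's allocator was never released. A sketch (where exactly this gets called depends on what hooks tune provides):

```python
import torch


def log_cuda_memory(tag: str) -> None:
    # memory_allocated(): bytes held by live tensors on the current device.
    # memory_reserved(): bytes the caching allocator keeps from the driver,
    # which is roughly what nvidia-smi reports for the process.
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"[{tag}] allocated={allocated:.2f} GiB  reserved={reserved:.2f} GiB")
```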

Your Environment

  • Operating System:
  • Python Version Used: 3.7.12
  • spaCy Version Used: 2.2.4
  • EDS-NLP Version Used: 0.17.2
