
Custom callbacks for metrics, saving checkpoints #2575

@Garfounkel


I need to compute custom metrics during training. I first thought it would be as easy as passing my own metric function to some callback, but I couldn't find anything like this in the docs or in existing issues. I would be fine with just having a callback that fires when a new checkpoint is saved, or when the validation step runs.

Workaround

My current workaround is to run a second process that constantly watches the model directory for new checkpoints. When it finds one, it runs my metrics calculation on it.

It works, but it's far from ideal, for two reasons:

  • The evaluation process is not synced with the training process, so if the training process is killed, the evaluation process may keep running. Extra work is needed to manage that process's lifetime as well.
  • Because the evaluation process runs while training steps are in progress, it cannot use the same GPU: it needs either its own dedicated GPU or to run on CPU. It would be much better to run it on the same GPU, in between training steps. This would also be faster, because the model wouldn't need to be loaded into memory every time.
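
For reference, the watcher workaround described above could look roughly like the sketch below. It is only an illustration under assumptions: the `*.ckpt` file pattern, the polling interval, and the names `find_new_checkpoints`, `watch` and `on_checkpoint` are hypothetical placeholders, not part of any library.

```python
import time
from pathlib import Path
from typing import Callable, List, Optional, Set


def find_new_checkpoints(model_dir: Path, seen: Set[Path], pattern: str = "*.ckpt") -> List[Path]:
    """Return checkpoint files not seen before (oldest first) and mark them as seen.

    The "*.ckpt" pattern is an assumption; adjust it to the real file naming.
    """
    candidates = sorted(model_dir.glob(pattern), key=lambda p: (p.stat().st_mtime, p.name))
    new = [p for p in candidates if p not in seen]
    seen.update(new)
    return new


def watch(
    model_dir: Path,
    on_checkpoint: Callable[[Path], None],
    poll_seconds: float = 30.0,
    max_polls: Optional[int] = None,
) -> None:
    """Poll `model_dir` and call `on_checkpoint` once per new checkpoint file.

    Runs forever unless `max_polls` is given. Nothing here ties the watcher to
    the training process, which is the lifetime problem described above.
    """
    seen: Set[Path] = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for ckpt in find_new_checkpoints(model_dir, seen):
            on_checkpoint(ckpt)  # e.g. load the model and compute the metrics
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(poll_seconds)
```

Every call to `on_checkpoint` has to reload the model from disk in a separate process, which is the cost the second point refers to.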

Question

How can I add a callback that runs during the validation step or after a new checkpoint is saved? If there is no out-of-the-box solution, I would be happy to open a PR if you can give me some pointers on what would need to change.
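
To make the request concrete, here is a minimal sketch of the kind of hook being asked for. It is not this project's API: the `Callback` base class, the hook names `on_validation_end` and `on_checkpoint_saved`, and the toy `train` loop are all hypothetical and only illustrate the shape of the feature.

```python
from typing import Any, Callable, Dict, List, Tuple


class Callback:
    """Hypothetical base class; subclasses override only the hooks they need."""

    def on_validation_end(self, step: int, model: Any, metrics: Dict[str, float]) -> None:
        pass

    def on_checkpoint_saved(self, step: int, model: Any, path: str) -> None:
        pass


class CustomMetric(Callback):
    """Computes a user-supplied metric on the in-memory model after each validation."""

    def __init__(self, name: str, fn: Callable[[Any], float]) -> None:
        self.name = name
        self.fn = fn
        self.history: List[Tuple[int, float]] = []

    def on_validation_end(self, step: int, model: Any, metrics: Dict[str, float]) -> None:
        value = self.fn(model)
        metrics[self.name] = value  # expose it next to the built-in metrics
        self.history.append((step, value))


def train(
    model: Any,
    num_steps: int,
    eval_every: int,
    callbacks: List[Callback],
    train_step: Callable[[Any], None],
    validate: Callable[[Any], Dict[str, float]],
    save_checkpoint: Callable[[Any, int], str],
) -> None:
    """Toy training loop showing where the two hooks would fire."""
    for step in range(1, num_steps + 1):
        train_step(model)
        if step % eval_every == 0:
            metrics = validate(model)
            for cb in callbacks:
                cb.on_validation_end(step, model, metrics)
            path = save_checkpoint(model, step)
            for cb in callbacks:
                cb.on_checkpoint_saved(step, model, path)
```

Because the hooks run inside the training process, between training steps, the metric function receives the model that is already in memory on the same device, and it stops when training stops.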
