Description
I need to compute custom metrics during training. I first thought it would be as simple as adding my own metric function to some callback, but I couldn't find anything like this in the docs or in existing issues. I would be fine with just having a callback that fires when a new checkpoint is saved, or when the validation step runs.
Workaround
My current workaround is to run a second process that continuously watches the checkpoint directory; whenever a new checkpoint appears, it runs my metric computation (a rough sketch is included after the list below).
It works, but it's far from ideal for several reasons:
- The evaluation process is not synchronized with the training process: if training is killed, the evaluation process may keep running, so extra work is needed to manage its lifetime as well.
- Because the evaluation process runs while training steps are executing, it cannot share the training GPU, so it either needs a dedicated GPU of its own or has to run on CPU. It would be much better to run it on the same GPU, in between training steps; that would also be faster, since the model wouldn't need to be reloaded into memory every time.
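For reference, here is a minimal sketch of that watcher, assuming a `checkpoints/` directory, a `*.ckpt` naming scheme, and a `compute_metrics` function (all hypothetical placeholders for my actual setup):

```python
import time
from pathlib import Path

CHECKPOINT_DIR = Path("checkpoints")  # hypothetical checkpoint location
POLL_INTERVAL = 30  # seconds between directory scans

def compute_metrics(checkpoint_path: Path) -> None:
    """Hypothetical stand-in for my custom metric computation."""
    print(f"evaluating {checkpoint_path}")

def watch() -> None:
    seen = set()
    while True:
        # Scan for checkpoints we haven't evaluated yet (assumed naming scheme).
        for ckpt in sorted(CHECKPOINT_DIR.glob("*.ckpt")):
            if ckpt not in seen:
                seen.add(ckpt)
                compute_metrics(ckpt)
        time.sleep(POLL_INTERVAL)

if __name__ == "__main__":
    watch()
```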
Question
How can I add a callback during the validation step or after a new checkpoint is saved? If there is no out-of-the-box solution today, I would be happy to open a PR if you can give me some pointers to what would need to change.
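To make the ask concrete, here is the kind of hook I have in mind; every name below is hypothetical, not something I found in the current API:

```python
from typing import Any, Callable

# Hypothetical signature: the trainer would invoke this in-process,
# on the training GPU, right after saving a checkpoint.
CheckpointCallback = Callable[[Any, str, int], None]  # (model, checkpoint_path, step)

def my_metrics_callback(model: Any, checkpoint_path: str, step: int) -> None:
    """Compute custom metrics with the model that is already in memory."""
    # ... run my evaluation here and log the results ...
    print(f"step {step}: evaluated checkpoint at {checkpoint_path}")

# Hypothetical registration call; the actual mechanism is what I'm asking about:
# trainer.add_callback(on_checkpoint_saved=my_metrics_callback)
```

An equivalent hook triggered at the validation step would work just as well for my use case.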