Fix load_best_model_at_end=True to work when save_steps < eval_steps #39560
Conversation
Mind checking if this works @KrisTHL181?
I ran a minimal reproduction with save_steps=3 and eval_steps=2 and observed that the checkpoint marked as best_model_checkpoint (e.g., checkpoint-6) was later deleted during checkpoint rotation, despite load_best_model_at_end=True being set. At the end of training, this caused a FileNotFoundError when attempting to load the best model. From real-time monitoring of the output directory, I confirmed that the checkpoint did exist temporarily but was removed by the cleanup process before training completed.

The current logic in _sorted_checkpoints correctly identifies the best model and attempts to preserve it by adjusting the order of checkpoints. However, the method used (swapping elements iteratively toward the end) may not fully move the best checkpoint to the tail of the list, especially if it is far from the end. As a result, when _rotate_checkpoints applies save_total_limit, the best model can still end up in the deletion list.

In short, the best model is tracked correctly, but its protection is incomplete due to insufficient handling in _sorted_checkpoints. The interaction between save_total_limit and older, non-consecutive best checkpoints can lead to premature deletion, even when metrics confirm that a best model was found. This behavior can result in failures during model recovery and warrants further investigation.
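To make the concern concrete, here is a minimal, self-contained sketch (not the actual Trainer code) of an exclusion-based protection: the best checkpoint is simply removed from the deletion candidates instead of being re-ordered. The function name rotate_checkpoints and its signature are hypothetical and chosen only for this example.

import shutil
from pathlib import Path
from typing import Optional

def rotate_checkpoints(output_dir: str, save_total_limit: int,
                       best_model_checkpoint: Optional[str] = None) -> None:
    # Illustrative only: not the transformers implementation.
    # Sort checkpoint-* directories by global step (oldest first).
    checkpoints = sorted(
        (p for p in Path(output_dir).glob("checkpoint-*") if p.name.split("-")[-1].isdigit()),
        key=lambda p: int(p.name.split("-")[-1]),
    )
    protected = Path(best_model_checkpoint).resolve() if best_model_checkpoint else None
    # Drop the best checkpoint from the deletion candidates entirely,
    # instead of relying on its position in a re-ordered list.
    candidates = [p for p in checkpoints if p.resolve() != protected]
    excess = len(checkpoints) - save_total_limit
    for stale in candidates[:max(excess, 0)]:
        shutil.rmtree(stale, ignore_errors=True)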
On it 🫡, thanks for the detailed review ❤️
@KrisTHL181 can you please share the reproduction?
Great fix! It looks like the best_model_checkpoint is indeed protected! Here's the code:

import os
import tempfile
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
    set_seed,
)
set_seed(42)
dataset = load_dataset("imdb", split="train[:100]").shuffle(seed=42)
dataset = dataset.map(lambda x: {"label": int(x["label"])}, batched=False)
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=64)
tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
training_args = TrainingArguments(
    output_dir="./pr-test",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    save_steps=3,  # misaligned with eval_steps
    eval_steps=2,  # misaligned with save_steps
    save_total_limit=2,  # only keep 2 checkpoints
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    greater_is_better=True,  # NOTE: ONLY FOR TEST!
    eval_strategy="steps",
    logging_dir=os.path.join("pr-test", "logs"),
    logging_steps=1,
    report_to="none",
    preserve_best_model=True,
)
# We intentionally set greater_is_better=True with 'loss' to force
# an early checkpoint to be marked as best_model_checkpoint.
# This tests whether the Trainer correctly preserves it even when save_total_limit
# would otherwise delete older checkpoints.
import numpy as np
from sklearn.metrics import accuracy_score
def compute_metrics(pred):
    logits, labels = pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, predictions)}
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
    eval_dataset=tokenized_datasets,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
# check
best_ckpt = trainer.state.best_model_checkpoint
assert best_ckpt is not None, "[FAIL] best_ckpt is None"
print(f"best_ckpt = {best_ckpt}")
assert os.path.exists(best_ckpt), f"[FAIL] best_ckpt folder not found: {best_ckpt}"

To illustrate this, here's what happens in practice when eval_steps=2 and save_steps=3:

┌──(kris㉿KrisTHL181)-[/media/kris/SwapTemp/transformers]
└─$ head pr-test/checkpoint-13/trainer_state.json
{
"best_global_step": 2,
"best_metric": 0.4929685592651367,
- "best_model_checkpoint": null,
+ // No checkpoint saved at step 2 → best model lost
"epoch": 1.0,
"eval_steps": 2,
"global_step": 13,
"is_hyper_param_search": false,
"is_local_process_zero": true,
"is_world_process_zero": true,
┌──(kris㉿KrisTHL181)-[/media/kris/SwapTemp/transformers]
└─$ ls pr-test
checkpoint-12 checkpoint-13
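The mismatch is easy to verify by listing which steps trigger an evaluation versus a save in this run (13 optimizer steps; the end-of-training save that produces checkpoint-13 above is ignored here for simplicity). This is a quick sanity check, not Trainer code:

total_steps = 13
eval_steps, save_steps = 2, 3

eval_at = [s for s in range(1, total_steps + 1) if s % eval_steps == 0]
save_at = [s for s in range(1, total_steps + 1) if s % save_steps == 0]

print(eval_at)                                   # [2, 4, 6, 8, 10, 12]
print(save_at)                                   # [3, 6, 9, 12]
print([s for s in eval_at if s not in save_at])  # metric but no checkpoint: [2, 4, 8, 10]

Step 2 produces the best metric but no checkpoint, which is exactly why best_global_step is 2 while best_model_checkpoint stays null.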
Quick update: I manually created checkpoint-2 (copied from checkpoint-12) and set best_model_checkpoint to point to it.
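For reference, that manual workaround roughly amounts to the following sketch (paths assumed from the run above; adjust to your output directory):

import json
import shutil
from pathlib import Path

out = Path("./pr-test")
src, dst = out / "checkpoint-12", out / "checkpoint-2"

# Reuse an existing checkpoint's files to stand in for the missing best checkpoint.
# Note: this only restores a loadable path; the copied weights are from step 12, not step 2.
shutil.copytree(src, dst, dirs_exist_ok=True)

# Point the latest trainer state at the recreated directory.
state_path = out / "checkpoint-13" / "trainer_state.json"
state = json.loads(state_path.read_text())
state["best_model_checkpoint"] = str(dst)
state_path.write_text(json.dumps(state, indent=2))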
@ved1beta This looks like it solves the issue! It's working as expected :D

┌──(kris㉿KrisTHL181)-[/media/kris/SwapTemp/transformers/pr-test]
└─$ ls -1
checkpoint-13
+ checkpoint-2
┌──(kris㉿KrisTHL181)-[/media/kris/SwapTemp/transformers/pr-test]
└─$ head checkpoint-13/trainer_state.json
{
"best_global_step": 2,
"best_metric": 0.4929685592651367,
+ "best_model_checkpoint": "./pr-test/checkpoint-2",
"epoch": 1.0,
"eval_steps": 2,
"global_step": 13,
"is_hyper_param_search": false,
"is_local_process_zero": true,
"is_world_process_zero": true, |
What does this PR do?
This PR fixes best-model checkpoint tracking and rotation in the Trainer when save_steps and eval_steps are misaligned, addressing issue #39476.
Before submitting
@SunMarc