
Conversation

@nkemnitz (Collaborator) commented Aug 11, 2025

Passing the model to the JIT export subprocess still shares its CUDA tensors with the main process. The trace can silently fail (see Warning) and modify the behavior of the model during training.

This PR:

  • Allows users to independently disable JIT and ONNX export
  • Fixes a few warnings related to (unnecessary) branching in our crop method (though that didn't solve the issues I encountered)
  • Completely recreates the model inside the JIT export subprocess to avoid any side effects on the main-process model, at the cost of a slightly larger (GPU) memory footprint (see the sketch after this list)
  • Still keeps the spawned process to avoid memory leaks
  • Fixes what I think is a bug: our exports were done in training mode
  • Adds tests to verify the export functions are called when expected
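
A minimal sketch of that approach (not the actual zetta_utils code; build_model, model_spec, and export_jit_in_subprocess are placeholder names): the subprocess rebuilds the model from a spec plus a CPU state dict instead of receiving the live CUDA model, and exports in eval mode.

import torch
import torch.multiprocessing as mp
import torch.nn as nn

def build_model(model_spec: dict) -> nn.Module:
    # Placeholder factory -- stands in for however the training config builds the model.
    return nn.Sequential(nn.Linear(model_spec["in"], model_spec["out"]))

def _jit_export_worker(model_spec, state_dict_cpu, trace_input, out_path):
    model = build_model(model_spec)   # fresh instance: nothing aliases the parent's CUDA tensors
    model.load_state_dict(state_dict_cpu)
    model.eval()                      # export in eval mode rather than training mode
    traced = torch.jit.trace(model, trace_input)
    traced.save(out_path)

def export_jit_in_subprocess(model_spec, model, trace_input, out_path):
    # Move the weights to CPU so the child never receives shared CUDA storage.
    state_dict_cpu = {k: v.detach().cpu() for k, v in model.state_dict().items()}
    ctx = mp.get_context("spawn")     # keep the spawned process to contain tracing memory leaks
    proc = ctx.Process(target=_jit_export_worker,
                       args=(model_spec, state_dict_cpu, trace_input, out_path))
    proc.start()
    proc.join()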

@nkemnitz force-pushed the nkem/fix-jit-trace branch from d1b2c03 to 5f8a0b7 on August 11, 2025 at 11:38
codecov bot commented Aug 11, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.95%. Comparing base (05256cf) to head (82607e8).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1012   +/-   ##
=======================================
  Coverage   99.95%   99.95%           
=======================================
  Files         179      180    +1     
  Lines        8497     8518   +21     
=======================================
+ Hits         8493     8514   +21     
  Misses          4        4           


@nkemnitz force-pushed the nkem/fix-jit-trace branch 3 times, most recently from 2f006f2 to 743c32d, on August 12, 2025 at 22:18
@supersergiy (Member):

Are memory leaks confirmed to be still an issue?

@nkemnitz (Collaborator, Author):

The original issue is still open, and the example there still shows the same behavior.

@nkemnitz force-pushed the nkem/fix-jit-trace branch from 743c32d to 9a2f6ee on August 18, 2025 at 18:28
@trivoldus28 (Contributor):

LGTM, but maybe the default should be false for both exports, or enable ONNX and disable JIT. We don't need both, and I just tested the ONNX export with the code from the torch GitHub issue and it doesn't leak memory (JIT trace still does with torch 2.6).

@nkemnitz (Collaborator, Author):

I have no strong opinion on the defaults. For this PR I just replicated the current behavior.

@trivoldus28 (Contributor) commented Aug 19, 2025

The current behavior is to always export ONNX? Edit: yeah, it does. I think there's no point exporting both, especially JIT. It's just slowing down training unnecessarily.

@trivoldus28 (Contributor):

BTW, I took a second look at the linked warning, and I don't think it means what you described (modifying the model during tracing).

@nkemnitz force-pushed the nkem/fix-jit-trace branch 2 times, most recently from a6974eb to d5e970d, on August 20, 2025 at 21:22
@nkemnitz (Collaborator, Author):

OK, after updating drivers and being forced to reinstall cudatoolkit, I can no longer replicate the exploding loss caused by JIT trace and/or export... so I reverted the deepcopy of the model including CUDA tensors. But I at least added a bunch of tests to ensure the checkpoint methods are correctly called / not called.

@trivoldus28 (Contributor):

The model state was changing unexpectedly?

def test_save_checkpoint_calls_exports_when_enabled(
    trainer_mocks, mock_model, mock_trace_input, mock_lightning_module
):
    """Test that save_checkpoint calls export functions when exports are enabled."""
@trivoldus28 (Contributor):

Is this test necessary? Isn't it simpler to just add an assert in the code that checks that at least one export function is enabled?

@nkemnitz (Collaborator, Author):

That's different. I am OK with the user disabling both exports, so that assert would be too restrictive anyway.

Here I am just testing that the parameters get passed down the chain and don't get lost while unpacking/modifying **kwargs somewhere.
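
A minimal sketch of that kind of check (the module path, fixtures, and flag name are hypothetical, not the actual test code):

from unittest import mock

def test_flag_reaches_jit_export(trainer, checkpoint_path):
    # Patch the export helper and assert the enable flag survives the **kwargs plumbing.
    with mock.patch("my_project.checkpoint.jit_trace_export") as jit_export:
        trainer.save_checkpoint(checkpoint_path, enable_jit_export=True)
        jit_export.assert_called_once()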

@trivoldus28 (Contributor):

"That's different. I am OK with the user disabling both exports, so that assert would be too restrictive anyway."

Right, I misread the test.

But yeah, it still seems trivial? Is it really doing much more than a simple line-coverage test (i.e., call the save function and check that the saved file exists)? Making sure the functions are called seems only marginally better.

def test_save_checkpoint_skips_jit_when_disabled(
    trainer_mocks, mock_model, mock_trace_input, mock_lightning_module
):
    """Test that save_checkpoint skips JIT export when disabled."""
@trivoldus28 (Contributor):

Seems unnecessary. If jit.trace is disabled in the input param, it's clear that it's disabled.

@nkemnitz (Collaborator, Author):

The test verifies that the export logic actually respects the parameter that was set.

Here is the kind of copy & paste error future-us could introduce during refactoring:

if self.enable_onnx_export:
    jit_trace_export(...)

@trivoldus28 (Contributor):

A simple line coverage test would have caught this?

@nkemnitz (Collaborator, Author):

OK, but onnx_export might also get called from a different location (either correctly or incorrectly). My point is: 100% line coverage is nice, but it's easy to achieve while fooling ourselves. Ideally I would like to ensure the method is called from all the expected code paths (and only those). 100% code coverage ensures no dead code and no broken code, but it does not ensure correct results.

But also, something is wrong here. I should not have been able to achieve 100% code coverage without these tests in the first place. The error-handling paths in particular can't possibly have been covered before...



def test_save_checkpoint_skips_exports_non_global_zero(trainer_mocks):
    """Test that save_checkpoint skips exports when not global zero rank."""
@trivoldus28 (Contributor):

This seems useful, but it would miss additional exports added in the future.

@nkemnitz (Collaborator, Author):

"The model state was changing unexpectedly?"

Yeah, the CUDA tensors are shared between processes (even after deepcopy), and somehow just the serialization/deserialization step of the model was modifying the model itself, causing the training loss to suddenly explode in my main process...
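
For illustration, a minimal standalone sketch of the sharing behavior (assumes a CUDA device is available): a CUDA tensor handed to a spawned process is shared via CUDA IPC, so in-place writes in the child are visible to the parent.

import torch
import torch.multiprocessing as mp

def child(t):
    t.add_(1.0)  # in-place write in the subprocess

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    t = torch.zeros(4, device="cuda")
    proc = mp.Process(target=child, args=(t,))
    proc.start()
    proc.join()
    print(t)  # the parent sees tensor([1., 1., 1., 1.], device='cuda:0')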

@trivoldus28 (Contributor):

Have you actually confirmed that the training state changing is the root cause of the issue, though? It seems unlikely to me that some external libraries/drivers would be specific enough to torch to change that particular variable and cause corruption.

I would actually guess that on a configuration with exploding losses, the test would pass and you'd still get that error.

@trivoldus28 (Contributor):

@nkemnitz Can you also disable jit.trace export by default?
