Conversation

@danielzuegner (Contributor) commented Nov 25, 2025

Summary

  • Adds functionality to read the latest step per trajectory and per trajectory reporter, and to set the initial step accordingly.
  • Adds tests test_optimize_append_to_trajectory and test_integrate_append_to_trajectory to ensure trajectories are being appended to.
  • Also fixes a bug where, whenever a structure has converged, all other trajectories are reset by re-opening them in "w" mode: https://github.com/TorchSim/torch-sim/blob/main/torch_sim/runners.py#L483.

Closes #356.

Checklist

Before a pull request can be merged, the following items must be checked:

  • Doc strings have been added in the Google docstring format.
  • Run ruff on your code.
  • Tests have been added for any new functionality or bug fixes.

@orionarcher self-requested a review November 25, 2025 17:23

@orionarcher (Collaborator) left a comment

Also fixes a bug where, whenever a structure has converged, all other trajectories are reset by re-opening them in "w" mode: https://github.com/TorchSim/torch-sim/blob/main/torch_sim/runners.py#L483.

Can you say more here? This would be a pretty serious bug because it would mean all the optimization trajectories are wrong.

More generally, I think this PR needs to grapple with the fact that if we are appending, the trajectories we are appending to might have different past progress. In a batched setting, we can't just take the maximum progress of one as the progress of all. I realize that makes literally everything more annoying but such is the curse and blessing of TorchSim.

@danielzuegner (Contributor, Author)

Also fixes a bug where, whenever a structure has converged, all other trajectories are reset by re-opening them in "w" mode: https://github.com/TorchSim/torch-sim/blob/main/torch_sim/runners.py#L483.

Can you say more here? This would be a pretty serious bug because it would mean all the optimization trajectories are wrong.

More generally, I think this PR needs to grapple with the fact that if we are appending, the trajectories we are appending to might have different past progress. In a batched setting, we can't just take the maximum progress of one as the progress of all. I realize that makes literally everything more annoying but such is the curse and blessing of TorchSim.

Unfortunately I believe this affects all optimization trajectories. Looking at load_new_trajectories, we can see that the trajectories are being instantiated from scratch with self.trajectory_kwargs, i.e., write mode by default. I noticed while debugging my PR that the longer trajectories are chopped at the start when another one finishes: I'd have steps [10, 15, 20, ...] in the trajectory instead of [5, 10, ...]. Happy to be proven wrong, but I think this is a bug that potentially affects all optimization trajectories currently.
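
(A minimal, hypothetical sketch of the kind of fix being described here, assuming the trajectory class accepts a mode argument like Python's open(); the names below are placeholders, not the actual torch-sim API.)

import os

def reopen_trajectories(filenames, trajectory_cls, trajectory_kwargs):
    # Re-open trajectory files without wiping frames already on disk: the key
    # difference from the current behaviour is choosing the mode per file instead
    # of always passing the default trajectory_kwargs (which imply mode="w").
    trajectories = {}
    for name in filenames:
        kwargs = dict(trajectory_kwargs)
        # Append if the file already exists, otherwise create it fresh.
        kwargs["mode"] = "a" if os.path.exists(name) else "w"
        trajectories[name] = trajectory_cls(name, **kwargs)
    return trajectories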

@thomasloux (Collaborator)

I would rather push for a PR allowing for a restart of simulation #43. In practice the difference is that, supposing you run ts.integrate(n_steps=1000) multiple times, the first call will run the full 1000 steps but subsequent calls will finish immediately. You could add an argument to explicitly run for longer without having to read the trajectory yourself.
Note: for a real restart or a run-for-longer, this doesn't change the fact that the TrajectoryReporter currently only saves the SimState, and not any extra variables of a subclass. For instance, if you run an NVT Nosé-Hoover MD, this means you lose the state of the thermostat, so even running for longer is not technically correct. This would be true as well for a (near-future) LBFGS optimization.

@danielzuegner (Contributor, Author)

I would rather push for a PR allowing for a restart of simulation #43. In practice the difference is that, supposing you run ts.integrate(n_steps=1000) multiple times, the first call will run the full 1000 steps but subsequent calls will finish immediately. You could add an argument to explicitly run for longer without having to read the trajectory yourself. Note: for a real restart or a run-for-longer, this doesn't change the fact that the TrajectoryReporter currently only saves the SimState, and not any extra variables of a subclass. For instance, if you run an NVT Nosé-Hoover MD, this means you lose the state of the thermostat, so even running for longer is not technically correct. This would be true as well for a (near-future) LBFGS optimization.

@thomasloux happy to change it to this way of running. Just to make sure I understand, the proposal is to:

  • Detect the last time step per trajectory (say, [5, 10] for a toy example of two systems).
  • If n_steps=15, we run 10 more steps for the first one and 5 more steps for the second trajectory.

If that's the case, I think we have to do it like in the optimize() case, where convergence_tensor would be the tensor of bools indicating whether each trajectory has reached its terminal step yet. Is that understanding correct?
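
(As a rough, hypothetical illustration of that idea, with placeholder names rather than the real torch-sim internals, the per-trajectory progress could be turned into such a boolean tensor as follows.)

import torch

def terminal_step_mask(last_steps, n_steps):
    # last_steps: last written step per trajectory, e.g. [5, 10]
    # n_steps: the target step, e.g. 15
    # True means "already done"; an optimize-style loop would stop those systems.
    return torch.tensor(last_steps) >= n_steps

print(terminal_step_mask([5, 10], 15))  # tensor([False, False]) -> both still need steps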

@thomasloux (Collaborator)

For MD simulations at least, there is no such thing; all systems evolve at the same time. I actually assume it's the same for optimize, apart from the convergence criteria that may be met.
So in the case of ts.integrate, you would get n_step from the trajectory and run a for loop like range(n_step_traj, n_step_total) instead of the current PR suggestion range(n_initial, n_initial + n_step), which then requires manually removing n_initial when you want to access kT values, for instance.
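
(To make the two conventions concrete, a small hypothetical sketch with made-up values; neither loop is the actual torch-sim implementation.)

# Restart-style loop (the suggestion above): step indices continue from the
# trajectory's last step up to an absolute total, so reported step numbers and
# per-step quantities such as a kT schedule line up without any offset.
n_step_traj, n_step_total = 400, 1000
for step in range(n_step_traj, n_step_total):
    pass  # advance the integrator and report at `step`

# Current-PR-style loop: run n_step additional steps starting from n_initial,
# which means subtracting n_initial whenever indexing per-step quantities.
n_initial, n_step = 400, 1000
for step in range(n_initial, n_initial + n_step):
    pass  # advance the integrator and report at `step`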

@thomasloux (Collaborator) commented Nov 26, 2025

I don't have any strong opinion on optimization, as I don't use it that much.

@danielzuegner (Contributor, Author)

For MD simulations at least, there is no such thing; all systems evolve at the same time. I actually assume it's the same for optimize, apart from the convergence criteria that may be met. So in the case of ts.integrate, you would get n_step from the trajectory and run a for loop like range(n_step_traj, n_step_total) instead of the current PR suggestion range(n_initial, n_initial + n_step), which then requires manually removing n_initial when you want to access kT values, for instance.

Thanks @thomasloux! I've updated the code to assume that for integrate() all trajectories start from the same time step, and then only integrate for n_steps - initial_step steps. Is that in line with what you had in mind?

@thomasloux (Collaborator) left a comment

That's exactly what I wanted for restart and ts.integrate! Just waiting for the tests to pass to approve.

@danielzuegner (Contributor, Author)

Will take a look at the failures and hopefully fix them.

@danielzuegner (Contributor, Author)

Tests pass locally for me now, let's see what happens in the CI.

Daniel Zuegner added 2 commits November 27, 2025 13:45
@thomasloux (Collaborator) left a comment

One part to clarify for ts.optimize; otherwise it's really nice work. I think in a later PR we should clarify the Trajectory objects. I'm not a fan of the mode 'r' vs 'a'; I would probably prefer an argument, either 'restart=True' or 'overwrite_if_exists=True'.

@thomasloux (Collaborator) left a comment

Good for me now.

@orionarcher (Collaborator) left a comment

I've updated the code to assume that for integrate() all trajectories start from the same time step, and then only integrate for n_steps - initial_step steps. Is that in line with what you had in mind?

Sorry for being a bit late on this, but I am going to push back on this implementation.

I think that n_steps should mean the number of steps you take, not the number of steps you reach. My heavy prior is that n_steps is the number of steps taken; as a user, I'd find it quite strange if that weren't the case. What do you think?

I also don't think that, if the steps don't match, we should truncate all the trajectories to the same length. In my mind, we should detect the current step of each trajectory, then progressively append at that step + i, where i is the number of steps taken by the integrator. Does that make sense?
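
(A minimal sketch of that behaviour, assuming a reporter with the last_step property from this PR; the stand-in classes and method signatures below are assumptions, not the real torch-sim API.)

# Stand-ins for the real reporter/integrator so the sketch runs on its own.
class _Reporter:
    def __init__(self): self.last_step = 90  # e.g. 90 frames already on disk
    def report(self, state, step): print(f"frame for state {state} written at step {step}")

reporter, state, n_steps = _Reporter(), 0, 3
step_fn = lambda s: s + 1  # placeholder for one integrator step

initial_step = reporter.last_step  # last step already on disk, 0 for a fresh file
for i in range(1, n_steps + 1):
    state = step_fn(state)                    # advance the integrator by one step
    reporter.report(state, initial_step + i)  # append at the absolute step index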

I want to say I am quite appreciative of the PR and I think being able to append with the runners is a good feature. I am being nitpicky because I think it's challenging to get the right behavior and expectations for this in a batched context.

@thomasloux (Collaborator) commented Nov 29, 2025

Hey Orion, I think that both implementations make sense. I see two possibilities:

  • The user has mode="a" in mind, so it makes sense for n_steps to be the number of steps to add and not the total number of steps in the trajectory.
  • The user wants to restart a trajectory (think of a failure or a slurm job that finished too early). Then I think it makes sense to run the same code and have TorchSim detect the traj file and start from it.

I think the 2nd option is more appealing than the first one. Also, we could easily obtain the first behaviour by adding an argument to the functions.

The truncation is also a direct result of the fact that I pushed for a restart behaviour. At the moment an MD simulation from ts.integrate should always be in sync for all states in the batch, so it does not make sense to allow different indices in this PR. We could change that in another PR, though.

@orionarcher (Collaborator) commented Nov 29, 2025

Good framing. Both use cases are valid, so perhaps it's worth considering the group that each default leaves out.

Finishing the simulation if n_steps = additional steps

total_steps = 1000
steps_completed_before_interrupt = my_trajectory_reporter.last_step  # using the last_step property introduced in this PR
n_steps = total_steps - steps_completed_before_interrupt
integrate(..., n_steps=n_steps)

Appending 20 steps if n_steps = total steps

new_steps = 20
current_number_of_steps = my_trajectory_reporter.last_step
n_steps = current_number_of_steps + new_steps
integrate(..., n_steps=n_steps)

Either is still accessible no matter which default is adopted, and both are about the same amount of code. Ultimately, it comes down to which is more expected and which is more common. I am not sure which is more common (it very well might be restarting), but I feel that appending steps is more expected of an n_steps argument.

At the moment an MD simulation from ts.integrate should always be in sync for all states in the batch, so it does not make sense to allow different indices in this PR. We could change that in another PR, though.

Good point that separate indices are a distinct feature and could go elsewhere. This is a potentially dangerous default though. Imagine I have 100 simulations and I can fit 10 on a GPU at a time. My simulation proceeds nicely through the first 90 and then only makes it 1 step into the final 10. If I rerun the same script expecting to restart, I'll erase the progress on the first 90 simulations because the smallest last_step will be 1. Instead of truncating, perhaps we could enforce that all trajectories have the same last_step? That feels safer to me.
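
(A sketch of that safety check, assuming some hypothetical mapping of filename to last written step; the names are placeholders, the point is to raise rather than silently truncate.)

def check_consistent_last_step(last_steps):
    # last_steps: hypothetical mapping of filename -> last written step.
    # Refuse to append when the trajectories disagree, instead of silently truncating.
    if len(set(last_steps.values())) > 1:
        raise ValueError(
            f"Trajectories are at different steps ({last_steps}); "
            "truncate or split them explicitly before appending."
        )

check_consistent_last_step({"sys_0.h5": 90, "sys_1.h5": 90})    # fine
# check_consistent_last_step({"sys_0.h5": 90, "sys_1.h5": 1})   # would raise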

The user wants to restart a trajectory (think of a failure or a slurm job that finished too early). Then I think it makes sense to run the same code and have TorchSim detect the traj file and start from it.

This is going to be difficult when some simulations run to completion and others never begin. I think some modifications will need to be made to the script to weed out already-started trajectories.

OK, as I outline these challenges and think about it, I am realizing that it's pretty hard to restart simulations in TorchSim right now and my proposal would keep it hard. Perhaps we can add some flag like restart=False in the kwargs that will nicely handle the complexities outlined above and allow a script to be rerun without worrying about any of this. That would require a bit of logic added to what we have now. I have to run right now but will think on this further.

@orionarcher (Collaborator)

In the case of an interrupted simulation, it's likely that there will be some files that are complete, some that are incomplete, and some that have not started. We need a way to resume simulations that balances ease of use, avoiding sharp edges, and encouraging best practices.

Here is what I propose:

  1. Require that all files in the TrajectoryReporter have the same number of steps, either 0 (do not exist yet) or N. If a simulation is interrupted, it becomes the responsibility of the user to identify which systems have completed and discard those from the file list. A line in the script that discards completed files can always be included and will simply do nothing on the first run.

  2. Provide a utility for resetting many systems to the same step. This can be called externally when needed, but I don't think it should be applied blindly internally, lest we wipe lots of progress. The partially complete files can be truncated to the same step and n_steps can be adjusted to account for their length.

  3. n_steps should be the number of new steps, not the total. I looked at OpenMM, LAMMPS, and ASE, and even when appending, n_steps is the number of new steps, not the total number.

Does this workflow feel too complicated? Let me know what y'all think.
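
(For point 1, the "discard completed files" line could look roughly like this sketch; last_step_on_disk is a hypothetical helper, not an existing torch-sim function.)

filenames = ["sys_0.h5", "sys_1.h5", "sys_2.h5"]  # one trajectory file per system
target_steps = 1000

def last_step_on_disk(filename):
    # Placeholder: a real implementation would open the file and read its last
    # frame index, returning 0 if the file does not exist yet.
    return 0

# Keep only the files that still need work; on a fresh run this keeps everything.
pending = [name for name in filenames if last_step_on_disk(name) < target_steps]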

@danielzuegner (Contributor, Author)

In the case of an interrupted simulation, it's likely that there will be some files that are complete, some that are incomplete, and some that have not started. We need a way to resume simulations that balances ease of use, avoiding sharp edges, and encouraging best practices.

Here is what I propose:

  1. Require that all files in the TrajectoryReporter have the same number of steps, either 0 (do not exist yet) or N. If a simulation is interrupted, it becomes the responsibility of the user to identify which systems have completed and discard those from the file list. A line in the script that discards completed files can always be included and will simply do nothing on the first run.
  2. Provide a utility for resetting many systems to the same step. This can be called externally when needed, but I don't think it should be applied blindly internally, lest we wipe lots of progress. The partially complete files can be truncated to the same step and n_steps can be adjusted to account for their length.
  3. n_steps should be the number of new steps, not the total. I looked at OpenMM, LAMMPS, and ASE, and even when appending, n_steps is the number of new steps, not the total number.

Does this workflow feel too complicated? Let me know what y'all think.

Hi @orionarcher,

Thanks for the detailed suggestions. That makes sense to me; I'll incorporate them, though due to vacation I might only have time towards the end of the month. Let's also iron out the last two open discussions in this PR and then hopefully we're good to go :).

Daniel

@danielzuegner (Contributor, Author)

@orionarcher could you take another look? I've changed the behavior to always append n_steps in the case of integrate. For optimize I've kept the behavior as-is because the argument is max_steps anyway. I've also added TrajectoryReporter.truncate_to_step to provide a higher-level API to make trajectories consistent in terms of number of steps.
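
(A hedged usage sketch in the spirit of the earlier snippets in this thread; the exact signature of truncate_to_step and the integrate call shown here are assumptions, not the verified API.)

reporter.truncate_to_step(90)  # assumed usage: cut partially written trajectories back to a common step
ts.integrate(..., n_steps=910, trajectory_reporter=reporter)  # then append the remaining steps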

The test failures here appear to be unrelated to this PR; feel free to let me know if you think otherwise.

Thanks!
