Page last modified 14:34, 27 May 2010 by mabraham?

Extending Simulations


It is possible to extend or continue a simulation that has completed, and even one that has crashed (see Doing Restarts). Running a long simulation as a series of continuations is good practice, because it limits the time lost if the computer(s) being used crash. A simulation can only be continued seamlessly from the last point at which full-precision coordinates and velocities were written for the system. You therefore have to make sure that they are written at sufficiently short intervals (for example, several times per day) to avoid losing too much simulation time to a crash.

 

Version 4

A simulation that has completed can be extended using tpbconv, mdrun and checkpoint files (.cpt). A simulation that has terminated, but not completed (e.g. because of queue limits or the use of the -maxh option of mdrun) can be continued without needing to use tpbconv.

First, the number of steps or the simulation time has to be changed in the .tpr file; the simulation is then continued from the last checkpoint with mdrun. This produces a binary identical simulation, that is, one that is the same as if a single continuous run had been made.

tpbconv -s previous.tpr -extend timetoextendby -o next.tpr
mdrun -s next.tpr -cpi previous.cpt

You might want to use the -append option of mdrun to append the new output to the old files. Note that this will only work when the old output files have not been modified by the user.

When running in a queuing system, it is useful to set the number of steps you want for the total simulation with grompp or tpbconv and use the -maxh option of mdrun to gracefully stop the simulation before the queue time ends. With this procedure you can simply continue by letting mdrun read checkpoint files and no other tools are required.
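
As a rough sketch of the queue workflow described above, the job-script fragment below builds and prints the mdrun command for one job. The 24-hour queue limit, the 0.95 safety factor and the file names topol.tpr and state.cpt are assumptions made for illustration, not values taken from this page; adjust them for your own queue and run.

```shell
#!/bin/sh
# Sketch of a batch-job body for a queue with a fixed wall-time limit.
# Assumed for illustration: WALL_HOURS, the 0.95 factor and both file names.

WALL_HOURS=24        # hypothetical queue limit in hours
TPR=topol.tpr        # run input that already holds the TOTAL number of steps
CPT=state.cpt        # checkpoint left behind by an earlier job, if any

# Build the mdrun command line. -maxh is set a little below the queue limit
# so that mdrun has time to stop and write its final checkpoint.
build_mdrun_cmd() {
    maxh=$(awk -v h="$WALL_HOURS" 'BEGIN { printf "%.1f", h * 0.95 }')
    cmd="mdrun -s $TPR -maxh $maxh"
    # Continue from the checkpoint only if a previous job wrote one.
    if [ -f "$CPT" ]; then
        cmd="$cmd -cpi $CPT -append"
    fi
    echo "$cmd"
}

# Print the command for inspection before using it in a real job script.
build_mdrun_cmd
```

Submitting the same script repeatedly then continues the run job after job: the first job starts from the .tpr file alone, and every later job picks up the checkpoint written by the one before it.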

The time can also be extended using the -until and -nsteps options with tpbconv.
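
The three options express the same change in different ways, which the arithmetic below tries to make concrete. The numbers are assumptions chosen for illustration: a run that started at time 0, has reached 10000 ps with a 0.002 ps time step, and should be extended by 5000 ps. It also assumes that -nsteps sets the total number of steps counted from the start of the run.

```shell
#!/bin/sh
# Illustrative values only; none of them come from this page.
END_PS=10000      # time reached by the previous run (ps), start time assumed 0
DT_PS=0.002       # integration time step (ps)
EXTEND_PS=5000    # amount of extra simulation time wanted (ps)

# New end time in ps, as passed to -until.
until_ps=$(awk -v e="$END_PS" -v x="$EXTEND_PS" 'BEGIN { printf "%.0f", e + x }')

# Total number of steps for the whole run, as passed to -nsteps.
nsteps=$(awk -v u="$until_ps" -v dt="$DT_PS" 'BEGIN { printf "%.0f", u / dt }')

# Under the assumptions above, these three commands request the same run length.
echo "tpbconv -s previous.tpr -extend $EXTEND_PS -o next.tpr"
echo "tpbconv -s previous.tpr -until $until_ps -o next.tpr"
echo "tpbconv -s previous.tpr -nsteps $nsteps -o next.tpr"
```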

A simulation can also be continued without the checkpoint file, but the continuation will not be binary identical and will contain small errors that are negligible in most situations. The errors arise because the trajectory and energy files do not store all the state variables of the thermostats and barostats.

Changing .mdp file options

If you wish/need to change .mdp file options, then either

grompp -f new.mdp -c old.tpr -o new.tpr -t old.cpt
mdrun -s new.tpr

or

grompp -f new.mdp -c old.tpr -o new.tpr
mdrun -s new.tpr -cpi old.cpt

should work. The latter might be better. If your old.cpt is for a run that has finished, then use tpbconv -extend after grompp and before mdrun.

Version 3.3.3 and Before

To continue a simulation, the coordinates and velocities are required in full precision. This means using the .trr trajectory file (.xtc files are insufficient because they are stored in reduced precision and contain no velocity information). How often coordinates and velocities are written to the .trr file is set by nstxout and nstvout in the .mdp file. The process is handled by tpbconv in the following manner:

tpbconv -s previous.tpr -f previous.trr -e previous.edr -o next.tpr -extend timetoextend

Then feed the resulting .tpr file back into mdrun.
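
To pick values for nstxout and nstvout that give a restart point several times per day, a back-of-the-envelope calculation such as the following can help. The performance of 20 ns/day, the 0.002 ps time step and the 6-hour interval are assumed figures for illustration; substitute the numbers for your own system.

```shell
#!/bin/sh
# Illustrative values only; none of them come from this page.
NS_PER_DAY=20     # measured simulation speed (ns of simulated time per day)
DT_PS=0.002       # integration time step (ps)
HOURS=6           # desired wall-clock time between full-precision frames

# Steps simulated in HOURS of wall-clock time:
#   ns/day * 1000 ps/ns * (HOURS / 24) / dt
nst=$(awk -v p="$NS_PER_DAY" -v h="$HOURS" -v dt="$DT_PS" \
      'BEGIN { printf "%.0f", p * 1000 * h / 24 / dt }')

# Lines to place in the .mdp file so that coordinates and velocities
# are written on the same steps.
echo "nstxout = $nst"
echo "nstvout = $nst"
```

Using the same value for both options keeps coordinates and velocities on the same frames, so every frame written this way is a usable restart point.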

 

Exact vs binary identical continuation

If you had a computer with unlimited precision, or if you integrated the time-discretized equations of motion by hand, exact continuation would lead to identical results. But since computers have limited precision (single or double precision) and MD is chaotic, trajectories will diverge very rapidly if a single bit (the least significant one) is rounded differently. Such trajectories will all be equally valid, but very different. Continuation using a checkpoint file, with the same code compiled with the same compiler and running on the same computer architecture with the same number of processors, will lead to binary identical results, unless timing code is used. Timing code is used by the dynamic load balancing and (by default) by the FFTW FFT library.

In most cases you don't care that a trajectory is not binary identical when you stop it and continue it, as opposed to keeping it running in a single run. The most common case where you do care is when you have a crash at, e.g., step 567894 and you want to reproduce it to track down whether it is a problem with your system or a bug. mdrun has an option -reprod to (try to) force binary identical simulations. Note that in certain cases this might cost a lot of performance.
