|
Checkpointing Jobs
Starting with version 4.0, GROMACS has built-in, automated checkpointing; see, for instance, Doing Restarts. Here is a set of generic scripts for checkpointing GROMACS 3.x jobs and automatically resubmitting them.

Disclaimer
While the author uses these scripts and considers them to be high quality, the author makes no warranty about their accuracy. Further, the scripts that appear below may have been modified from their original content, since this is a wiki site. Use these scripts at your own risk.

Estimated Time Investment
If you're a bash scripting pro, it shouldn't take too long. Otherwise, plan on spending a day dedicated to getting this up and running. And please, read this entire posting.

Overall Description
1. The head.sh script is run from the command line as an interactive job. It submits a number of jobs that are chained together such that job N+1 starts once job N has completed. head.sh submits a chain of sub.$CLUSTER.sh jobs, where sub.$CLUSTER.sh is just a wrapper for sub.sh. sub.sh is the real meat of the script and does not need to be customized.
2. A section at the top of the head.sh file must be customized for each run. However, the sub.sh and sub.$CLUSTER.sh scripts do not require any job-specific modification.

Why is it so Complicated?
As the author encountered problems on different clusters, the script was augmented to deal with these issues automatically.
1. The script handles NFS delay problems gracefully.
2. The script recovers from crashes gracefully.
3. The script checks your .xtc files to ensure that they are complete and free of corruption.
4. The script keeps only the most important files from your run (.edr, .xtc, starting .gro) in order to reduce hard disk usage.
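
The NFS-delay handling listed above comes down to writing a marker file and then polling until the expected value can actually be read back. The following is a minimal sketch of that idea, not the author's waitForExistNotEmpty function: the name wait_for_value, its retry limit, and the scratch directory are all hypothetical and chosen only for illustration.

```shell
# Hypothetical helper (not part of the original scripts): poll until FILE
# contains exactly EXPECTED, sleeping one second longer after each failed try.
wait_for_value() {
    wfv_file="$1"
    wfv_expected="$2"
    wfv_max_tries="${3:-10}"
    wfv_try=1
    while [ "$wfv_try" -le "$wfv_max_tries" ]; do
        # Read the file back; a stale NFS view may return nothing at first.
        wfv_current=$(cat "$wfv_file" 2>/dev/null)
        if [ -n "$wfv_current" ] && [ "$wfv_current" = "$wfv_expected" ]; then
            return 0
        fi
        sleep "$wfv_try"
        wfv_try=$((wfv_try + 1))
    done
    echo "ERROR: $wfv_file never contained $wfv_expected" >&2
    return 1
}

# Usage: write a marker file, then confirm it is readable before moving on.
workdir=$(mktemp -d)
echo 7 > "$workdir/finished_grompp"
wait_for_value "$workdir/finished_grompp" 7 3 && echo "marker confirmed"
```

Comparing the value that was read, rather than only testing that the file exists, is what guards against a file that is visible but still appears empty to the reading node.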

Programs that you will need
You will require the G_DESORT package, available on the users submission page, if you want to use sorting. Currently, sorting is combined with shuffling and there is no option to do shuffling without sorting (but it would be a simple script modification). If you do not want to shuffle or sort, set NEVER_USE_SORT_SHUFFLE to a nonzero value in head.sh. Note also that the DD specifier in sub.sh specifies the directory in which to find g_desort. A further note about g_desort: the author uses it, but it is not supported by the developers, so use it at your own risk.

Cluster-specific modification of these scripts
The user is expected to customize these scripts for their queuing system. To set up your environment, find every occurrence of the string
    case "$CLUSTER" in
and add a section for the name that you will give your cluster, containing the appropriate settings. Let's say, for example, that you use sqsub instead of qsub on your cluster named bigCluster. You would need to modify the following section of head.sh so that the original
    case "$CLUSTER" in
      *) subCommand="qsub"
      ;;
    esac
becomes
    case "$CLUSTER" in
      bigCluster) subCommand="sqsub"
      ;;
      *) subCommand="qsub"
      ;;
    esac
And then at the top of head.sh you would define
    CLUSTER=bigCluster
Once you have made these changes, you will not need to make many modifications in order to start a new run.

How the iteration process occurs
1. User creates two files.
    echo 0 > finished_test
    echo 0 > finished_next_start_time
The finished_test file indicates that the 0th iteration has completed, and the finished_next_start_time file indicates that the next round will start at 0 ps.
2. User creates a directory called md0_success and puts their starting .gro file there. The naming of the .gro file is important. Let's say your MYMOL (as defined at the top of head.sh) was
    MYMOL=proteinInWater
Then your md0_success directory must contain a file named proteinInWater_md0_deshuffleddesorted.gro
    $ ls md0_success
    md0_success/proteinInWater_md0_deshuffleddesorted.gro
The 'deshuffleddesorted' part of the name is required even if you do not use sorting or shuffling. It is built into the script because the script has optional sorting and shuffling.
3. Create your .top and .ndx files and name them $MYMOL.top and $MYMOL.ndx
4. Create your .mdp file and name it $MYMOL.mdp
4.a) The last line of your .mdp file must be
    ;EOF
4.b) Your .mdp file must contain some flags that will be replaced using sed, based on your options in head.sh. Make sure your .mdp file contains lines that look exactly like this:
    cpp = CPP
    nsteps = NSTEPS
    tinit = TINIT
    nstxout = NSTEPS
    nstvout = NSTEPS
    nstfout = NSTEPS
    nstxtcout = SAVE_FREQUENCY
5. cp $MYMOL.mdp ${MYMOL}_posre.mdp and then modify the new file. It will be used for any segment with N less than or equal to $NJUMPS_POSRE as defined in head.sh. Personally, the author sets the define for position restraints on the protein for the first segment and also sets gen_vel and unconstrained_start differently in the two .mdp files, so that velocities are generated on the first segment and the other segments restart properly. Note: This script never uses tpbconv because it was designed for re-sorting. However, it should be a simple matter to modify it if you want to use tpbconv instead.
6. chmod +x on all the scripts
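
The setup steps above can be collected into a short script that also checks the two .mdp requirements (the placeholder flags and the trailing ;EOF) before anything is submitted. This is a sketch under stated assumptions: it works in a scratch directory, uses an empty placeholder .gro file and a skeleton .mdp that holds only the placeholder lines, and check_mdp is a hypothetical helper that is not part of the original scripts.

```shell
# Sketch only: MYMOL and the scratch directory are example values.
MYMOL=proteinInWater
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Step 1: marker files for "iteration 0 finished, next start time 0 ps".
echo 0 > finished_test
echo 0 > finished_next_start_time

# Step 2: starting structure directory (an empty placeholder .gro here).
mkdir -p md0_success
: > "md0_success/${MYMOL}_md0_deshuffleddesorted.gro"

# Step 4: a skeleton .mdp with only the placeholder lines and the ;EOF marker.
cat > "${MYMOL}.mdp" <<'MDP_END'
cpp = CPP
nsteps = NSTEPS
tinit = TINIT
nstxout = NSTEPS
nstvout = NSTEPS
nstfout = NSTEPS
nstxtcout = SAVE_FREQUENCY
;EOF
MDP_END

# Step 5: the position-restraint variant starts as a copy.
cp "${MYMOL}.mdp" "${MYMOL}_posre.mdp"

# Hypothetical check: every placeholder is present and the last line is ;EOF.
check_mdp() {
    mdp_file="$1"
    mdp_status=0
    for flag in CPP NSTEPS TINIT SAVE_FREQUENCY; do
        if ! grep -q "= *${flag}\$" "$mdp_file"; then
            echo "missing placeholder ${flag} in ${mdp_file}"
            mdp_status=1
        fi
    done
    if [ "$(tail -n 1 "$mdp_file" | awk '{print $1}')" != ";EOF" ]; then
        echo "last line of ${mdp_file} is not ;EOF"
        mdp_status=1
    fi
    return "$mdp_status"
}

check_mdp "${MYMOL}.mdp" && check_mdp "${MYMOL}_posre.mdp" && echo "setup looks complete"
```

Note the braces in ${MYMOL}_posre.mdp: without them the shell would look up a variable named MYMOL_posre.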

Location of the Scripts
A) head.sh goes in your working directory along with the .mdp, .ndx, and .top files and the md0_success directory that contains the .gro file.
B) The sub.sh and sub.$CLUSTER.sh files go in /home/`whoami`/scripts/submission/ since many different jobs should all use the same sub.sh script. If you want to put the sub.sh scripts somewhere else, you will need to modify the bottom of head.sh where it submits the sub.sh job.

Running the Script
cd to your working directory and:
    nohup ./head.sh &

Files Created During Runtime
tmp.sub: the output from your submission, used to capture job information
wait_for_this_pid: the pid of the last job in the chain
nohup.out: a list of submitted jobs with pids in the correct chained order
DATA/ directory: stores your .xtc and .edr files from N-3 and older, where N is the currently running job that has not finished. Note that the N-1 and N-2 directories contain all files (.log, .trr, .tpr, etc.), but most of these files are deleted when archiving into the DATA directory. If you want to keep more files then you should modify the source code of sub.sh in the 'finished_test' section.

Submitting more jobs to a chain that is already running
Just run head.sh again. It is for this reason that you shouldn't delete the wait_for_this_pid file at any time. If you do delete that file, you can use nohup.out to determine the pid of the last job in the chain and put that into wait_for_this_pid.
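
Recovering a deleted wait_for_this_pid file can be sketched as follows. It relies on the fact that head.sh ends each log line with "-- received pid = <id>"; the sample nohup.out written below is made up purely so that the snippet runs on its own, and the real lines will contain your full submission command.

```shell
# Sketch: rebuild wait_for_this_pid from the last "received pid" line.
# The log content here is a fabricated stand-in for a real nohup.out.
workdir=$(mktemp -d)
cat > "$workdir/nohup.out" <<'LOG'
Submitting qsub -N pagagg_1 -n 4 ... -- received pid = 1001
Submitting qsub -w 1001 -N pagagg_1 -n 4 ... -- received pid = 1002
LOG

# Take the final field of the last matching line as the job id.
last_pid=$(grep 'received pid = ' "$workdir/nohup.out" | tail -n 1 | awk '{print $NF}')
echo "$last_pid" > "$workdir/wait_for_this_pid"
cat "$workdir/wait_for_this_pid"
```

In a real working directory you would drop the scratch directory and the sample log and run only the last three commands against your own nohup.out.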

The head.sh script
$ cat head.sh
#!/bin/bash
# Automated submission script
# Chris Neale November 2007
PATH=$PATH:.
##########################################################
# Basic setup options:
MYMOL=pagagg          # starting name for your .gro, .mdp, .top (etc) files
REPLICA=1             # Just a flag to differentiate runs if you are running repeat identical simulations at once.
MD=/my/directory      # working directory
CLUSTER=cluster1      # a tag used to define cluster specific characteristics for ease of porting between clusters
NUM_TO_SUBMIT=1       # number of chained jobs to submit
MYNP=4                # number of CPUs to occupy
TJUMP=500             # Time (in ps) for a single step
NJUMPS=1000           # Number of steps before dying and allowing a new chained job to start
NFRAMES_IN_XTC=51     # Number of frames that you expect in the .xtc file = TJUMP/(SAVE_FREQUENCY*dt) + 1
SAVE_FREQUENCY=5000   # Save the .xtc every this number of timesteps
NJUMPS_POSRE=1        # The number of initial steps that will be done by position restraints
##########################################################
# More advanced setup options:
RUN_AS_A_TEST=0            # use a test queue
NEVER_USE_SORT_SHUFFLE=0   # set to nonzero to avoid sorting and shuffling always
RUNTIME_LIMIT=1w           # flag to apply a runtime limit to your job
APPLY_RUNTIME_LIMIT=0      # set to nonzero to actually apply a runtime limit
DOUBLE_CHECK_DESORT=0      # currently not implemented
##########################################################
# Things below this line do not usually need to be changed
cd ${MD}
if((RUN_AS_A_TEST)); then
  # Override the previously setup options
  MYNP=2
  TJUMP=0.2
  NJUMPS=1
  NFRAMES_IN_XTC=2
  SAVE_FREQUENCY=100
  NJUMPS_POSRE=0
  testFlag="--test"
else
  testFlag=""
fi
case "$CLUSTER" in
  *)
    if((APPLY_RUNTIME_LIMIT==1)); then
      timeFlag="-r $RUNTIME_LIMIT"
    else
      timeFlag=""
    fi
    ;;
esac
case "$CLUSTER" in
  *) subCommand="qsub"
  ;;
esac
if((MYNP>1)); then
  case "$CLUSTER" in
    *) queueIdent="-q mpi --nompirun"
    ;;
  esac
else
  case "$CLUSTER" in
    *) queueIdent="";;
  esac
fi
for((i=0;i<NUM_TO_SUBMIT;i++)); do
  if [ -e DO_NOT_RUN ]; then
    echo "ERROR: file DO_NOT_RUN exists... exiting"
    exit
  fi
  if [ -e wait_for_this_pid ]; then
    pid=`cat wait_for_this_pid | awk '{print $1}'`
    case "$CLUSTER" in
      *) waitFlag="-w $pid"
      ;;
    esac
  else
    waitFlag=""
  fi
  case "$CLUSTER" in
    *)
      nameFlag="-N ${MYMOL}_${REPLICA}"
      mynpFlag="-n ${MYNP}"
      subName=".${CLUSTER}"
      ;;
  esac
  ${subCommand} ${testFlag} ${waitFlag} ${timeFlag} ${queueIdent} ${nameFlag} ${mynpFlag} -o `pwd`/out.${i} -e `pwd`/err.${i} /home/`whoami`/scripts/submission/sub${subName}.sh ${MD} ${MYNP} ${MYMOL} ${TJUMP} ${NJUMPS} ${NFRAMES_IN_XTC} ${SAVE_FREQUENCY} ${NJUMPS_POSRE} ${NEVER_USE_SORT_SHUFFLE} ${CLUSTER} ${DOUBLE_CHECK_DESORT} > tmp.sub 2>&1
  case "$CLUSTER" in
    *) pid=`tail -n 1 tmp.sub | awk '{print $4}'`
    ;;
  esac
  echo "Submitting ${subCommand} ${testFlag} ${waitFlag} ${timeFlag} ${queueIdent} ${nameFlag} ${mynpFlag} -o `pwd`/out.${i} -e `pwd`/err.${i} /home/`whoami`/scripts/submission/sub${subName}.sh ${MD} ${MYNP} ${MYMOL} ${TJUMP} ${NJUMPS} ${NFRAMES_IN_XTC} ${SAVE_FREQUENCY} ${NJUMPS_POSRE} ${NEVER_USE_SORT_SHUFFLE} ${CLUSTER} ${DOUBLE_CHECK_DESORT} -- received pid = $pid"
  echo "$pid" > wait_for_this_pid
  sleep 1
done
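
The comment on NFRAMES_IN_XTC in head.sh gives the formula TJUMP/(SAVE_FREQUENCY*dt) + 1, and it is worth checking the numbers whenever TJUMP or SAVE_FREQUENCY change, because sub.sh rejects any segment whose frame count differs. Here is a small worked sketch using the default values from the listing; the timestep DT=0.002 ps is an assumption taken from the constant that sub.sh hard-codes when it computes nsteps, so adjust it if your .mdp uses a different dt.

```shell
# Values from the head.sh listing; DT is assumed to match the 0.002 ps
# constant used in sub.sh's nsteps calculation.
TJUMP=500            # ps per segment
SAVE_FREQUENCY=5000  # steps between .xtc frames
DT=0.002             # ps per step

# Number of MD steps per segment (rounded to the nearest integer).
nsteps=$(awk -v t="$TJUMP" -v dt="$DT" 'BEGIN { printf "%d", t / dt + 0.5 }')

# Expected frames: one per SAVE_FREQUENCY steps, plus the frame at time zero.
nframes=$(awk -v t="$TJUMP" -v s="$SAVE_FREQUENCY" -v dt="$DT" \
    'BEGIN { printf "%d", t / (s * dt) + 1 + 0.5 }')

echo "nsteps=${nsteps} NFRAMES_IN_XTC=${nframes}"
# prints: nsteps=250000 NFRAMES_IN_XTC=51
```

This sketch rounds to the nearest integer, whereas sub.sh truncates the bc result at the decimal point, so the two can differ by one step for values of TJUMP that are not an exact multiple of the timestep.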

The sub.$CLUSTER.sh script
$ cat sub.cluster1.sh
#!/bin/bash
# This is a wrapper script so that you can define PATH, LD_LIBRARY_PATH
# A wrapper is required if you need to define via "#$ -v PATH=", etc...
# All eleven arguments sent by head.sh are passed on (sub.sh reads ${11} as DOUBLE_CHECK_DESORT)
/home/`whoami`/scripts/submission/sub.sh ${1} ${2} ${3} ${4} ${5} ${6} ${7} ${8} ${9} ${10} ${11}
# If you have severe NFS delay, you may need to include this next line
#sleep 60

The sub.sh script
$ cat sub.sh
#!/bin/bash
# Automated submission script
# Chris Neale November 2007
echo -n "TIMING TEST (start): "
date
MD="$1"
MYNP="$2"
MYMOL="$3"
TJUMP="$4"
NJUMPS="$5"
NFRAMES_IN_XTC="$6"
SAVE_FREQUENCY="$7"
NJUMPS_POSRE="$8"
NEVER_USE_SORT_SHUFFLE="$9"
CLUSTER="${10}"
DOUBLE_CHECK_DESORT="${11}"
case "$CLUSTER" in
  *)
    ED=/tools/gromacs-3.3.1/exec/bin
    PED=${ED}
    mpiLocation="/tools/openmpi/1.2.1"
    mpirunProg="${mpiLocation}/bin/mpirun"
    mdrun_mpiProg="mdrun_openmpi_v1.2.1"
    ###############################################
    # For LAM mpi                                 #
    # mpiLocation="/tools/lam/lam-7.1.2"          #
    # mpirunProg="${mpiLocation}/bin/mpirun C"    #
    # mdrun_mpiProg="mdrun_mpi"                   #
    ###############################################
    ;;
esac
case "$CLUSTER" in
  *) DD=/home/`whoami`/gromacs/template
  ;;
esac
# The NEW variable allows use of a special writing location (e.g ${TMPDIR} or /scratch/`whoami`)
case "$CLUSTER" in
  *) NEW="."
  ;;
esac
case "$CLUSTER" in
  *) CPP="cpp"
  ;;
esac
PATH=$PATH:.
cd ${MD}
TINY_SLEEP=1
SHORT_SLEEP=10
LONG_SLEEP=60
EXTENDED_SLEEP=300
###############################################
# Startup tests:
if [ -e DO_NOT_RUN ]; then
  echo "ERROR error: file DO_NOT_RUN exists... exiting"
  exit
fi
for((v=0;v<2;v++)); do
  num=0;
  if [ -e finished_grompp ]; then
    let "num=$num+1"
  fi
  if [ -e finished_mdrun ]; then
    let "num=$num+1"
  fi
  if [ -e finished_desort ]; then
    let "num=$num+1"
  fi
  if [ -e finished_test ]; then
    let "num=$num+1"
  fi
  if((num==0)); then
    echo "Unsure how to start the run. Check this out"
    echo "\$ ls -l finished_grompp finished_mdrun finished_desort finished_test"
    ls -l finished_grompp finished_mdrun finished_desort finished_test
    echo "Perhaps you forgot to set a finished_XXX file upon starting your run?"
    echo " - Otherwise there seems to be an error in the script."
  else
    if((num!=1)); then
      echo "Unsure how to start the run. Check this out"
      echo "\$ ls -l finished_grompp finished_mdrun finished_desort finished_test"
      ls -l finished_grompp finished_mdrun finished_desort finished_test
      echo "Only one of these files should have existed."
    fi
  fi
  if((num==1)); then
    break;
  fi
  echo "Will sleep then try one more time"
  sleep ${LONG_SLEEP}
done
if((num!=1)); then
  echo "Could not resolve multiple x for finished_x problem. Exiting"
  touch ./DO_NOT_RUN
  exit
fi
###############################################
# Initializations:
gromppProblemsInARow=0
reverted=0
MAX_CONSECUTIVE_GROMPP_ERRORS=2
MAX_CONSECUTIVE_MDRUN_ERRORS=2
function waitForExistNotEmpty {
  # First arg controls usage:
  #   0 for existence test
  #   1 for not empty test
  #   -1 for must not exist test
  #   12 for not empty test plus require single value to equal third arg
  # Second arg is name of file/directory
  # Third arg is the expected single value in the file if First arg is = 12
  #
  # Note: This overly complicated procedure is required for proper usage of a
  #       cluster where NFS delay can be significant and simple -s tests
  #       routinely fail to detect the fact that the file is empty as far as
  #       val=`cat file` is concerned
  notEmpty="$1"
  case "$notEmpty" in
    1*) eneFlag="-s $2"
    ;;
    -1) eneFlag="! -e $2"
    ;;
    0) eneFlag="-e $2"
    ;;
    *) echo "ERROR error: incorrect argument to waitForExistNotEmpty = $notEmpty"
       exit
    ;;
  esac
  # Sleeps 1+2+...+99 seconds (about 80 minutes) in total before giving up
  for((length=1;length<100;length++)); do
    if [ ${eneFlag} ]; then
      break
    fi
    sleep ${length}
  done
  if((length==100)); then
    echo "ERROR error: Have slept for over 80 minutes while waiting for the file $2 to meet conditions [ ${eneFlag} ] ... is there a problem in your script or is the NFS delay very very large?"
  fi
  # for all uses other than First Arg = 12 this function is over
  if((notEmpty!=12)); then
    return
  fi
  expectedVal="$3"
  for((length=1;length<100;length++)); do
    currentVal=`cat $2`
    if [ -n "$currentVal" ]; then
      # make sure variable is non-empty before making the comparison
      case "$currentVal" in
        $expectedVal)
          echo "NOTE: breaking from loop since currentVal($currentVal)=expectedVal($expectedVal)"
          break
          ;;
      esac
    fi
    sleep ${length}
  done
  if((length==100)); then
    echo "ERROR error: Have slept for over 80 minutes while waiting for the file $2 to contain the value $expectedVal ... is there a problem in your script or is the NFS delay very very large?"
  fi
}
###############################################
# The main loop:
for ((njump=0;njump<NJUMPS;njump++)); do
  # GROMPP (WITH SORTING)
  if [ -e finished_test ]; then
    NIN=`cat finished_test`
    TINIT=`cat finished_next_start_time`
    let "NOUT=$NIN+1"
    DIR=md${NOUT}_running
    PREV=md${NIN}_success
    if [ ! -e ${PREV} ]; then
      echo "There was some problem. Expected ${PREV} to exist, but it does not"
      touch ./DO_NOT_RUN
      exit
    fi
    if [ ! -e ${NEW}/${DIR} ]; then
      mkdir ${NEW}/${DIR}
    fi
    nsteps=`echo "$TJUMP/0.002" | bc -l | awk -F '.' '{print $1}'`
    if((NOUT<=NJUMPS_POSRE)); then
      sed "s/TINIT/${TINIT}/" ${MYMOL}_posre.mdp | sed "s/NSTEPS/${nsteps}/" | sed "s/SAVE_FREQUENCY/${SAVE_FREQUENCY}/" | sed "s/CPP/${CPP}/" > ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp
      posreFlag="-r md0_success/${MYMOL}_md0_deshuffleddesorted.gro"
    else
      sed "s/TINIT/${TINIT}/" ${MYMOL}.mdp | sed "s/NSTEPS/${nsteps}/" | sed "s/SAVE_FREQUENCY/${SAVE_FREQUENCY}/" | sed "s/CPP/${CPP}/" > ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp
      posreFlag=""
    fi
    waitForExistNotEmpty 1 ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp
    if [ ! `tail -1 ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp | awk '{print $1}'` = ";EOF" ]; then sleep ${TINY_SLEEP}; fi
    if [ ! `tail -1 ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp | awk '{print $1}'` = ";EOF" ]; then sleep ${SHORT_SLEEP}; fi
    if [ ! `tail -1 ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp | awk '{print $1}'` = ";EOF" ]; then sleep ${LONG_SLEEP}; fi
    if((NIN!=0)); then
      if((NOUT<=NJUMPS_POSRE || NEVER_USE_SORT_SHUFFLE==1 || MYNP==1)); then
        # Can not do shuffle/sort with posre
        ${ED}/grompp -np ${MYNP} -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp ${posreFlag} -c ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro -t ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.trr -p ${MYMOL}.top -n ${MYMOL}.ndx -e ${PREV}/${MYMOL}_md${NIN}.edr -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr
        rm mdout.mdp &
      else
        # Shuffle the .trr input file correctly. Assume that it is not currently shuffled
        ${ED}/grompp -np ${MYNP} -shuffle -sort -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp ${posreFlag} -c ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro -p ${MYMOL}.top -n ${MYMOL}.ndx -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}_a.tpr -deshuf ${NEW}/${DIR}/deshuffle_md${NOUT}_a.ndx
        rm -f ${NEW}/${DIR}/deshuffle_md${NOUT}_a.ndx mdout.mdp &
        echo System | ${ED}/editconf -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}_a.tpr -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit_a.gro
        # g_desort -f original shuffled will unshuffle, therefore g_desort -f shuffled original will REshuffle
        ${DD}/g_desort -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit_a.gro ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro -o ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}_a.ndx -n 6
        ${ED}/trjconv -f ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.trr -o ${PREV}/${MYMOL}_md${NIN}_reshuffleresort.trr -n ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}_a.ndx
        # Create the run input file
        ${ED}/grompp -np ${MYNP} -shuffle -sort -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp ${posreFlag} -c ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro -t ${PREV}/${MYMOL}_md${NIN}_reshuffleresort.trr -p ${MYMOL}.top -n ${MYMOL}.ndx -deshuf ${NEW}/${DIR}/deshuffle_md${NOUT}.ndx -e ${PREV}/${MYMOL}_md${NIN}.edr -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr
        rm -f ${NEW}/${DIR}/deshuffle_md${NOUT}.ndx mdout.mdp &
        # In the future: implement this check. Note that this will require rethinking the _a postfixes
        # since the files without the _a postfixes are the ones that I actually use
        #if((DOUBLE_CHECK_DESORT!=0)); then
        # If a new reshuffle.ndx file differs then the run is invalid.
        echo System | ${ED}/editconf -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit.gro
        ${DD}/g_desort -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit.gro ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro -o ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}.ndx -n 6
        case "$CLUSTER" in
          *) look=`diff -q ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}_a.ndx ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}.ndx`
          ;;
        esac
        if [ -n "$look" ]; then
          echo "There was a big problem. ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}_a.ndx and ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}.ndx are different."
          mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr ${NEW}/${DIR}/${MYMOL}_md${NOUT}_notValid.tpr
        fi
        # End of the new test
        #fi
        #Create the deshuffle file to properly handle the next run
        ${DD}/g_desort -f ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit.gro -o ${NEW}/${DIR}/deshuffledesort${MYMOL}_md${NOUT}.ndx -n 6
        rm -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}_a.tpr ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit_a.gro ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}_a.ndx ${PREV}/${MYMOL}_md${NIN}_reshuffleresort.trr ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit.gro ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}.ndx mdout.mdp &
      fi
    else
      if((NOUT<=NJUMPS_POSRE || NEVER_USE_SORT_SHUFFLE==1 || MYNP==1)); then
        # Can not do shuffle/sort with posre
        ${ED}/grompp -np ${MYNP} -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp ${posreFlag} -c ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro -p ${MYMOL}.top -n ${MYMOL}.ndx -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr
        rm mdout.mdp &
      else
        ${ED}/grompp -np ${MYNP} -shuffle -sort -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp ${posreFlag} -c ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro -p ${MYMOL}.top -n ${MYMOL}.ndx -deshuf ${NEW}/${DIR}/deshuffle_md${NOUT}.ndx -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr
        rm -f ${NEW}/${DIR}/deshuffle_md${NOUT}.ndx mdout.mdp &
        echo System | ${ED}/editconf -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit.gro
        ${DD}/g_desort -f ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit.gro -o ${NEW}/${DIR}/deshuffledesort${MYMOL}_md${NOUT}.ndx -n 6
        rm -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit.gro &
      fi
    fi
    if [ -e ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr ]; then
      echo ${NOUT} > finished_grompp
      waitForExistNotEmpty 12 finished_grompp ${NOUT}
      rm -f finished_test
      gromppProblemsInARow=0
    else
      # can not revert
      let "gromppProblemsInARow=$gromppProblemsInARow+1"
      if((gromppProblemsInARow>MAX_CONSECUTIVE_GROMPP_ERRORS)); then
        touch ./DO_NOT_RUN
        exit
      fi
    fi
  fi
  # MDRUN
  if [ -e finished_grompp ]; then
    NOUT=`cat finished_grompp`
    TINIT=`cat finished_next_start_time`
    DIR=md${NOUT}_running
    # Reversion is important in cases where a crash or time overrun leads to loss of data in ${TMPDIR}
    if [ ! -e ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr ]; then
      sleep ${SHORT_SLEEP}
      if [ ! -e ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr ]; then
        rm -f finished_grompp
        echo "ERROR error: finished_grompp existed for NOUT=$NOUT but ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr did not exist"
        if((reverted==0)); then
          echo " reverting to finished_test"
          reverted=1
          let "NIN=$NOUT-1"
          echo "$NIN" > finished_test
          if [ ! -s finished_test ]; then sleep ${SHORT_SLEEP}; fi
          continue
        else
          touch ./DO_NOT_RUN
          exit
        fi
      else
        reverted=0;
      fi
    fi
    # Record the exit status of the run; capturing the command in backticks
    # would store its standard output instead of its return code
    if((MYNP==1)); then
      ${ED}/mdrun -deffnm ${NEW}/${DIR}/${MYMOL}_md${NOUT}
      returnValue=$?
    else
      case "$CLUSTER" in
        *)
          ${mpirunProg} ${PED}/${mdrun_mpiProg} -np ${MYNP} -deffnm ${NEW}/${DIR}/${MYMOL}_md${NOUT}
          returnValue=$?
          ;;
      esac
    fi
    if((returnValue!=0)); then
      echo "ERROR error: mpirun for mdrun_mpi returned non-zero (${returnValue}). Exiting"
      exit
    fi
    echo ${NOUT} > finished_mdrun
    waitForExistNotEmpty 12 finished_mdrun ${NOUT}
    rm -f finished_grompp
  fi
  # DESORT
  if [ -e finished_mdrun ]; then
    NOUT=`cat finished_mdrun`
    TINIT=`cat finished_next_start_time`
    DIR=md${NOUT}_running
    # Reversion is important in cases where a crash or time overrun leads to loss of data in ${TMPDIR}
    if [ ! -e ${NEW}/${DIR}/${MYMOL}_md${NOUT}.xtc ]; then
      rm -f finished_mdrun
      echo "ERROR error: finished_mdrun existed for NOUT=$NOUT but ${NEW}/${DIR}/${MYMOL}_md${NOUT}.xtc did not exist"
      if((reverted==0)); then
        echo " reverting to finished_test"
        reverted=1
        let "NIN=$NOUT-1"
        echo "$NIN" > finished_test
        waitForExistNotEmpty 12 finished_test ${NIN}
        continue
      else
        touch ./DO_NOT_RUN
        exit
      fi
    else
      reverted=0;
    fi
    if((NOUT<=NJUMPS_POSRE || NEVER_USE_SORT_SHUFFLE==1 || MYNP==1)); then
      # Can not do shuffle/sort with posre
      mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}.xtc ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.xtc
      mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}.trr ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.trr
      mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}.gro ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.gro
    else
      echo System | ${ED}/trjconv -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.xtc -s ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr -n ${NEW}/${DIR}/deshuffledesort${MYMOL}_md${NOUT}.ndx -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.xtc &
      echo System | ${ED}/trjconv -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.trr -s ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr -n ${NEW}/${DIR}/deshuffledesort${MYMOL}_md${NOUT}.ndx -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.trr &
      echo System | ${ED}/trjconv -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.gro -s ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr -n ${NEW}/${DIR}/deshuffledesort${MYMOL}_md${NOUT}.ndx -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.gro &
      # Wait for all 3 desorts to finish
      wait
    fi
    echo ${NOUT} > finished_desort
    waitForExistNotEmpty 12 finished_desort ${NOUT}
    rm -f finished_mdrun
  fi
  # TEST
  if [ -e finished_desort ]; then
    TINIT=`cat finished_next_start_time`
    runHasNoErrors=1
    NOUT=`cat finished_desort`
    DIR=md${NOUT}_running
    # Reversion is important in cases where a crash or time overrun leads to loss of data in ${TMPDIR}
    if [ ! -e ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.xtc ]; then
      rm -f finished_desort
      echo "ERROR error: finished_desort existed for NOUT=$NOUT but ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.xtc did not exist"
      if((reverted==0)); then
        echo " reverting to finished_test"
        reverted=1
        let "NIN=$NOUT-1"
        echo "$NIN" > finished_test
        waitForExistNotEmpty 12 finished_test ${NIN}
        continue
      else
        touch ./DO_NOT_RUN
        exit
      fi
    else
      reverted=0;
    fi
    ${ED}/gmxcheck -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.xtc 2> ${NEW}/${DIR}/checkXTC
    # Ensure no magic number error
    magicNumberError=`grep Error ${NEW}/${DIR}/checkXTC | wc -l`
    if((magicNumberError==1));then
      mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.xtc ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted_magicNumberError.xtc
      mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.trr ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted_magicNumberError.trr
      mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.gro ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted_magicNumberError.gro
      runHasNoErrors=0
    fi
    ## Ensure expected number of frames is reached -- this is system and mdp file specific
    numFramesXTC=`grep "^Step" ${NEW}/${DIR}/checkXTC | awk '{print $2}'`
    if((numFramesXTC!=NFRAMES_IN_XTC)); then
      mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.xtc ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted_incompleteFrames.xtc
      mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.trr ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted_incompleteFrames.trr
      mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.gro ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted_incompleteFrames.gro
      runHasNoErrors=0
    fi
    if((runHasNoErrors)); then
      mv ${NEW}/${DIR} md${NOUT}_success
      waitForExistNotEmpty -1 ${NEW}/${DIR}
      while [ -e ${NEW}/${DIR} ]; do
        mv ${NEW}/${DIR} md${NOUT}_success
        sleep ${LONG_SLEEP}
      done
      echo ${NOUT} > finished_test
      nextTime=`echo "${TINIT}+${TJUMP}" | bc -l`
      echo ${nextTime} > finished_next_start_time
      waitForExistNotEmpty 12 finished_test ${NOUT}
      waitForExistNotEmpty 12 finished_next_start_time ${nextTime}
      # Now do some clean up operations
      let "NCLEAN=$NOUT-2"
      if [ -e md${NCLEAN}_success ]; then
        if [ ! -e DATA ]; then
          mkdir DATA
        fi
        if((NCLEAN!=0)); then
          # Don't remove or modify the starting directory
          mkdir DATA/md${NCLEAN}_success
          mv md${NCLEAN}_success/${MYMOL}_md${NCLEAN}_deshuffleddesorted.xtc DATA/md${NCLEAN}_success
          waitForExistNotEmpty -1 md${NCLEAN}_success/${MYMOL}_md${NCLEAN}_deshuffleddesorted.xtc
          while [ -e md${NCLEAN}_success/${MYMOL}_md${NCLEAN}_deshuffleddesorted.xtc ]; do
            mv md${NCLEAN}_success/${MYMOL}_md${NCLEAN}_deshuffleddesorted.xtc DATA/md${NCLEAN}_success
            sleep ${LONG_SLEEP}
          done
          mv md${NCLEAN}_success/${MYMOL}_md${NCLEAN}.edr DATA/md${NCLEAN}_success
          waitForExistNotEmpty -1 md${NCLEAN}_success/${MYMOL}_md${NCLEAN}.edr
          while [ -e md${NCLEAN}_success/${MYMOL}_md${NCLEAN}.edr ]; do
            mv md${NCLEAN}_success/${MYMOL}_md${NCLEAN}.edr DATA/md${NCLEAN}_success
            sleep ${LONG_SLEEP}
          done
          rm -rf md${NCLEAN}_success &
        fi
      fi
    else
      # Send the run back to do the mdrun
      for((i=1;i<=MAX_CONSECUTIVE_MDRUN_ERRORS;i++)); do
        if [ ! -e md${NOUT}_failure${i} ]; then
          break
        fi
      done
      mv ${NEW}/${DIR} md${NOUT}_failure${i}
      if((i>MAX_CONSECUTIVE_MDRUN_ERRORS)); then
        echo "Too many failures for run $NOUT"
        touch ./DO_NOT_RUN
        exit
      fi
      # Send it back by setting as if grompp just finished
      mkdir ${NEW}/${DIR}
      mv md${NOUT}_failure${i}/${MYMOL}_md${NOUT}.tpr md${NOUT}_failure${i}/deshuffledesort${MYMOL}_md${NOUT}.ndx ${NEW}/${DIR}
      echo ${NOUT} > finished_grompp
      waitForExistNotEmpty 12 finished_grompp ${NOUT}
    fi
    rm -f finished_desort
  fi
  wait
done
wait
echo -n "TIMING TEST (end): "
date
|