Checkpointing Jobs
Starting with version 4.0, GROMACS has built-in, automated checkpointing; see, for instance, Doing Restarts. This page provides a set of generic scripts for checkpointing GROMACS 3.x jobs and automatically resubmitting them.
Disclaimer
While the author uses these scripts and considers them to be of high quality, the author makes no warranty about their accuracy. Further, because this is a wiki site, the scripts that appear below may have been modified from their original content. Use these scripts at your own risk.
Estimated Time Investment
If you're a bash scripting pro, it shouldn't take too long. Otherwise, plan on spending a day dedicated to getting this up and running. And please, read this entire posting.
Overall Description
1. The head.sh script is run from the command line as an interactive job. It submits a number of jobs that are chained together such that job N+1 starts once job N has completed. Each job in the chain runs sub.$CLUSTER.sh, which is just a wrapper for sub.sh. sub.sh is the real meat of the system.
2. A section at the top of the head.sh file must be customized for each run. However, the sub.sh and sub.$CLUSTER.sh scripts do not require any job-specific modification.
Why is it so Complicated?
As the author encountered problems on different clusters, the script was augmented to deal with these issues automatically:
1. The script handles NFS delay problems gracefully.
2. The script recovers from crashes gracefully.
3. The script checks your .xtc files to ensure that they are complete and free of corruption.
4. The script keeps only the most important files from your run (.edr, .xtc, starting .gro) in order to reduce hard disk usage.
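The NFS handling in item 1 comes down to re-checking a file several times with growing pauses instead of trusting a single test. The following is a minimal sketch of that idea, not the function used in sub.sh; the name wait_for_nonempty and its optional arguments are made up for illustration.

```bash
#!/bin/bash
# Hypothetical helper (not part of sub.sh): wait until FILE exists and is
# non-empty, sleeping a little longer after each failed check.
# Usage: wait_for_nonempty FILE [MAX_TRIES] [SLEEP_UNIT_SECONDS]
wait_for_nonempty() {
    local file="$1" max_tries="${2:-10}" unit="${3:-1}" try
    for ((try = 1; try <= max_tries; try++)); do
        if [ -s "$file" ]; then
            return 0    # file is visible and has content
        fi
        sleep "$((try * unit))"    # back off: 1, 2, 3, ... units
    done
    return 1            # gave up; the caller decides what to do
}
```

A caller might write `wait_for_nonempty finished_test 20 || touch DO_NOT_RUN`, mirroring the way the real script stops the chain when it cannot make sense of its state.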
Programs that you will need
You will require the G_DESORT package, available on the users submission page, if you want to use sorting. Currently, sorting is combined with shuffling and there is no option to do shuffling without sorting (but it would be a simple script modification). If you do not want to shuffle or sort, set NEVER_USE_SORT_SHUFFLE to 1 in head.sh (sub.sh tests for the value 1 specifically). Note also that the DD variable in sub.sh specifies the directory in which to find g_desort. A further note about g_desort: the author uses it, but it is not supported by the developers, so use it at your own risk.
Cluster-specific modification of these scripts
The user is expected to customize these scripts for their queuing system. In order to set up your environment, find all cases of the string:
    case "$CLUSTER" in
and add a section for the name that you will give your cluster, containing the appropriate settings. Say, for example, that you use sqsub instead of qsub on your cluster named bigCluster. You would need to modify the following section of head.sh so that the original:
    case "$CLUSTER" in
    *)
    subCommand="qsub"
    ;;
    esac
becomes:
    case "$CLUSTER" in
    bigCluster)
    subCommand="sqsub"
    ;;
    *)
    subCommand="qsub"
    ;;
    esac
And then at the top of head.sh you would define:
    CLUSTER=bigCluster
Once you have made these changes, you will not need to make many modifications in order to start a new run.
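The same dispatch pattern can be tried in isolation before editing head.sh. This sketch wraps the case statement in a function; the function name sub_command_for is hypothetical, and bigCluster and sqsub are only the example names used in the text.

```bash
#!/bin/bash
# Sketch only: map a cluster tag to its submission command.
sub_command_for() {
    case "$1" in
        bigCluster) echo "sqsub" ;;   # cluster-specific override
        *)          echo "qsub"  ;;   # default for every other tag
    esac
}

CLUSTER=bigCluster
subCommand=$(sub_command_for "$CLUSTER")
echo "Jobs on ${CLUSTER} would be submitted with: ${subCommand}"
```

Because the `*)` branch comes last, any cluster tag without its own entry falls back to the default, which is why new entries must be added above it.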
How the iteration process occurs
1. User creates two files:
    echo 0 > finished_test
    echo 0 > finished_next_start_time
The finished_test file indicates that the 0th iteration has completed, and the finished_next_start_time file indicates that the next round will start at 0 ps.
2. User creates a directory called md0_success and puts their starting .gro file there. The naming of the .gro file is important. Let's say your MYMOL (as defined at the top of head.sh) was:
    MYMOL=proteinInWater
Then your md0_success directory must contain a file named proteinInWater_md0_deshuffleddesorted.gro:
    $ ls md0_success
    md0_success/proteinInWater_md0_deshuffleddesorted.gro
The 'deshuffleddesorted' part of the name is required even if you do not use sorting or shuffling. It is built into the script because the script has optional sorting and shuffling.
3. Create your .top and .ndx files and name them $MYMOL.top and $MYMOL.ndx.
4. Create your .mdp file and name it $MYMOL.mdp.
4.a) The last line of your .mdp file must be:
    ;EOF
4.b) Your .mdp file must contain placeholders that sub.sh replaces, using sed, with values derived from your options in head.sh. Make sure your .mdp file contains lines that look exactly like this:
    cpp = CPP
    nsteps = NSTEPS
    tinit = TINIT
    nstxout = NSTEPS
    nstvout = NSTEPS
    nstfout = NSTEPS
    nstxtcout = SAVE_FREQUENCY
5. Copy the .mdp file:
    cp ${MYMOL}.mdp ${MYMOL}_posre.mdp
and then modify the new file. It is used for every segment whose number is no greater than $NJUMPS_POSRE as defined in head.sh, that is, for the first $NJUMPS_POSRE segments. Personally, the author sets the define for position restraints on the protein for the first segment and also sets gen_vel and unconstrained_start differently in the two .mdp files, so that velocities are generated on the first segment and there is a proper restart on the other segments. Note: this script never uses tpbconv because it was designed for re-sorting. However, it should be a simple matter to modify it if you want to use tpbconv instead.
6. Run chmod +x on all the scripts.
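Since a missing placeholder or a missing ;EOF line only shows up once a job is already in the queue, it can help to check the .mdp file beforehand. The following is a minimal sketch of such a check; check_mdp is a hypothetical helper, not part of the original scripts, and it tests only the requirements listed in steps 4.a and 4.b.

```bash
#!/bin/bash
# Hypothetical helper: verify that an .mdp template carries the placeholder
# lines and the ;EOF terminator that the steps above require.
# Usage: check_mdp FILE   (returns non-zero and lists problems on failure)
check_mdp() {
    local mdp="$1" status=0 pair key value
    for pair in cpp=CPP nsteps=NSTEPS tinit=TINIT nstxout=NSTEPS \
                nstvout=NSTEPS nstfout=NSTEPS nstxtcout=SAVE_FREQUENCY; do
        key="${pair%%=*}"
        value="${pair#*=}"
        # Accept flexible whitespace around "=", and an optional trailing comment.
        if ! grep -Eq "^[[:space:]]*${key}[[:space:]]*=[[:space:]]*${value}([[:space:]]|;|\$)" "$mdp"; then
            echo "missing placeholder line: ${key} = ${value}"
            status=1
        fi
    done
    # Same test that sub.sh applies: first field of the last line must be ;EOF
    if [ "$(tail -n 1 "$mdp" | awk '{print $1}')" != ";EOF" ]; then
        echo "last line is not ;EOF"
        status=1
    fi
    return "$status"
}
```

It would be run once per template, for example `check_mdp proteinInWater.mdp && check_mdp proteinInWater_posre.mdp`, using the example MYMOL from step 2.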
Location of the Scripts
A) head.sh goes in your working directory along with the .mdp, .ndx and .top files and the md0_success directory that contains the .gro file.
B) The sub.sh and sub.$CLUSTER.sh files go in /home/`whoami`/scripts/submission/ since many different jobs should all use the same sub.sh script. If you want to put the sub.sh scripts somewhere else, you will need to modify the bottom of head.sh where it submits the sub.sh job.
Running the Script
cd to your working directory and run:
    nohup ./head.sh &
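Before launching a fresh run, it may be worth confirming that every file described in the setup steps is in place. This is a sketch under that assumption; preflight is a hypothetical helper, and it applies only to a brand-new run (a run already in progress may legitimately have a different finished_* file).

```bash
#!/bin/bash
# Hypothetical pre-flight check (not part of the original scripts).
# Usage: preflight WORKDIR MYMOL
# Prints each missing item and returns non-zero if anything is missing.
preflight() {
    local dir="$1" mol="$2" missing=0 item
    for item in \
        finished_test \
        finished_next_start_time \
        "md0_success/${mol}_md0_deshuffleddesorted.gro" \
        "${mol}.top" \
        "${mol}.ndx" \
        "${mol}.mdp" \
        "${mol}_posre.mdp"; do
        if [ ! -e "${dir}/${item}" ]; then
            echo "missing: ${item}"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `preflight . proteinInWater && nohup ./head.sh &` would start the chain only when nothing is missing.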
Files Created During Runtime
tmp.sub: the output from your submission, which is used to capture job information
wait_for_this_pid: the pid of the last job in the chain
nohup.out: a list of submitted jobs with pids in the correct chained order
DATA/ directory: this stores your .xtc and .edr files from segments N-3 and older, where N is the currently running segment that has not finished. Note that the N-1 and N-2 directories contain all files (.log, .trr, .tpr, etc.), but most of these files are deleted when archiving into the DATA directory. If you want to keep more files, modify the source code of sub.sh in the 'finished_test' section.
Submitting more jobs to a chain that is already running
Just run head.sh again. It is for this reason that you shouldn't delete the wait_for_this_pid file at any time. Note that if you do delete that file, you can use nohup.out to determine the pid of the last job in the chain and put that into wait_for_this_pid.
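Recovering the pid by hand can be scripted. The sketch below assumes that nohup.out still contains the lines written by head.sh, each of which ends in "-- received pid = <pid>"; the function name last_submitted_pid is made up for illustration.

```bash
#!/bin/bash
# Hypothetical helper: print the pid recorded on the last "received pid"
# line of a head.sh log. Prints nothing if no such line is found.
# Usage: last_submitted_pid LOGFILE
last_submitted_pid() {
    sed -n 's/.*-- received pid = //p' "$1" | tail -n 1
}
```

It could be used as `last_submitted_pid nohup.out > wait_for_this_pid`. Check the result before resubmitting: if the last submission failed, the recovered value may be empty.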
The head.sh script
$ cat head.sh
#!/bin/bash
# Automated submission script
# Chris Neale November 2007
PATH=$PATH:.
##########################################################
# Basic setup options:
MYMOL=pagagg # starting name for your .gro, .mdp, .top (etc) files
REPLICA=1 # Just a flag to differentiate runs if you are running repeat identical simulations at once.
MD=/my/directory # working directory
CLUSTER=cluster1 # a tag used to define cluster specific characteristics for ease of porting between clusters
NUM_TO_SUBMIT=1 # number of chained jobs to submit
MYNP=4 # number of CPUs to occupy
TJUMP=500 # Time (in ps) for a single step
NJUMPS=1000 # Number of steps before dying and allowing a new chained job to start
NFRAMES_IN_XTC=51 # Number of frames that you expect in the .xtc file = TJUMP/(SAVE_FREQUENCY*dt) + 1
SAVE_FREQUENCY=5000 # Save the .xtc every this number of timesteps
NJUMPS_POSRE=1 # The number of initial steps that will be done by position restraints
##########################################################
# More advanced setup options:
RUN_AS_A_TEST=0 # use a test queue
NEVER_USE_SORT_SHUFFLE=0 # set to 1 to avoid sorting and shuffling always
RUNTIME_LIMIT=1w # flag to apply a runtime limit to your job
APPLY_RUNTIME_LIMIT=0 # set to 1 to actually apply a runtime limit
DOUBLE_CHECK_DESORT=0 # currently not implemented
##########################################################
# Things below this line do not usually need to be changed
cd ${MD}
if((RUN_AS_A_TEST)); then
# Override the previously setup options
MYNP=2
TJUMP=0.2
NJUMPS=1
NFRAMES_IN_XTC=2
SAVE_FREQUENCY=100
NJUMPS_POSRE=0
testFlag="--test"
else
testFlag=""
fi
case "$CLUSTER" in
*)
if((APPLY_RUNTIME_LIMIT==1)); then
timeFlag="-r $RUNTIME_LIMIT"
else
timeFlag=""
fi
;;
esac
case "$CLUSTER" in
*)
subCommand="qsub"
;;
esac
if((MYNP>1)); then
case "$CLUSTER" in
*)
queueIdent="-q mpi --nompirun"
;;
esac
else
case "$CLUSTER" in
*)
queueIdent="";;
esac
fi
for((i=0;i<NUM_TO_SUBMIT; i++)); do
if [ -e DO_NOT_RUN ]; then
echo "ERROR: file DO_NOT_RUN exists... exiting"
exit
fi
if [ -e wait_for_this_pid ]; then
pid=`cat wait_for_this_pid | awk '{print $1}'`
case "$CLUSTER" in
*)
waitFlag="-w $pid"
;;
esac
else
waitFlag=""
fi
case "$CLUSTER" in
*)
nameFlag="-N ${MYMOL}_${REPLICA}"
mynpFlag="-n ${MYNP}"
subName=".${CLUSTER}"
;;
esac
${subCommand} ${testFlag} ${waitFlag} ${timeFlag} ${queueIdent} ${nameFlag} ${mynpFlag} -o `pwd`/out.${i} -e `pwd`/err.${i} /home/`whoami`/scripts/submission/sub${subName}.sh ${MD} ${MYNP} ${MYMOL} ${TJUMP} ${NJUMPS} ${NFRAMES_IN_XTC} ${SAVE_FREQUENCY} ${NJUMPS_POSRE} ${NEVER_USE_SORT_SHUFFLE} ${CLUSTER} ${DOUBLE_CHECK_DESORT} > tmp.sub 2>&1
case "$CLUSTER" in
*)
pid=`tail -n 1 tmp.sub | awk '{print $4}'`
;;
esac
echo "Submitting ${subCommand} ${testFlag} ${waitFlag} ${timeFlag} ${queueIdent} ${nameFlag} ${mynpFlag} -o `pwd`/out.${i} -e `pwd`/err.${i} /home/`whoami`/scripts/submission/sub${subName}.sh ${MD} ${MYNP} ${MYMOL} ${TJUMP} ${NJUMPS} ${NFRAMES_IN_XTC} ${SAVE_FREQUENCY} ${NJUMPS_POSRE} ${NEVER_USE_SORT_SHUFFLE} ${CLUSTER} ${DOUBLE_CHECK_DESORT} -- received pid = $pid"
echo "$pid" > wait_for_this_pid
sleep 1
done
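The comment on NFRAMES_IN_XTC in the listing above gives the formula TJUMP/(SAVE_FREQUENCY*dt) + 1. As a quick check of that arithmetic, here is a small sketch; expected_frames is a hypothetical helper, and dt = 0.002 ps is assumed because sub.sh computes the step count as TJUMP/0.002.

```bash
#!/bin/bash
# Hypothetical helper: number of frames expected in one segment's .xtc file.
# Usage: expected_frames TJUMP_PS SAVE_FREQUENCY_STEPS DT_PS
expected_frames() {
    awk -v t="$1" -v f="$2" -v dt="$3" \
        'BEGIN { printf "%.0f\n", t / (f * dt) + 1 }'
}

# Production settings from head.sh: 500 ps segments, a frame every 5000 steps
expected_frames 500 5000 0.002    # prints 51
# Test-queue override in head.sh: 0.2 ps segments, a frame every 100 steps
expected_frames 0.2 100 0.002     # prints 2
```

Both results agree with the NFRAMES_IN_XTC values set in head.sh (51 for a normal run, 2 when RUN_AS_A_TEST is enabled). The "+ 1" accounts for the frame written at the start of the segment.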
The sub.$CLUSTER.sh script
$ cat sub.cluster1.sh
#!/bin/bash
# This is a wrapper script so that you can define PATH, LD_LIBRARY_PATH
# A wrapper is required if you need to define via "#$ -v PATH=", etc...
# head.sh passes 11 arguments and sub.sh reads all 11, so forward all of them
/home/`whoami`/scripts/submission/sub.sh ${1} ${2} ${3} ${4} ${5} ${6} ${7} ${8} ${9} ${10} ${11}
# If you have severe NFS delay, you may need to include this next line
#sleep 60
The sub.sh script
$ cat sub.sh
#!/bin/bash
# Automated submission script
# Chris Neale November 2007
echo -n "TIMING TEST (start): "
date
MD="$1"
MYNP="$2"
MYMOL="$3"
TJUMP="$4"
NJUMPS="$5"
NFRAMES_IN_XTC="$6"
SAVE_FREQUENCY="$7"
NJUMPS_POSRE="$8"
NEVER_USE_SORT_SHUFFLE="$9"
CLUSTER="${10}"
DOUBLE_CHECK_DESORT="${11}"
case "$CLUSTER" in
*)
ED=/tools/gromacs-3.3.1/exec/bin
PED=${ED}
mpiLocation="/tools/openmpi/1.2.1"
mpirunProg="${mpiLocation}/bin/mpirun"
mdrun_mpiProg="mdrun_openmpi_v1.2.1"
###############################################
# For LAM mpi #
# mpiLocation="/tools/lam/lam-7.1.2" #
# mpirunProg="${mpiLocation}/bin/mpirun C" #
# mdrun_mpiProg="mdrun_mpi" #
###############################################
;;
esac
case "$CLUSTER" in
*)
DD=/home/`whoami`/gromacs/template
;;
esac
# The NEW variable allows use of a special writing location (e.g ${TMPDIR} or /scratch/`whoami`)
case "$CLUSTER" in
*)
NEW="."
;;
esac
case "$CLUSTER" in
*)
CPP="cpp"
;;
esac
PATH=$PATH:.
cd ${MD}
TINY_SLEEP=1
SHORT_SLEEP=10
LONG_SLEEP=60
EXTENDED_SLEEP=300
###############################################
# Startup tests:
if [ -e DO_NOT_RUN ]; then
echo "ERROR error: file DO_NOT_RUN exists... exiting"
exit
fi
for((v=0;v<2;v++)); do
num=0;
if [ -e finished_grompp ]; then
let "num=$num+1"
fi
if [ -e finished_mdrun ]; then
let "num=$num+1"
fi
if [ -e finished_desort ]; then
let "num=$num+1"
fi
if [ -e finished_test ]; then
let "num=$num+1"
fi
if((num==0)); then
echo "Unsure how to start the run. Check this out"
echo "\$ ls -l finished_grompp finished_mdrun finished_desort finished_test"
ls -l finished_grompp finished_mdrun finished_desort finished_test
echo "Perhaps you forgot to set a finished_XXX file upon starting your run?"
echo " - Otherwise there seems to be an error in the script."
else
if((num!=1)); then
echo "Unsure how to start the run. Check this out"
echo "\$ ls -l finished_grompp finished_mdrun finished_desort finished_test"
ls -l finished_grompp finished_mdrun finished_desort finished_test
echo "Only one of these files should have existed."
fi
fi
if((num==1)); then
break;
fi
echo "Will sleep then try one more time"
sleep ${LONG_SLEEP}
done
if((num!=1)); then
echo "Could not resolve multiple x for finished_x problem. Exiting"
touch ./DO_NOT_RUN
exit
fi
###############################################
# Initializations:
gromppProblemsInARow=0
reverted=0
MAX_CONSECUTIVE_GROMPP_ERRORS=2
MAX_CONSECUTIVE_MDRUN_ERRORS=2
function waitForExistNotEmpty {
# First arg controls usage:
# 0 for existence test
# 1 for not empty test
# -1 for must not exist test
# 12 for not empty test plus require single value to equal third arg
# Second arg is name of file/directory
# Third arg is the expected single value in the file if First arg is = 12
#
# Note: This overly complicated procedure is required for proper usage of a
# cluster where NFS delay can be significant and simple -s tests
# routinely fail to detect the fact that the file is empty as far as
# val=`cat file` is concerned
notEmpty="$1"
case "$notEmpty" in
1*)
eneFlag="-s $2"
;;
-1)
eneFlag="! -e $2"
;;
0)
eneFlag="-e $2"
;;
*)
echo "ERROR error: incorrect argument to waitForExistNotEmpty = $notEmpty"
exit
;;
esac
for((length=1;length<100;length++)); do
if [ ${eneFlag} ]; then
break
fi
sleep ${length}
done
if((length==100)); then
echo "ERROR error: Have slept for over 80 minutes (1+2+...+99 seconds) while waiting for the file $2 to meet conditions [ ${eneFlag} ] ... is there a problem in your script or is the NFS delay very very large?"
fi
# for all uses other than First Arg = 12 this function is over
if((notEmpty!=12)); then
return
fi
expectedVal="$3"
for((length=1;length<100;length++)); do
currentVal=`cat $2`
if [ -n "$currentVal" ]; then
# make sure variable is non-empty before making the comparison
case "$currentVal" in
$expectedVal)
echo "NOTE: breaking from loop since currentVal($currentVal)=expectedVal($expectedVal)"
break
;;
esac
fi
sleep ${length}
done
if((length==100)); then
echo "ERROR error: Have slept for over 80 minutes (1+2+...+99 seconds) while waiting for the file $2 to meet conditions [ \`cat $2\` = $expectedVal ] ... is there a problem in your script or is the NFS delay very very large?"
fi
}
###############################################
# The main loop:
for ((njump=0;njump<NJUMPS;njump++)); do
# GROMPP (WITH SORTING)
if [ -e finished_test ]; then
NIN=`cat finished_test`
TINIT=`cat finished_next_start_time`
let "NOUT=$NIN+1"
DIR=md${NOUT}_running
PREV=md${NIN}_success
if [ ! -e ${PREV} ]; then
echo "There was some problem. Expected ${PREV} to exist, but it does not"
touch ./DO_NOT_RUN
exit
fi
if [ ! -e ${NEW}/${DIR} ]; then
mkdir ${NEW}/${DIR}
fi
# Number of MD steps per segment; this assumes a timestep of dt = 0.002 ps in the .mdp file
nsteps=`echo "$TJUMP/0.002" | bc -l | awk -F '.' '{print $1}'`
if((NOUT<=NJUMPS_POSRE)); then
sed "s/TINIT/${TINIT}/" ${MYMOL}_posre.mdp | sed "s/NSTEPS/${nsteps}/" | sed "s/SAVE_FREQUENCY/${SAVE_FREQUENCY}/" | sed "s/CPP/${CPP}/" > ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp
posreFlag="-r md0_success/${MYMOL}_md0_deshuffleddesorted.gro"
else
sed "s/TINIT/${TINIT}/" ${MYMOL}.mdp | sed "s/NSTEPS/${nsteps}/" | sed "s/SAVE_FREQUENCY/${SAVE_FREQUENCY}/" | sed "s/CPP/${CPP}/" > ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp
posreFlag=""
fi
waitForExistNotEmpty 1 ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp
if [ ! `tail -1 ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp | awk '{print $1}'` = ";EOF" ]; then sleep ${TINY_SLEEP}; fi
if [ ! `tail -1 ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp | awk '{print $1}'` = ";EOF" ]; then sleep ${SHORT_SLEEP}; fi
if [ ! `tail -1 ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp | awk '{print $1}'` = ";EOF" ]; then sleep ${LONG_SLEEP}; fi
if((NIN!=0)); then
if((NOUT<=NJUMPS_POSRE || NEVER_USE_SORT_SHUFFLE==1 || MYNP==1)); then
# Can not do shuffle/sort with posre
${ED}/grompp -np ${MYNP} -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp ${posreFlag} -c ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro -t ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.trr -p ${MYMOL}.top -n ${MYMOL}.ndx -e ${PREV}/${MYMOL}_md${NIN}.edr -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr
rm mdout.mdp &
else
# Shuffle the .trr input file correctly. Assume that it is not currently shuffled
${ED}/grompp -np ${MYNP} -shuffle -sort -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp ${posreFlag} -c ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro -p ${MYMOL}.top -n ${MYMOL}.ndx -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}_a.tpr -deshuf ${NEW}/${DIR}/deshuffle_md${NOUT}_a.ndx
rm -f ${NEW}/${DIR}/deshuffle_md${NOUT}_a.ndx mdout.mdp &
echo System | ${ED}/editconf -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}_a.tpr -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit_a.gro
# g_desort -f original shuffled will unshuffle, therefore g_desort -f shuffled original will REshuffle
${DD}/g_desort -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit_a.gro ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro -o ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}_a.ndx -n 6
${ED}/trjconv -f ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.trr -o ${PREV}/${MYMOL}_md${NIN}_reshuffleresort.trr -n ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}_a.ndx
# Create the run input file
${ED}/grompp -np ${MYNP} -shuffle -sort -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp ${posreFlag} -c ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro -t ${PREV}/${MYMOL}_md${NIN}_reshuffleresort.trr -p ${MYMOL}.top -n ${MYMOL}.ndx -deshuf ${NEW}/${DIR}/deshuffle_md${NOUT}.ndx -e ${PREV}/${MYMOL}_md${NIN}.edr -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr
rm -f ${NEW}/${DIR}/deshuffle_md${NOUT}.ndx mdout.mdp &
# In the future: implement this check. Note that this will require rethinking the _a postfixes
# since the files without the _a postfixes are the ones that I actually use
#if((DOUBLE_CHECK_DESORT!=0)); then
# If a new reshuffle.ndx file differs then the run is invalid.
echo System | ${ED}/editconf -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit.gro
${DD}/g_desort -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit.gro ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro -o ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}.ndx -n 6
case "$CLUSTER" in
*)
look=`diff -q ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}_a.ndx ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}.ndx`
;;
esac
if [ -n "$look" ]; then
echo "There was a big problem. ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}_a.ndx and ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}.ndx are different."
mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr ${NEW}/${DIR}/${MYMOL}_md${NOUT}_notValid.tpr
fi
# End of the new test
#fi
#Create the deshuffle file to properly handle the next run
${DD}/g_desort -f ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit.gro -o ${NEW}/${DIR}/deshuffledesort${MYMOL}_md${NOUT}.ndx -n 6
rm -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}_a.tpr ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit_a.gro ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}_a.ndx ${PREV}/${MYMOL}_md${NIN}_reshuffleresort.trr ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit.gro ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}.ndx mdout.mdp &
fi
else
if((NOUT<=NJUMPS_POSRE || NEVER_USE_SORT_SHUFFLE==1 || MYNP==1)); then
# Can not do shuffle/sort with posre
${ED}/grompp -np ${MYNP} -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp ${posreFlag} -c ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro -p ${MYMOL}.top -n ${MYMOL}.ndx -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr
rm mdout.mdp &
else
${ED}/grompp -np ${MYNP} -shuffle -sort -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp ${posreFlag} -c ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro -p ${MYMOL}.top -n ${MYMOL}.ndx -deshuf ${NEW}/${DIR}/deshuffle_md${NOUT}.ndx -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr
rm -f ${NEW}/${DIR}/deshuffle_md${NOUT}.ndx mdout.mdp &
echo System | ${ED}/editconf -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit.gro
${DD}/g_desort -f ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit.gro -o ${NEW}/${DIR}/deshuffledesort${MYMOL}_md${NOUT}.ndx -n 6
rm -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit.gro &
fi
fi
if [ -e ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr ]; then
echo ${NOUT} > finished_grompp
waitForExistNotEmpty 12 finished_grompp ${NOUT}
rm -f finished_test
gromppProblemsInARow=0
else
# can not revert
let "gromppProblemsInARow=$gromppProblemsInARow+1"
if((gromppProblemsInARow>MAX_CONSECUTIVE_GROMPP_ERRORS)); then
touch ./DO_NOT_RUN
exit
fi
fi
fi
# MDRUN
if [ -e finished_grompp ]; then
NOUT=`cat finished_grompp`
TINIT=`cat finished_next_start_time`
DIR=md${NOUT}_running
# Reversion is important in cases where a crash or time overrun leads to loss of data in ${TMPDIR}
if [ ! -e ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr ]; then
sleep ${SHORT_SLEEP}
if [ ! -e ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr ]; then
rm -f finished_grompp
echo "ERROR error: finished_grompp existed for NOUT=$NOUT but ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr did not exist"
if((reverted==0)); then
echo " reverting to finished_test"
reverted=1
let "NIN=$NOUT-1"
echo "$NIN" > finished_test
if [ ! -s finished_test ]; then sleep ${SHORT_SLEEP}; fi
continue
else
touch ./DO_NOT_RUN
exit
fi
else
reverted=0;
fi
fi
if((MYNP==1)); then
${ED}/mdrun -deffnm ${NEW}/${DIR}/${MYMOL}_md${NOUT}
# Capture the exit status; backticks would capture the program's stdout instead
returnValue=$?
else
case "$CLUSTER" in
*)
${mpirunProg} ${PED}/${mdrun_mpiProg} -np ${MYNP} -deffnm ${NEW}/${DIR}/${MYMOL}_md${NOUT}
returnValue=$?
;;
esac
fi
if((returnValue!=0)); then
echo "ERROR error: mpirun for mdrun_mpi returned non-zero (${returnValue}). Exiting"
exit
fi
echo ${NOUT} > finished_mdrun
waitForExistNotEmpty 12 finished_mdrun ${NOUT}
rm -f finished_grompp
fi
# DESORT
if [ -e finished_mdrun ]; then
NOUT=`cat finished_mdrun`
TINIT=`cat finished_next_start_time`
DIR=md${NOUT}_running
# Reversion is important in cases where a crash or time overrun leads to loss of data in ${TMPDIR}
if [ ! -e ${NEW}/${DIR}/${MYMOL}_md${NOUT}.xtc ]; then
rm -f finished_mdrun
echo "ERROR error: finished_mdrun existed for NOUT=$NOUT but ${NEW}/${DIR}/${MYMOL}_md${NOUT}.xtc did not exist"
if((reverted==0)); then
echo " reverting to finished_test"
reverted=1
let "NIN=$NOUT-1"
echo "$NIN" > finished_test
waitForExistNotEmpty 12 finished_test ${NIN}
continue
else
touch ./DO_NOT_RUN
exit
fi
else
reverted=0;
fi
if((NOUT<=NJUMPS_POSRE || NEVER_USE_SORT_SHUFFLE==1 || MYNP==1)); then
# Can not do shuffle/sort with posre
mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}.xtc ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.xtc
mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}.trr ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.trr
mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}.gro ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.gro
else
echo System | ${ED}/trjconv -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.xtc -s ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr -n ${NEW}/${DIR}/deshuffledesort${MYMOL}_md${NOUT}.ndx -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.xtc &
echo System | ${ED}/trjconv -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.trr -s ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr -n ${NEW}/${DIR}/deshuffledesort${MYMOL}_md${NOUT}.ndx -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.trr &
echo System | ${ED}/trjconv -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.gro -s ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr -n ${NEW}/${DIR}/deshuffledesort${MYMOL}_md${NOUT}.ndx -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.gro &
# Wait for all 3 desorts to finish
wait
fi
echo ${NOUT} > finished_desort
waitForExistNotEmpty 12 finished_desort ${NOUT}
rm -f finished_mdrun
fi
# TEST
if [ -e finished_desort ]; then
TINIT=`cat finished_next_start_time`
runHasNoErrors=1
NOUT=`cat finished_desort`
DIR=md${NOUT}_running
# Reversion is important in cases where a crash or time overrun leads to loss of data in ${TMPDIR}
if [ ! -e ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.xtc ]; then
rm -f finished_desort
echo "ERROR error: finished_desort existed for NOUT=$NOUT but ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.xtc did not exist"
if((reverted==0)); then
echo " reverting to finished_test"
reverted=1
let "NIN=$NOUT-1"
echo "$NIN" > finished_test
waitForExistNotEmpty 12 finished_test ${NIN}
continue
else
touch ./DO_NOT_RUN
exit
fi
else
reverted=0;
fi
${ED}/gmxcheck -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.xtc 2> ${NEW}/${DIR}/checkXTC
# Ensure no magic number error
magicNumberError=`grep Error ${NEW}/${DIR}/checkXTC | wc -l`
if((magicNumberError>0)); then
mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.xtc ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted_magicNumberError.xtc
mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.trr ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted_magicNumberError.trr
mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.gro ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted_magicNumberError.gro
runHasNoErrors=0
fi
## Ensure expected number of frames is reached -- this is system and mdp file specific
numFramesXTC=`grep "^Step" ${NEW}/${DIR}/checkXTC | awk '{print $2}'`
if((numFramesXTC!=NFRAMES_IN_XTC)); then
mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.xtc ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted_incompleteFrames.xtc
mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.trr ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted_incompleteFrames.trr
mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.gro ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted_incompleteFrames.gro
runHasNoErrors=0
fi
if((runHasNoErrors)); then
mv ${NEW}/${DIR} md${NOUT}_success
waitForExistNotEmpty -1 ${NEW}/${DIR}
while [ -e ${NEW}/${DIR} ]; do
mv ${NEW}/${DIR} md${NOUT}_success
sleep ${LONG_SLEEP}
done
echo ${NOUT} > finished_test
nextTime=`echo "${TINIT}+${TJUMP}" | bc -l`
echo ${nextTime} > finished_next_start_time
waitForExistNotEmpty 12 finished_test ${NOUT}
waitForExistNotEmpty 12 finished_next_start_time ${nextTime}
# Now do some clean up operations
let "NCLEAN=$NOUT-2"
if [ -e md${NCLEAN}_success ]; then
if [ ! -e DATA ]; then
mkdir DATA
fi
if((NCLEAN!=0)); then
# Don't remove or modify the starting directory
mkdir DATA/md${NCLEAN}_success
mv md${NCLEAN}_success/${MYMOL}_md${NCLEAN}_deshuffleddesorted.xtc DATA/md${NCLEAN}_success
waitForExistNotEmpty -1 md${NCLEAN}_success/${MYMOL}_md${NCLEAN}_deshuffleddesorted.xtc
while [ -e md${NCLEAN}_success/${MYMOL}_md${NCLEAN}_deshuffleddesorted.xtc ]; do
mv md${NCLEAN}_success/${MYMOL}_md${NCLEAN}_deshuffleddesorted.xtc DATA/md${NCLEAN}_success
sleep ${LONG_SLEEP}
done
mv md${NCLEAN}_success/${MYMOL}_md${NCLEAN}.edr DATA/md${NCLEAN}_success
waitForExistNotEmpty -1 md${NCLEAN}_success/${MYMOL}_md${NCLEAN}.edr
while [ -e md${NCLEAN}_success/${MYMOL}_md${NCLEAN}.edr ]; do
mv md${NCLEAN}_success/${MYMOL}_md${NCLEAN}.edr DATA/md${NCLEAN}_success
sleep ${LONG_SLEEP}
done
rm -rf md${NCLEAN}_success &
fi
fi
else
# Send the run back to do the mdrun
for((i=1;i<=MAX_CONSECUTIVE_MDRUN_ERRORS;i++)); do
if [ ! -e md${NOUT}_failure${i} ]; then
break
fi
done
mv ${NEW}/${DIR} md${NOUT}_failure${i}
if((i>MAX_CONSECUTIVE_MDRUN_ERRORS)); then
echo "Too many failures for run $NOUT"
touch ./DO_NOT_RUN
exit
fi
# Send it back by setting as if grompp just finished
mkdir ${NEW}/${DIR}
mv md${NOUT}_failure${i}/${MYMOL}_md${NOUT}.tpr md${NOUT}_failure${i}/deshuffledesort${MYMOL}_md${NOUT}.ndx ${NEW}/${DIR}
echo ${NOUT} > finished_grompp
waitForExistNotEmpty 12 finished_grompp ${NOUT}
fi
rm -f finished_desort
fi
wait
done
wait
echo -n "TIMING TEST (end): "
date
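The startup test at the top of sub.sh requires exactly one of the four finished_* marker files to exist, and that file decides which stage the main loop enters next. The idea can be illustrated on its own with the following sketch; current_stage is a hypothetical helper and does not appear in sub.sh.

```bash
#!/bin/bash
# Hypothetical helper: report which finished_* marker is present in a run
# directory. Exactly one must exist; anything else is treated as ambiguous.
# Usage: current_stage [DIR]
current_stage() {
    local dir="${1:-.}" stage found="" count=0
    for stage in grompp mdrun desort test; do
        if [ -e "${dir}/finished_${stage}" ]; then
            found="$stage"
            count=$((count + 1))
        fi
    done
    if ((count != 1)); then
        echo "ambiguous"
        return 1
    fi
    echo "$found"
}
```

Running `current_stage /my/directory` prints, for example, `test` when the previous segment has been fully checked, which means the next action is grompp for the following segment. A result of `ambiguous` corresponds to the situation in which sub.sh sleeps, retries once, and then creates DO_NOT_RUN.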