Page last modified 21:40, 5 Jan 2010 by mabraham?

Checkpointing Jobs

Here is a set of generic scripts for checkpointing GROMACS 3.x jobs and automatically resubmitting them.

Disclaimer

While the author uses these scripts and considers them to be high-quality, the author makes no warranty about their accuracy. Further, the scripts that appear below may have been modified from their original content since this is a wiki site. Use these scripts at your own risk.

 

Estimated Time Investment

If you're a bash scripting pro, it shouldn't take too long. Otherwise, plan on spending a day dedicated to getting this up and running. And please, read this entire posting.

 

Overall Description

1. The head.sh script is run from the command line as an interactive job. It submits a chain of jobs such that job N+1 starts once job N has completed. Each job in the chain is a sub.$CLUSTER.sh script, which is just a wrapper for sub.sh. sub.sh is the real meat of the system and does not need to be customized for each run.

2. A section at the top of the head.sh file must be customized for each run. However, the sub.sh and sub.$CLUSTER.sh scripts do not require any job-specific modification.

 

Why is it so Complicated?

As the author encountered problems on different clusters, the script was augmented to deal with these issues automatically.

1. The script handles NFS delay problems gracefully

2. The script recovers from crashes gracefully

3. The script checks your .xtc files to ensure that they are complete and free of corruption

4. The script keeps only the most important files from your run (.edr, .xtc, starting .gro) in order to reduce hard disk usage.
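
The NFS handling in item 1 comes down to polling: after writing a marker file, the script re-reads it until the expected value is visible instead of trusting a single test. The sketch below is a simplified illustration of that idea, not the waitForExistNotEmpty function from sub.sh. The function name wait_for_value and its retry parameters are made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: poll FILE until it holds EXPECTED, or give up.
# usage: wait_for_value FILE EXPECTED [MAX_TRIES] [PAUSE_SECONDS]
wait_for_value() {
  local file="$1" expected="$2" max_tries="${3:-10}" pause="${4:-1}"
  local try=1 current
  while [ "$try" -le "$max_tries" ]; do
    # -s is true only when the file exists and is non-empty
    if [ -s "$file" ]; then
      current="$(cat "$file")"
      if [ "$current" = "$expected" ]; then
        return 0
      fi
    fi
    sleep "$pause"
    try=$((try + 1))
  done
  return 1
}
```

A caller might write "echo 3 > finished_grompp" and then "wait_for_value finished_grompp 3 || touch DO_NOT_RUN", so that a marker that never becomes visible stops the chain instead of being silently ignored.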

 

Programs that you will need

You will need the G_DESORT package, available on the user submission page, if you want to use sorting. Currently, sorting is combined with shuffling and there is no option to shuffle without sorting (although that would be a simple script modification). If you do not want to shuffle or sort, set NEVER_USE_SORT_SHUFFLE to 1 in head.sh; sub.sh tests for the value 1 specifically. Note also that the DD variable in sub.sh specifies the directory in which to find g_desort. A further note about g_desort: the author uses it, but it is not supported by the developers, so use it at your own risk.

 

Cluster-specific modification of these scripts

The user is expected to customize these scripts for their queuing system. In order to set up your environment, find all cases of the string:

case "$CLUSTER" in

and add a section for the name that you will give your cluster that contains the appropriate settings. Let's say, for example, that you use sqsub instead of qsub on your cluster named bigCluster. You would need to modify the following section of head.sh so that the original:

case "$CLUSTER" in
*)
  subCommand="qsub"
  ;;
esac

becomes:

case "$CLUSTER" in
bigCluster)
  subCommand="sqsub"
  ;;
*)
  subCommand="qsub"
  ;;
esac

And then at the top of head.sh you would define:

CLUSTER=bigCluster

Once you have made these changes, you will not need to make many modifications in order to start a new run.
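
To locate every cluster-specific block that needs a new entry, a fixed-string grep over the scripts may help. This is only a convenience sketch; the function name list_cluster_blocks is an invention for this example.

```shell
#!/bin/bash
# Hypothetical helper: print file:line for every cluster-specific case block.
# -F treats the pattern as a fixed string, so the "$" needs no escaping.
# The extra /dev/null argument makes grep print the file name even when
# only one script is given.
list_cluster_blocks() {
  grep -nF 'case "$CLUSTER" in' "$@" /dev/null
}
```

For example, "list_cluster_blocks head.sh sub.sh" prints one line per block, which can be worked through from top to bottom while adding the bigCluster entries.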

 

How the iteration process occurs

1. User creates two files.

echo 0 > finished_test
echo 0 > finished_next_start_time

The finished_test file indicates that the 0th iteration has completed, and the finished_next_start_time file indicates that the next round will start at 0 ps.

2. User creates a directory called md0_success and puts their starting .gro file there. The naming of the .gro file is important. Let's say your MYMOL (as defined at the top of head.sh) was:

MYMOL=proteinInWater

Then your md0_success directory must contain a file named proteinInWater_md0_deshuffleddesorted.gro

$ ls md0_success
proteinInWater_md0_deshuffleddesorted.gro

The 'deshuffleddesorted' part of the name is required even if you do not use sorting or shuffling. It's just built into the script because the script has optional sorting and shuffling.

3. Create your .top and .ndx files and name them as $MYMOL.top and $MYMOL.ndx

4. Create your .mdp file and name it as $MYMOL.mdp

4.a) The last line of your .mdp file must be ;EOF

4.b) Your .mdp file must contain placeholder flags that sub.sh replaces using sed, based on your options in head.sh. Make sure your .mdp file contains lines that look exactly like this:

cpp                 =  CPP
nsteps              =  NSTEPS
tinit               =  TINIT
nstxout             =  NSTEPS
nstvout             =  NSTEPS
nstfout             =  NSTEPS
nstxtcout           =  SAVE_FREQUENCY

5. cp ${MYMOL}.mdp ${MYMOL}_posre.mdp and then modify the new file. It is used for the first $NJUMPS_POSRE segments as defined in head.sh (sub.sh tests NOUT<=NJUMPS_POSRE). Personally, the author sets the define for position restraints on the protein for the first segment and also sets gen_vel and unconstrained_start differently in the two .mdp files, so that velocities are generated on the first segment and the other segments restart properly.

Note: This script never uses tpbconv because it was designed for re-sorting. However, it should be a simple matter to modify it if you want to use tpbconv instead.

6. chmod +x on all the scripts
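
Steps 1, 2 and 4.a above can be scripted so that a new run starts from a known state. The following is a minimal sketch under the naming rules described in this section. The function names init_run_dir and mdp_ends_with_eof are hypothetical and are not part of head.sh or sub.sh.

```shell
#!/bin/bash
# Hypothetical helper for steps 1 and 2: seed the state files and md0_success.
# usage: init_run_dir MYMOL START_GRO   (run from the working directory)
init_run_dir() {
  local mymol="$1" start_gro="$2"
  if [ ! -s "$start_gro" ]; then
    echo "ERROR: starting structure $start_gro is missing or empty" >&2
    return 1
  fi
  echo 0 > finished_test
  echo 0 > finished_next_start_time
  mkdir -p md0_success
  # The deshuffleddesorted suffix is required even without sorting/shuffling
  cp "$start_gro" "md0_success/${mymol}_md0_deshuffleddesorted.gro"
}

# Hypothetical check for step 4.a: the first field of the last line must be ;EOF
mdp_ends_with_eof() {
  [ "$(tail -n 1 "$1" | awk '{print $1}')" = ";EOF" ]
}
```

A typical call would be "init_run_dir proteinInWater start.gro" followed by "mdp_ends_with_eof proteinInWater.mdp || echo 'add ;EOF as the last line'".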

 

Location of the Scripts

A) head.sh goes in your working directory along with .mdp .ndx .top files and the md0_success directory that contains the .gro file.

B) sub.sh and sub.$CLUSTER.sh files go in /home/`whoami`/scripts/submission/ since many different jobs should all use the same sub.sh script. If you want to put the sub.sh scripts somewhere else, you will need to modify the line near the bottom of head.sh that submits the job, and the path to sub.sh inside sub.$CLUSTER.sh.

 

Running the Script

cd to your working directory and:

nohup ./head.sh &
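
Before launching, it can be worth confirming that the working directory passes the same startup test that sub.sh applies: no DO_NOT_RUN file, and exactly one finished_* marker. The sketch below mirrors that test. The function names count_stage_markers and preflight_ok are hypothetical.

```shell
#!/bin/bash
# Hypothetical pre-flight check mirroring the startup test in sub.sh:
# exactly one of the four finished_* markers should exist.
count_stage_markers() {
  local n=0 f
  for f in finished_grompp finished_mdrun finished_desort finished_test; do
    if [ -e "$f" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

preflight_ok() {
  if [ -e DO_NOT_RUN ]; then
    echo "DO_NOT_RUN exists; remove it once the problem is understood" >&2
    return 1
  fi
  [ "$(count_stage_markers)" -eq 1 ]
}
```

Running "preflight_ok" in the working directory and checking its exit status before "nohup ./head.sh &" avoids queuing a chain that would exit immediately.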

 

Files Created During Runtime

tmp.sub: the output from your submission that is used to capture job information

wait_for_this_pid: the pid of the last job in the chain

nohup.out: a list of submitted jobs with pids in the correct chained order

DATA/ directory --> This stores your .xtc and your .edr files from N-3 and older where N is the currently running job that has not finished. Note that N-1 and N-2 directories contain all files (.log, .trr, .tpr, etc) but most of these files are deleted when archiving into the DATA directory. If you want to keep more files then you should modify the source code of sub.sh in the 'finished_test' section.
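
When collecting the archived trajectories later, note that a plain glob over DATA/ returns segments in lexical order (md1, md10, md2, ...). The sketch below lists the archived .xtc files in segment order instead. It assumes the DATA/mdN_success layout described above; the function name list_archived_xtc is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print archived .xtc paths in segment order (1, 2, ..., 10).
# usage: list_archived_xtc MYMOL [DATA_DIR]
list_archived_xtc() {
  local mymol="$1" data_dir="${2:-DATA}"
  local d n xtc
  for d in "$data_dir"/md*_success; do
    [ -d "$d" ] || continue
    n="${d##*/md}"        # DATA/md10_success -> 10_success
    n="${n%_success}"     # 10_success        -> 10
    xtc="$d/${mymol}_md${n}_deshuffleddesorted.xtc"
    if [ -e "$xtc" ]; then
      echo "$n $xtc"
    fi
  done | sort -n | cut -d ' ' -f 2-
}
```

The ordered list can then be handed to whatever tool is used to join the segments, for example as the input file list for trjcat.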

 

Submitting more jobs to a chain that is already running

Just run head.sh again. It is for this reason that you shouldn't delete the wait_for_this_pid file at any time. Note that if you do delete that file then you can use nohup.out to determine the pid of the last job in the chain and put that into wait_for_this_pid.
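
If wait_for_this_pid has been lost, the recovery described above can be scripted. The sketch relies on the "-- received pid = <id>" suffix that head.sh echoes for each submission; the function name recover_last_pid is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: recover the last job id that head.sh logged.
# usage: recover_last_pid [LOGFILE]   (LOGFILE defaults to nohup.out)
recover_last_pid() {
  local log="${1:-nohup.out}"
  grep 'received pid = ' "$log" | tail -n 1 | sed 's/.*received pid = //'
}
```

Used as "recover_last_pid > wait_for_this_pid". Since nohup.out accumulates output across runs, it is worth checking that the recovered id really belongs to the last job in the current chain.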

 

The head.sh script

$ cat head.sh
#!/bin/bash
# Automated submission script
# Chris Neale November 2007

PATH=$PATH:.

##########################################################
# Basic setup options:
MYMOL=pagagg         # starting name for your .gro, .mdp, .top (etc) files
REPLICA=1            # Just a flag to differentiate runs if you are running repeat identical simulations at once.
MD=/my/directory     # working directory
CLUSTER=cluster1     # a tag used to define cluster-specific characteristics for ease of porting between clusters
NUM_TO_SUBMIT=1      # number of chained jobs to submit
MYNP=4               # number of CPUs to occupy
TJUMP=500            # Time (in ps) for a single step
NJUMPS=1000          # Number of steps before dying and allowing a new chained job to start
NFRAMES_IN_XTC=51    # Number of frames that you expect in the .xtc file = TJUMP/(SAVE_FREQUENCY*dt) + 1
SAVE_FREQUENCY=5000  # Save the .xtc every this number of timesteps
NJUMPS_POSRE=1       # The number of initial steps that will be done by position restraints


##########################################################
# More advanced setup options:

RUN_AS_A_TEST=0               # use a test queue
NEVER_USE_SORT_SHUFFLE=0      # set to 1 to always skip sorting and shuffling (sub.sh tests for ==1)
RUNTIME_LIMIT=1w              # runtime limit passed to the queue when APPLY_RUNTIME_LIMIT=1
APPLY_RUNTIME_LIMIT=0         # set to 1 to actually apply a runtime limit
DOUBLE_CHECK_DESORT=0         # currently not implemented

##########################################################
# Things below this line do not usually need to be changed

cd ${MD}

if((RUN_AS_A_TEST)); then
  # Override the previously setup options
  MYNP=2
  TJUMP=0.2
  NJUMPS=1
  NFRAMES_IN_XTC=2
  SAVE_FREQUENCY=100
  NJUMPS_POSRE=0
  testFlag="--test"
else
  testFlag=""
fi

case "$CLUSTER" in
*)
  if((APPLY_RUNTIME_LIMIT==1)); then
    timeFlag="-r $RUNTIME_LIMIT"
  else
    timeFlag=""
  fi
  ;;
esac

case "$CLUSTER" in
*)
  subCommand="qsub"
  ;;
esac

if((MYNP>1)); then
  case "$CLUSTER" in
  *)
    queueIdent="-q mpi --nompirun"
    ;;
  esac
else
  case "$CLUSTER" in
  *)
    queueIdent="";;
  esac
fi

for((i=0;i<NUM_TO_SUBMIT; i++)); do

  if [ -e DO_NOT_RUN ]; then
    echo "ERROR: file DO_NOT_RUN exists... exiting"
    exit
  fi
  
  if [ -e wait_for_this_pid ]; then
    pid=`cat wait_for_this_pid | awk '{print $1}'`
    case "$CLUSTER" in
    *)
      waitFlag="-w $pid"
      ;;
    esac
  else
    waitFlag=""
  fi

  case "$CLUSTER" in
  *)
    nameFlag="-N ${MYMOL}_${REPLICA}"
    mynpFlag="-n ${MYNP}"
    subName=".${CLUSTER}"
    ;;
  esac

  ${subCommand} ${testFlag} ${waitFlag} ${timeFlag} ${queueIdent} ${nameFlag} ${mynpFlag} -o `pwd`/out.${i} -e `pwd`/err.${i} /home/`whoami`/scripts/submission/sub${subName}.sh ${MD} ${MYNP} ${MYMOL} ${TJUMP} ${NJUMPS} ${NFRAMES_IN_XTC} ${SAVE_FREQUENCY} ${NJUMPS_POSRE} ${NEVER_USE_SORT_SHUFFLE} ${CLUSTER} ${DOUBLE_CHECK_DESORT} > tmp.sub 2>&1
  case "$CLUSTER" in
  *)
    pid=`tail -n 1 tmp.sub | awk '{print $4}'`
    ;;
  esac
  
  echo "Submitting ${subCommand} ${testFlag} ${waitFlag} ${timeFlag} ${queueIdent} ${nameFlag} ${mynpFlag} -o `pwd`/out.${i} -e `pwd`/err.${i} /home/`whoami`/scripts/submission/sub${subName}.sh ${MD} ${MYNP} ${MYMOL} ${TJUMP} ${NJUMPS} ${NFRAMES_IN_XTC} ${SAVE_FREQUENCY} ${NJUMPS_POSRE} ${NEVER_USE_SORT_SHUFFLE} ${CLUSTER} ${DOUBLE_CHECK_DESORT} -- received pid = $pid"
  echo "$pid" > wait_for_this_pid
  sleep 1

done
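
The values of TJUMP, SAVE_FREQUENCY and NFRAMES_IN_XTC in the setup block must agree with each other, following the formula in the NFRAMES_IN_XTC comment. The helpers below reproduce that arithmetic as a sketch. They assume a timestep of 0.002 ps, which is the value hard-coded in the nsteps calculation of sub.sh, and they round to the nearest integer where sub.sh truncates. The function names segment_nsteps and expected_xtc_frames are hypothetical.

```shell
#!/bin/bash
# Hypothetical helpers reproducing the arithmetic behind the setup block.
# dt defaults to 0.002 ps, the value hard-coded in sub.sh.

# usage: segment_nsteps TJUMP [DT]
segment_nsteps() {
  awk -v t="$1" -v dt="${2:-0.002}" 'BEGIN { printf "%d\n", int(t / dt + 0.5) }'
}

# usage: expected_xtc_frames TJUMP SAVE_FREQUENCY [DT]
expected_xtc_frames() {
  awk -v t="$1" -v f="$2" -v dt="${3:-0.002}" \
    'BEGIN { printf "%d\n", int(t / (f * dt) + 0.5) + 1 }'
}
```

With the defaults shown in head.sh, "expected_xtc_frames 500 5000" prints 51, matching NFRAMES_IN_XTC, and the test-queue settings (TJUMP=0.2, SAVE_FREQUENCY=100) give 2.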

 

The sub.$CLUSTER.sh script

$ cat sub.cluster1.sh 
#!/bin/bash

# This is a wrapper script so that you can define PATH, LD_LIBRARY_PATH
# A wrapper is required if you need to define via "#$ -v PATH=", etc...

# Forward all arguments unchanged (head.sh passes 11)
/home/`whoami`/scripts/submission/sub.sh "$@"

# If you have severe NFS delay, you may need to include this next line
#sleep 60

 

The sub.sh script

$ cat sub.sh 
#!/bin/bash
# Automated submission script
# Chris Neale November 2007

echo -n "TIMING TEST (start): "
date
MD="$1"
MYNP="$2"
MYMOL="$3"
TJUMP="$4"
NJUMPS="$5"
NFRAMES_IN_XTC="$6"
SAVE_FREQUENCY="$7"
NJUMPS_POSRE="$8"
NEVER_USE_SORT_SHUFFLE="$9"
CLUSTER="${10}"
DOUBLE_CHECK_DESORT="${11}"

case "$CLUSTER" in
*)
  ED=/tools/gromacs-3.3.1/exec/bin
  PED=${ED}
  mpiLocation="/tools/openmpi/1.2.1"
  mpirunProg="${mpiLocation}/bin/mpirun"
  mdrun_mpiProg="mdrun_openmpi_v1.2.1"

  ###############################################
  # For LAM mpi                                 #
  # mpiLocation="/tools/lam/lam-7.1.2"          #
  # mpirunProg="${mpiLocation}/bin/mpirun C"    #
  # mdrun_mpiProg="mdrun_mpi"                   #
  ###############################################

  ;;
esac

case "$CLUSTER" in
*)
  DD=/home/`whoami`/gromacs/template
  ;;
esac

# The NEW variable allows use of a special writing location (e.g ${TMPDIR} or /scratch/`whoami`)
case "$CLUSTER" in
*)
  NEW="."
  ;;
esac

case "$CLUSTER" in
*)
  CPP="cpp"
  ;;
esac

PATH=$PATH:.

cd ${MD}

TINY_SLEEP=1
SHORT_SLEEP=10
LONG_SLEEP=60
EXTENDED_SLEEP=300

###############################################
# Startup tests:

if [ -e DO_NOT_RUN ]; then
  echo "ERROR error: file DO_NOT_RUN exists... exiting"
  exit
fi

for((v=0;v<2;v++)); do
  num=0;
  if [ -e finished_grompp ]; then
    let "num=$num+1"
  fi
  if [ -e finished_mdrun ]; then
    let "num=$num+1"
  fi
  if [ -e finished_desort ]; then
    let "num=$num+1"
  fi
  if [ -e finished_test ]; then
    let "num=$num+1"
  fi
  
  if((num==0)); then
    echo "Unsure how to start the run. Check this out"
    echo '$ ls -l finished_grompp finished_mdrun finished_desort finished_test'
    ls -l finished_grompp finished_mdrun finished_desort finished_test
    echo "Perhaps you forgot to set a finished_XXX file upon starting your run?"
    echo "  - Otherwise there seems to be an error in the script."
  else 
    if((num!=1)); then
      echo "Unsure how to start the run. Check this out"
      echo '$ ls -l finished_grompp finished_mdrun finished_desort finished_test'
      ls -l finished_grompp finished_mdrun finished_desort finished_test
      echo "Only one of these files should have existed."
    fi
  fi
  if((num==1)); then
    break; 
  fi
  echo "Will sleep then try one more time"
  sleep ${LONG_SLEEP}
done

if((num!=1)); then
  echo "Could not resolve multiple x for finished_x problem. Exiting"
  touch ./DO_NOT_RUN
  exit
fi

###############################################
# Initializations:

gromppProblemsInARow=0
reverted=0
MAX_CONSECUTIVE_GROMPP_ERRORS=2
MAX_CONSECUTIVE_MDRUN_ERRORS=2

function waitForExistNotEmpty {
  # First arg controls usage: 
  #   0  for existence test 
  #   1  for not empty test 
  #  -1  for must not exist test
  #   12 for not empty test plus require single value to equal third arg
  # Second arg is name of file/directory
  # Third arg is the expected single value in the file if First arg is = 12 
  #
  # Note: This overly complicated procedure is required for proper usage of a
  #       cluster where NFS delay can be significant and simple -s tests
  #       routinely fail to detect the fact that the file is empty as far as 
  #       val=`cat file` is concerned
  notEmpty="$1"
  case "$notEmpty" in
  1*)
    eneFlag="-s $2"
    ;;
  -1)
    eneFlag="! -e $2"
    ;;
  0)
    eneFlag="-e $2"
    ;;
  *)
    echo "ERROR error: incorrect argument to waitForExistNotEmpty = $notEmpty"
    exit
    ;;
  esac

  for((length=1;length<100;length++)); do
    if [ ${eneFlag} ]; then 
      break
    fi
    sleep ${length}
  done
  if((length==100)); then
    echo "ERROR error: Have slept for more than 80 minutes while waiting for the file $2 to meet conditions [ ${eneFlag} ] ... is there a problem in your script or is the NFS delay very very large?"
  fi

  # for all uses other than First Arg = 12 this function is over
  if((notEmpty!=12)); then
    return
  fi
  expectedVal="$3"
  for((length=1;length<100;length++)); do
    currentVal=`cat $2`
    if [ -n "$currentVal" ]; then
      # make sure variable is non-empty before making the comparison
      case "$currentVal" in
      $expectedVal)
        echo "NOTE: breaking from loop since currentVal($currentVal)=expectedVal($expectedVal)"
        break
        ;;
      esac
    fi
    sleep ${length}
  done
  if((length==100)); then
    echo "ERROR error: Have slept for more than 80 minutes while waiting for the file $2 to meet conditions [ \`cat $2\` = $expectedVal ] ... is there a problem in your script or is the NFS delay very very large?"
  fi
}

###############################################
# The main loop:

for ((njump=0;njump<NJUMPS;njump++)); do

  # GROMPP (WITH SORTING)
  if [ -e finished_test ]; then
    NIN=`cat finished_test`
    TINIT=`cat finished_next_start_time`
    let "NOUT=$NIN+1"
    DIR=md${NOUT}_running
    PREV=md${NIN}_success
    if [ ! -e ${PREV} ]; then
      echo "There was some problem. Expected ${PREV} to exist, but it does not"
      touch ./DO_NOT_RUN
      exit
    fi
    if [ ! -e ${NEW}/${DIR} ]; then
      mkdir ${NEW}/${DIR}
    fi

    nsteps=`echo "$TJUMP/0.002" | bc -l | awk -F '.' '{print $1}'`
    if((NOUT<=NJUMPS_POSRE)); then
      sed "s/TINIT/${TINIT}/" ${MYMOL}_posre.mdp | sed "s/NSTEPS/${nsteps}/" | sed "s/SAVE_FREQUENCY/${SAVE_FREQUENCY}/" | sed "s/CPP/${CPP}/" > ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp
      posreFlag="-r md0_success/${MYMOL}_md0_deshuffleddesorted.gro"
    else
      sed "s/TINIT/${TINIT}/" ${MYMOL}.mdp | sed "s/NSTEPS/${nsteps}/" | sed "s/SAVE_FREQUENCY/${SAVE_FREQUENCY}/" | sed "s/CPP/${CPP}/" > ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp
      posreFlag=""
    fi
    waitForExistNotEmpty 1 ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp
    # Quote the substitution so that an empty result (file not yet visible) still triggers the sleep
    if [ "$(tail -n 1 ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp | awk '{print $1}')" != ";EOF" ]; then sleep ${TINY_SLEEP}; fi
    if [ "$(tail -n 1 ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp | awk '{print $1}')" != ";EOF" ]; then sleep ${SHORT_SLEEP}; fi
    if [ "$(tail -n 1 ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp | awk '{print $1}')" != ";EOF" ]; then sleep ${LONG_SLEEP}; fi

    if((NIN!=0)); then
      if((NOUT<=NJUMPS_POSRE || NEVER_USE_SORT_SHUFFLE==1 || MYNP==1)); then
        # Can not do shuffle/sort with posre
        ${ED}/grompp -np ${MYNP} -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp ${posreFlag} -c ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro -t ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.trr -p ${MYMOL}.top -n ${MYMOL}.ndx -e ${PREV}/${MYMOL}_md${NIN}.edr -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr
        rm mdout.mdp &
      else
        # Shuffle the .trr input file correctly. Assume that it is not currently shuffled
        ${ED}/grompp -np ${MYNP} -shuffle -sort -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp ${posreFlag} -c ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro -p ${MYMOL}.top -n ${MYMOL}.ndx -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}_a.tpr -deshuf ${NEW}/${DIR}/deshuffle_md${NOUT}_a.ndx 
        rm -f ${NEW}/${DIR}/deshuffle_md${NOUT}_a.ndx mdout.mdp &
        echo System | ${ED}/editconf -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}_a.tpr -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit_a.gro
        # g_desort -f original shuffled will unshuffle, therefore g_desort -f shuffled original will REshuffle
        ${DD}/g_desort -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit_a.gro ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro -o ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}_a.ndx -n 6
        ${ED}/trjconv -f ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.trr -o ${PREV}/${MYMOL}_md${NIN}_reshuffleresort.trr -n ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}_a.ndx
        # Create the run input file
        ${ED}/grompp -np ${MYNP} -shuffle -sort -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp ${posreFlag} -c ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro -t ${PREV}/${MYMOL}_md${NIN}_reshuffleresort.trr -p ${MYMOL}.top -n ${MYMOL}.ndx -deshuf ${NEW}/${DIR}/deshuffle_md${NOUT}.ndx -e ${PREV}/${MYMOL}_md${NIN}.edr -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr
        rm -f ${NEW}/${DIR}/deshuffle_md${NOUT}.ndx mdout.mdp &

        # In the future: implement this check. Note that this will require rethinking the _a postfixes
        #                since the files without the _a postfixes are the ones that I actually use
        #if((DOUBLE_CHECK_DESORT!=0)); then

          # If a new reshuffle.ndx file differs then the run is invalid.
          echo System | ${ED}/editconf -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit.gro
          ${DD}/g_desort -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit.gro ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro -o ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}.ndx -n 6
          case "$CLUSTER" in
          *)
            look=`diff -q ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}_a.ndx ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}.ndx`
            ;;
          esac
          if [ -n "$look" ]; then
            echo "There was a big problem. ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}_a.ndx and ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}.ndx are different."
            mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr ${NEW}/${DIR}/${MYMOL}_md${NOUT}_notValid.tpr
          fi

        # End of the new test
        #fi

        #Create the deshuffle file to properly handle the next run
        ${DD}/g_desort -f ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit.gro -o ${NEW}/${DIR}/deshuffledesort${MYMOL}_md${NOUT}.ndx -n 6
        rm -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}_a.tpr ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit_a.gro ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}_a.ndx ${PREV}/${MYMOL}_md${NIN}_reshuffleresort.trr ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit.gro ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}.ndx mdout.mdp &
      fi  
    else
      if((NOUT<=NJUMPS_POSRE || NEVER_USE_SORT_SHUFFLE==1 || MYNP==1)); then
        # Can not do shuffle/sort with posre
        ${ED}/grompp -np ${MYNP} -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp ${posreFlag} -c ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro -p ${MYMOL}.top -n ${MYMOL}.ndx -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr
        rm mdout.mdp &
      else
        ${ED}/grompp -np ${MYNP} -shuffle -sort -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp ${posreFlag} -c ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro -p ${MYMOL}.top -n ${MYMOL}.ndx -deshuf ${NEW}/${DIR}/deshuffle_md${NOUT}.ndx -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr
        rm -f ${NEW}/${DIR}/deshuffle_md${NOUT}.ndx mdout.mdp &
        echo System | ${ED}/editconf -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit.gro
        ${DD}/g_desort -f ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit.gro -o ${NEW}/${DIR}/deshuffledesort${MYMOL}_md${NOUT}.ndx -n 6
        rm -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit.gro &
      fi
    fi

    if [ -e ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr ]; then
      echo ${NOUT} > finished_grompp
      waitForExistNotEmpty 12 finished_grompp ${NOUT}
      rm -f finished_test
      gromppProblemsInARow=0
    else
      # can not revert
      let "gromppProblemsInARow=$gromppProblemsInARow+1"
      if((gromppProblemsInARow>MAX_CONSECUTIVE_GROMPP_ERRORS)); then
        touch ./DO_NOT_RUN
        exit
      fi
    fi

  fi

  # MDRUN
  if [ -e finished_grompp ]; then
    NOUT=`cat finished_grompp`
    TINIT=`cat finished_next_start_time`
    DIR=md${NOUT}_running

    # Reversion is important in cases where a crash or time overrun leads to loss of data in ${TMPDIR}
    if [ ! -e ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr ]; then
      sleep ${SHORT_SLEEP}
      if [ ! -e ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr ]; then
        rm -f finished_grompp
        echo "ERROR error: finished_grompp existed for NOUT=$NOUT but ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr did not exist"
        if((reverted==0)); then
          echo "       reverting to finished_test"
          reverted=1
          let "NIN=$NOUT-1"
          echo "$NIN" > finished_test
          if [ ! -s finished_test ]; then sleep ${SHORT_SLEEP}; fi
          continue
        else
          touch ./DO_NOT_RUN
          exit
        fi
      else
        reverted=0;
      fi
    fi

    # Record the exit status via $?; backticks would capture stdout, not the status
    if((MYNP==1)); then
      ${ED}/mdrun -deffnm ${NEW}/${DIR}/${MYMOL}_md${NOUT}
      returnValue=$?
    else
      case "$CLUSTER" in
      *)
        ${mpirunProg} ${PED}/${mdrun_mpiProg} -np ${MYNP} -deffnm ${NEW}/${DIR}/${MYMOL}_md${NOUT}
        returnValue=$?
        ;;
      esac
    fi
    if((returnValue!=0)); then
      echo "ERROR error: mdrun returned non-zero (${returnValue}). Exiting"
      exit
    fi
    echo ${NOUT} > finished_mdrun
    waitForExistNotEmpty 12 finished_mdrun ${NOUT}
    rm -f finished_grompp
  fi

  # DESORT
  if [ -e finished_mdrun ]; then
    NOUT=`cat finished_mdrun`
    TINIT=`cat finished_next_start_time`
    DIR=md${NOUT}_running

    # Reversion is important in cases where a crash or time overrun leads to loss of data in ${TMPDIR}
    if [ ! -e ${NEW}/${DIR}/${MYMOL}_md${NOUT}.xtc ]; then
      rm -f finished_mdrun
      echo "ERROR error: finished_mdrun existed for NOUT=$NOUT but ${NEW}/${DIR}/${MYMOL}_md${NOUT}.xtc did not exist"
      if((reverted==0)); then
        echo "       reverting to finished_test"
        reverted=1
        let "NIN=$NOUT-1"
        echo "$NIN" > finished_test
        waitForExistNotEmpty 12 finished_test ${NIN}
        continue
      else
        touch ./DO_NOT_RUN
        exit
      fi
    else
      reverted=0;
    fi

    if((NOUT<=NJUMPS_POSRE || NEVER_USE_SORT_SHUFFLE==1 || MYNP==1)); then
      # Can not do shuffle/sort with posre
      mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}.xtc ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.xtc
      mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}.trr ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.trr
      mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}.gro ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.gro
    else
      echo System | ${ED}/trjconv -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.xtc -s ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr -n ${NEW}/${DIR}/deshuffledesort${MYMOL}_md${NOUT}.ndx -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.xtc &
      echo System | ${ED}/trjconv -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.trr -s ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr -n ${NEW}/${DIR}/deshuffledesort${MYMOL}_md${NOUT}.ndx -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.trr &
      echo System | ${ED}/trjconv -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.gro -s ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr -n ${NEW}/${DIR}/deshuffledesort${MYMOL}_md${NOUT}.ndx -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.gro &
      # Wait for all 3 desorts to finish
      wait
    fi

    echo ${NOUT} > finished_desort
    waitForExistNotEmpty 12 finished_desort ${NOUT}
    rm -f finished_mdrun
  fi

  # TEST
  if [ -e finished_desort ]; then
    TINIT=`cat finished_next_start_time`
    runHasNoErrors=1
    NOUT=`cat finished_desort`
    DIR=md${NOUT}_running

    # Reversion is important in cases where a crash or time overrun leads to loss of data in ${TMPDIR}
    if [ ! -e ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.xtc ]; then
      rm -f finished_desort
      echo "ERROR error: finished_desort existed for NOUT=$NOUT but ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.xtc did not exist"
      if((reverted==0)); then
        echo "       reverting to finished_test"
        reverted=1
        let "NIN=$NOUT-1"
        echo "$NIN" > finished_test
        waitForExistNotEmpty 12 finished_test ${NIN}
        continue
      else
        touch ./DO_NOT_RUN
        exit
      fi
    else
      reverted=0;
    fi

    ${ED}/gmxcheck -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.xtc 2> ${NEW}/${DIR}/checkXTC
    # Ensure no magic number error
    magicNumberError=`grep Error ${NEW}/${DIR}/checkXTC | wc -l`
    if((magicNumberError>0)); then
      mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.xtc ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted_magicNumberError.xtc
      mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.trr ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted_magicNumberError.trr
      mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.gro ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted_magicNumberError.gro
      runHasNoErrors=0
    fi
    ## Ensure expected number of frames is reached -- this is system and mdp file specific
    numFramesXTC=`grep "^Step" ${NEW}/${DIR}/checkXTC | awk '{print $2}'`
    if((numFramesXTC!=NFRAMES_IN_XTC)); then
      mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.xtc ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted_incompleteFrames.xtc
      mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.trr ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted_incompleteFrames.trr
      mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.gro ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted_incompleteFrames.gro
      runHasNoErrors=0
    fi
    if((runHasNoErrors)); then
      mv ${NEW}/${DIR} md${NOUT}_success
      waitForExistNotEmpty -1 ${NEW}/${DIR}
      while [ -e ${NEW}/${DIR} ]; do
        mv ${NEW}/${DIR} md${NOUT}_success
        sleep ${LONG_SLEEP}
      done
      
      echo ${NOUT} > finished_test
      nextTime=`echo "${TINIT}+${TJUMP}" | bc -l`
      echo ${nextTime} > finished_next_start_time
      waitForExistNotEmpty 12 finished_test ${NOUT}
      waitForExistNotEmpty 12 finished_next_start_time ${nextTime} 

      # Now do some clean up operations
      let "NCLEAN=$NOUT-2"
      if [ -e md${NCLEAN}_success ]; then
        if [ ! -e DATA ]; then
          mkdir DATA
        fi
        if((NCLEAN!=0)); then
          # Don't remove or modify the starting directory
          mkdir DATA/md${NCLEAN}_success
          mv md${NCLEAN}_success/${MYMOL}_md${NCLEAN}_deshuffleddesorted.xtc DATA/md${NCLEAN}_success
          waitForExistNotEmpty -1 md${NCLEAN}_success/${MYMOL}_md${NCLEAN}_deshuffleddesorted.xtc
          while [ -e md${NCLEAN}_success/${MYMOL}_md${NCLEAN}_deshuffleddesorted.xtc ]; do
            mv md${NCLEAN}_success/${MYMOL}_md${NCLEAN}_deshuffleddesorted.xtc DATA/md${NCLEAN}_success
            sleep ${LONG_SLEEP}
          done
          mv md${NCLEAN}_success/${MYMOL}_md${NCLEAN}.edr DATA/md${NCLEAN}_success
          waitForExistNotEmpty -1 md${NCLEAN}_success/${MYMOL}_md${NCLEAN}.edr
          while [ -e md${NCLEAN}_success/${MYMOL}_md${NCLEAN}.edr ]; do
            mv md${NCLEAN}_success/${MYMOL}_md${NCLEAN}.edr DATA/md${NCLEAN}_success
            sleep ${LONG_SLEEP}
          done
          rm -rf md${NCLEAN}_success &
        fi
      fi
    else
      # Send the run back to do the mdrun
      for((i=1;i<=MAX_CONSECUTIVE_MDRUN_ERRORS;i++)); do
        if [ ! -e md${NOUT}_failure${i} ]; then
          break
        fi
      done
      mv ${NEW}/${DIR} md${NOUT}_failure${i}
      if((i>MAX_CONSECUTIVE_MDRUN_ERRORS)); then
        echo "Too many failures for run $NOUT"
        touch ./DO_NOT_RUN
        exit
      fi
      # Send it back by setting as if grompp just finished
      mkdir ${NEW}/${DIR} 
      mv md${NOUT}_failure${i}/${MYMOL}_md${NOUT}.tpr md${NOUT}_failure${i}/deshuffledesort${MYMOL}_md${NOUT}.ndx ${NEW}/${DIR}
      echo ${NOUT} > finished_grompp
      waitForExistNotEmpty 12 finished_grompp ${NOUT}
    fi
    rm -f finished_desort
  fi

  wait
done 

wait


echo -n "TIMING TEST (end): "
date
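
The placeholder substitution that the GROMPP stage of sub.sh applies to the .mdp template can be previewed outside the queue, which helps catch a template that is missing one of the required flags. The sketch below is an approximation of that step, not a copy of it: it uses "|" as the sed delimiter so that a CPP value containing "/" still works, whereas sub.sh uses "/". The function name render_mdp is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: fill the .mdp placeholders in the same order as sub.sh.
# Like sub.sh, each expression replaces only the first match on a line.
# usage: render_mdp TEMPLATE TINIT NSTEPS SAVE_FREQUENCY CPP > OUTPUT
render_mdp() {
  local template="$1" tinit="$2" nsteps="$3" save_freq="$4" cpp="$5"
  sed -e "s|TINIT|${tinit}|" \
      -e "s|NSTEPS|${nsteps}|" \
      -e "s|SAVE_FREQUENCY|${save_freq}|" \
      -e "s|CPP|${cpp}|" "$template"
}
```

For example, "render_mdp pagagg.mdp 0 250000 5000 cpp | grep -E 'TINIT|NSTEPS|SAVE_FREQUENCY|CPP'" should print nothing if every placeholder was filled.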
