Gromacs

Checkpointing Jobs

    Starting with version 4.0, GROMACS has built-in, automated checkpointing; see, for instance, Doing Restarts.

    Here is a set of generic scripts for checkpointing GROMACS 3.x jobs and automatically resubmitting them.

    Disclaimer

    While the author uses these scripts and considers them to be high-quality, the author makes no warranty about their accuracy. Further, the scripts that appear below may have been modified from their original content since this is a wiki site. Use these scripts at your own risk.

     

    Estimated Time Investment

    If you're a bash scripting pro, it shouldn't take too long. Otherwise, plan on spending a day dedicated to getting this up and running. And please, read this entire posting.

     

    Overall Description

    1. The head.sh script is run from the command line as an interactive job. This script submits a number of jobs that are chained together such that the (N+1)th job will start once the Nth job has completed. head.sh submits a chain of sub.$CLUSTER.sh jobs, where sub.$CLUSTER.sh is just a wrapper for sub.sh. sub.sh is the real meat of the script and does not need to be customized for each job.

    2. A section at the top of the head.sh file must be customized for each run. However, the sub.sh and sub.$CLUSTER.sh scripts do not require any job-specific modification.
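
The chaining idea in point 1 can be sketched in a few lines. This is only an illustration: fake_submit is a hypothetical stand-in for the real queue command, and the job ids it hands out are made up.

```shell
#!/bin/bash
# Sketch of job chaining. fake_submit is a hypothetical stand-in for the
# real submission command; it only hands out increasing job ids.
next_id=100
fake_submit() {
  # Usage: fake_submit [-w previous_id]; stores the new job id in last_id.
  next_id=$((next_id + 1))
  last_id=$next_id
}

chain=""
prev=""
for ((i = 0; i < 3; i++)); do
  if [ -n "$prev" ]; then
    fake_submit -w "$prev"          # job N+1 waits for job N
    chain="$chain $prev->$last_id"
  else
    fake_submit                     # the first job has nothing to wait for
    chain="$last_id"
  fi
  prev=$last_id
done
echo "$chain"
```

Each entry after the first reads "waited-for job -> new job", which is the ordering that head.sh records through the wait_for_this_pid file.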

     

    Why is it so Complicated?

    As the author encountered problems on different clusters, the script was augmented to deal with these issues automatically.

    1. The script handles NFS delay problems gracefully

    2. The script recovers from crashes gracefully

    3. The script checks your .xtc files to ensure that they are complete and free of corruption

    4. The script keeps only the most important files from your run (.edr, .xtc, starting .gro) in order to reduce hard disk usage.
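
As a rough illustration of point 1, the following sketch polls for a file with increasing sleeps. It is a simplified, hypothetical helper (wait_until_nonempty is not part of the scripts) and is much less thorough than the waitForExistNotEmpty function in sub.sh.

```shell
#!/bin/bash
# Hypothetical helper: poll until a file exists and is non-empty,
# sleeping a little longer after each failed attempt.
wait_until_nonempty() {
  local file="$1" max_tries="${2:-5}" delay
  for ((delay = 1; delay <= max_tries; delay++)); do
    if [ -s "$file" ]; then
      return 0
    fi
    sleep "$delay"
  done
  [ -s "$file" ]   # final check; its status becomes the function's status
}

# Usage: write a marker file, then confirm it is visible before moving on.
marker=$(mktemp)
echo 7 > "$marker"
if wait_until_nonempty "$marker" 3; then
  status="visible"
else
  status="missing"
fi
echo "$status"
```

On a file system with delayed visibility, the retries give the write time to propagate before the next step reads the file.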

     

    Programs that you will need

    You will require the G_DESORT package, available on the users submission page, if you want to use sorting. Currently, sorting is combined with shuffling and there is no option to do shuffling without sorting (but it would be a simple script modification). If you do not want to shuffle or sort, set NEVER_USE_SORT_SHUFFLE to 1 in head.sh (sub.sh tests for the value 1 specifically). Note also that the DD variable, which is set in sub.sh, specifies the directory in which to find g_desort. A further note about g_desort: the author uses it, but it is not supported by the developers, so use it at your own risk.

     

    Cluster-specific modification of these scripts

    The user is expected to customize these scripts for their queuing system. In order to set up your environment, find all cases of the string:

    case "$CLUSTER" in
    

    and add a section for the name that you will give your cluster that contains the appropriate settings. Let's say, for example, that you use sqsub instead of qsub on your cluster named bigCluster. You would need to modify the following section of head.sh so that the original:

    case "$CLUSTER" in
    *)
      subCommand="qsub"
      ;;
    esac
    

    becomes:

    case "$CLUSTER" in
    bigCluster)
      subCommand="sqsub"
      ;;
    *)
      subCommand="qsub"
      ;;
    esac
    

    And then at the top of head.sh you would define:

    CLUSTER=bigCluster
    

    Once you have made these changes, you will not need to make many modifications in order to start a new run.
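
One way to locate every cluster switch is a fixed-string grep. The helper name below is illustrative, and the demonstration runs on a throwaway file; in practice you would pass head.sh and sub.sh.

```shell
#!/bin/bash
# List every cluster-specific switch in the given scripts.
# -F treats the pattern as a fixed string, so "$CLUSTER" is matched literally.
list_cluster_switches() {
  grep -n -F 'case "$CLUSTER" in' "$@"
}

# Demonstration on a throwaway file; in practice pass head.sh and sub.sh.
demo=$(mktemp)
printf '%s\n' 'case "$CLUSTER" in' '*)' '  subCommand="qsub"' '  ;;' 'esac' > "$demo"
matches=$(list_cluster_switches "$demo")
echo "$matches"
rm -f "$demo"
```

The line numbers in the output show where a new cluster entry would have to be added above the catch-all *) branch.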

     

    How the iteration process occurs

    1. User creates two files.

    echo 0 > finished_test
    echo 0 > finished_next_start_time
    

    Here the finished_test file indicates that the 0th iteration has completed, and the finished_next_start_time file indicates that the next round will start at 0 ps.

    2. User creates a directory called md0_success and puts their starting .gro file there. The naming of the .gro file is important. Let's say your MYMOL (as defined at the top of head.sh) was:

    MYMOL=proteinInWater
    

    Then your md0_success directory must contain a file named proteinInWater_md0_deshuffleddesorted.gro

    $ ls md0_success
    proteinInWater_md0_deshuffleddesorted.gro
    

    The 'deshuffleddesorted' part of the name is required even if you do not use sorting or shuffling. It's just built into the script because the script has optional sorting and shuffling.

    3. Create your .top and .ndx files and name them as $MYMOL.top and $MYMOL.ndx

    4. Create your .mdp file and name it as $MYMOL.mdp

    4.a) The last line of your .mdp file must be ;EOF

    4.b) Your .mdp file must contain placeholder flags that sub.sh replaces using sed, based on your options in head.sh. Make sure your .mdp file contains lines that look exactly like this:

    cpp                 =  CPP
    nsteps              =  NSTEPS
    tinit               =  TINIT
    nstxout             =  NSTEPS
    nstvout             =  NSTEPS
    nstfout             =  NSTEPS
    nstxtcout           =  SAVE_FREQUENCY
    

    5. cp ${MYMOL}.mdp ${MYMOL}_posre.mdp and then modify the new file, which is used for any segment N less than or equal to $NJUMPS_POSRE as defined in head.sh. Personally, the author sets the define for position restraints on the protein for the first segment and also sets gen_vel and unconstrained_start differently in the two .mdp files so that velocities are generated on the first segment and a proper restart is done on the other segments.

    Note: This script never uses tpbconv because it was designed for re-sorting. However, it should be a simple matter to modify it if you want to use tpbconv instead.

    6. chmod +x on all the scripts
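
Steps 1, 2 and 4 can be collected into a small sketch. The function names prepare_run_dir and check_mdp are illustrative and are not part of head.sh or sub.sh, and the demonstration uses a scratch directory with dummy file contents.

```shell
#!/bin/bash
# Create the two state files and the md0_success directory for a new run.
prepare_run_dir() {
  local mymol="$1" start_gro="$2"
  echo 0 > finished_test
  echo 0 > finished_next_start_time
  mkdir -p md0_success
  cp "$start_gro" "md0_success/${mymol}_md0_deshuffleddesorted.gro"
}

# Check that an .mdp template ends with ;EOF and contains every placeholder.
check_mdp() {
  local mdp="$1" flag
  [ "$(tail -n 1 "$mdp" | awk '{print $1}')" = ";EOF" ] || return 1
  for flag in CPP NSTEPS TINIT SAVE_FREQUENCY; do
    grep -q "=[[:space:]]*${flag}[[:space:]]*\$" "$mdp" || return 1
  done
}

# Demonstration in a scratch directory (remove it when done).
demo_dir=$(mktemp -d)
(
  cd "$demo_dir" || exit 1
  printf 'demo\n' > start.gro   # placeholder content, not a real structure
  printf '%s\n' 'cpp = CPP' 'nsteps = NSTEPS' 'tinit = TINIT' \
    'nstxtcout = SAVE_FREQUENCY' ';EOF' > proteinInWater.mdp
  prepare_run_dir proteinInWater start.gro
  check_mdp proteinInWater.mdp && echo "mdp template looks usable"
)
```

Running check_mdp on both ${MYMOL}.mdp and ${MYMOL}_posre.mdp before submitting catches a missing ;EOF line or placeholder early.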

     

    Location of the Scripts

    A) head.sh goes in your working directory along with .mdp .ndx .top files and the md0_success directory that contains the .gro file.

    B) sub.sh and sub.$CLUSTER.sh files go in /home/`whoami`/scripts/submission/ since many different jobs should all use the same sub.sh script. If you want to put the sub.sh scripts somewhere else, you will need to modify the bottom of head.sh where it submits the sub.sh job, as well as the path to sub.sh inside sub.$CLUSTER.sh.
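
Installing the shared scripts could look like the sketch below. install_submission_scripts is a hypothetical helper; its default destination is the path used on this page, and the demonstration copies dummy files into a scratch directory instead.

```shell
#!/bin/bash
# Hypothetical helper: copy the shared scripts from the current directory
# to a submission directory and make them executable.
install_submission_scripts() {
  local cluster="$1" dest="${2:-/home/$(whoami)/scripts/submission}"
  mkdir -p "$dest" || return 1
  cp sub.sh "sub.${cluster}.sh" "$dest"/ || return 1
  chmod +x "$dest/sub.sh" "$dest/sub.${cluster}.sh"
}

# Demonstration with dummy scripts and a scratch destination.
src=$(mktemp -d)
dest=$(mktemp -d)
printf '#!/bin/bash\n' > "$src/sub.sh"
printf '#!/bin/bash\n' > "$src/sub.cluster1.sh"
( cd "$src" && install_submission_scripts cluster1 "$dest" )
ls "$dest"
```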

     

    Running the Script

    cd to your working directory and:

    nohup ./head.sh &
    

     

    Files Created During Runtime

    tmp.sub: the output from your submission that is used to capture job information

    wait_for_this_pid: the pid of the last job in the chain

    nohup.out: a list of submitted jobs with pids in the correct chained order

    DATA/ directory --> This stores your .xtc and your .edr files from N-3 and older where N is the currently running job that has not finished. Note that N-1 and N-2 directories contain all files (.log, .trr, .tpr, etc) but most of these files are deleted when archiving into the DATA directory. If you want to keep more files then you should modify the source code of sub.sh in the 'finished_test' section.

     

    Submitting more jobs to a chain that is already running

    Just run head.sh again. It is for this reason that you shouldn't delete the wait_for_this_pid file at any time. Note that if you do delete that file then you can use nohup.out to determine the pid of the last job in the chain and put that into wait_for_this_pid.
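
Recovering a deleted wait_for_this_pid file can be sketched as follows. It assumes that the relevant nohup.out lines end with "-- received pid = <id>", as echoed by head.sh; restore_wait_file is an illustrative name, and the log lines in the demonstration are abbreviated and made up.

```shell
#!/bin/bash
# Rebuild wait_for_this_pid from the last "received pid" line of a log.
restore_wait_file() {
  local log="${1:-nohup.out}" last_id
  last_id=$(grep 'received pid = ' "$log" | tail -n 1 | awk '{print $NF}')
  if [ -z "$last_id" ]; then
    echo "No job id found in $log" >&2
    return 1
  fi
  echo "$last_id" > wait_for_this_pid
}

# Demonstration with a made-up, abbreviated log in a scratch directory.
demo_dir=$(mktemp -d)
printf '%s\n' \
  'Submitting qsub ... -- received pid = 1001' \
  'Submitting qsub -w 1001 ... -- received pid = 1002' > "$demo_dir/nohup.out"
( cd "$demo_dir" && restore_wait_file nohup.out )
cat "$demo_dir/wait_for_this_pid"
```

It is worth checking the recovered value against the queue before resubmitting, since a failed submission could have left something other than a job id on that line.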

     

    The head.sh script

    $ cat head.sh
    #!/bin/bash
    # Automated submission script
    # Chris Neale November 2007
    
    PATH=$PATH:.
    
    ##########################################################
    # Basic setup options:
    MYMOL=pagagg         # starting name for your .gro, .mdp, .top (etc) files
    REPLICA=1            # Just a flag to differentiate runs if you are running repeat identical simulations at once.
    MD=/my/directory     # working directory
    CLUSTER=cluster1     # a tag used to define cluster specific characteristics for ease of porting between clusters
    NUM_TO_SUBMIT=1      # number of chained jobs to submit
    MYNP=4               # number of CPUs to occupy
    TJUMP=500            # Time (in ps) for a single step
    NJUMPS=1000          # Number of steps before dying and allowing a new chained job to start
    NFRAMES_IN_XTC=51    # Number of frames that you expect in the .xtc file = TJUMP/(SAVE_FREQUENCY*dt) + 1
    SAVE_FREQUENCY=5000  # Save the .xtc every this number of timesteps
    NJUMPS_POSRE=1       # The number of initial steps that will be done by position restraints
    
    
    ##########################################################
    # More advanced setup options:
    
    RUN_AS_A_TEST=0               # use a test queue
    NEVER_USE_SORT_SHUFFLE=0      # set to 1 to never sort or shuffle (sub.sh tests for ==1)
    RUNTIME_LIMIT=1w              # flag to apply a runtime limit to your job
    APPLY_RUNTIME_LIMIT=0         # set to 1 to actually apply a runtime limit
    DOUBLE_CHECK_DESORT=0         # currently not implemented
    
    ##########################################################
    # Things below this line do not usually need to be changed
    
    cd ${MD}
    
    if((RUN_AS_A_TEST)); then
      # Override the previously setup options
      MYNP=2
      TJUMP=0.2
      NJUMPS=1
      NFRAMES_IN_XTC=2
      SAVE_FREQUENCY=100
      NJUMPS_POSRE=0
      testFlag="--test"
    else
      testFlag=""
    fi
    
    case "$CLUSTER" in
    *)
      if((APPLY_RUNTIME_LIMIT==1)); then
        timeFlag="-r $RUNTIME_LIMIT"
      else
        timeFlag=""
      fi
      ;;
    esac
    
    case "$CLUSTER" in
    *)
      subCommand="qsub"
      ;;
    esac
    
    if((MYNP>1)); then
      case "$CLUSTER" in
      *)
        queueIdent="-q mpi --nompirun"
        ;;
      esac
    else
      case "$CLUSTER" in
      *)
        queueIdent="";;
      esac
    fi
    
    for((i=0;i<NUM_TO_SUBMIT; i++)); do
    
      if [ -e DO_NOT_RUN ]; then
        echo "ERROR: file DO_NOT_RUN exists... exiting"
        exit
      fi
      
      if [ -e wait_for_this_pid ]; then
        pid=`cat wait_for_this_pid | awk '{print $1}'`
        case "$CLUSTER" in
        *)
          waitFlag="-w $pid"
          ;;
        esac
      else
        waitFlag=""
      fi
    
      case "$CLUSTER" in
      *)
        nameFlag="-N ${MYMOL}_${REPLICA}"
        mynpFlag="-n ${MYNP}"
        subName=".${CLUSTER}"
        ;;
      esac
    
      ${subCommand} ${testFlag} ${waitFlag} ${timeFlag} ${queueIdent} ${nameFlag} ${mynpFlag} -o `pwd`/out.${i} -e `pwd`/err.${i} /home/`whoami`/scripts/submission/sub${subName}.sh ${MD} ${MYNP} ${MYMOL} ${TJUMP} ${NJUMPS} ${NFRAMES_IN_XTC} ${SAVE_FREQUENCY} ${NJUMPS_POSRE} ${NEVER_USE_SORT_SHUFFLE} ${CLUSTER} ${DOUBLE_CHECK_DESORT} > tmp.sub 2>&1
      case "$CLUSTER" in
      *)
        pid=`tail -n 1 tmp.sub | awk '{print $4}'`
        ;;
      esac
      
      echo "Submitting ${subCommand} ${testFlag} ${waitFlag} ${timeFlag} ${queueIdent} ${nameFlag} ${mynpFlag} -o `pwd`/out.${i} -e `pwd`/err.${i} /home/`whoami`/scripts/submission/sub${subName}.sh ${MD} ${MYNP} ${MYMOL} ${TJUMP} ${NJUMPS} ${NFRAMES_IN_XTC} ${SAVE_FREQUENCY} ${NJUMPS_POSRE} ${NEVER_USE_SORT_SHUFFLE} ${CLUSTER} ${DOUBLE_CHECK_DESORT} -- received pid = $pid"
      echo "$pid" > wait_for_this_pid
      sleep 1
    
    done
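
The relation between TJUMP, SAVE_FREQUENCY and NFRAMES_IN_XTC given in the setup comments can be checked with a short calculation. This is a sketch under one assumption: the timestep dt is 0.002 ps, the value that sub.sh hard-codes when it computes nsteps. The function name segment_numbers is illustrative.

```shell
#!/bin/bash
# Derive nsteps and the expected .xtc frame count from the head.sh settings.
# dt defaults to 0.002 ps (an assumption matching the value used in sub.sh).
segment_numbers() {
  local tjump="$1" save_frequency="$2" dt="${3:-0.002}"
  awk -v t="$tjump" -v s="$save_frequency" -v dt="$dt" 'BEGIN {
    nsteps  = int(t / dt + 0.5)            # MD steps per segment, rounded
    nframes = int(t / (s * dt) + 0.5) + 1  # +1 as in the NFRAMES_IN_XTC comment
    print nsteps, nframes
  }'
}

result=$(segment_numbers 500 5000)
echo "$result"   # prints: 250000 51
```

With the RUN_AS_A_TEST overrides (TJUMP=0.2, SAVE_FREQUENCY=100) the same function gives 100 steps and 2 frames, matching the NFRAMES_IN_XTC=2 set in that branch. If your .mdp file uses a different dt, both the frame count and the nsteps line in sub.sh need to be adjusted.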
    
    

     

    The sub.$CLUSTER.sh script

    $ cat sub.cluster1.sh 
    #!/bin/bash
    
    # This is a wrapper script so that you can define PATH, LD_LIBRARY_PATH
    # A wrapper is required if you need to define via "#$ -v PATH=", etc...
    
    /home/`whoami`/scripts/submission/sub.sh ${1} ${2} ${3} ${4} ${5} ${6} ${7} ${8} ${9} ${10} ${11}
    
    # If you have severe NFS delay, you may need to include this next line
    #sleep 60
    

     

    The sub.sh script

    $ cat sub.sh 
    #!/bin/bash
    # Automated submission script
    # Chris Neale November 2007
    
    echo -n "TIMING TEST (start): "
    date
    MD="$1"
    MYNP="$2"
    MYMOL="$3"
    TJUMP="$4"
    NJUMPS="$5"
    NFRAMES_IN_XTC="$6"
    SAVE_FREQUENCY="$7"
    NJUMPS_POSRE="$8"
    NEVER_USE_SORT_SHUFFLE="$9"
    CLUSTER="${10}"
    DOUBLE_CHECK_DESORT="${11}"
    
    case "$CLUSTER" in
    *)
      ED=/tools/gromacs-3.3.1/exec/bin
      PED=${ED}
      mpiLocation="/tools/openmpi/1.2.1"
      mpirunProg="${mpiLocation}/bin/mpirun"
      mdrun_mpiProg="mdrun_openmpi_v1.2.1"
    
      ###############################################
      # For LAM mpi                                 #
      # mpiLocation="/tools/lam/lam-7.1.2"          #
      # mpirunProg="${mpiLocation}/bin/mpirun C"    #
      # mdrun_mpiProg="mdrun_mpi"                   #
      ###############################################
    
      ;;
    esac
    
    case "$CLUSTER" in
    *)
      DD=/home/`whoami`/gromacs/template
      ;;
    esac
    
    # The NEW variable allows use of a special writing location (e.g ${TMPDIR} or /scratch/`whoami`)
    case "$CLUSTER" in
    *)
      NEW="."
      ;;
    esac
    
    case "$CLUSTER" in
    *)
      CPP="cpp"
      ;;
    esac
    
    PATH=$PATH:.
    
    cd ${MD}
    
    TINY_SLEEP=1
    SHORT_SLEEP=10
    LONG_SLEEP=60
    EXTENDED_SLEEP=300
    
    ###############################################
    # Startup tests:
    
    if [ -e DO_NOT_RUN ]; then
      echo "ERROR error: file DO_NOT_RUN exists... exiting"
      exit
    fi
    
    for((v=0;v<2;v++)); do
      num=0;
      if [ -e finished_grompp ]; then
        let "num=$num+1"
      fi
      if [ -e finished_mdrun ]; then
        let "num=$num+1"
      fi
      if [ -e finished_desort ]; then
        let "num=$num+1"
      fi
      if [ -e finished_test ]; then
        let "num=$num+1"
      fi
      
      if((num==0)); then
        echo "Unsure how to start the run. Check this out"
        echo "\$ ls -l finished_grompp finished_mdrun finished_desort finished_test"
        ls -l finished_grompp finished_mdrun finished_desort finished_test
        echo "Perhaps you forgot to set a finished_XXX file upon starting your run?"
        echo "  - Otherwise there seems to be an error in the script."
      else 
        if((num!=1)); then
          echo "Unsure how to start the run. Check this out"
          echo "\$ ls -l finished_grompp finished_mdrun finished_desort finished_test"
          ls -l finished_grompp finished_mdrun finished_desort finished_test
          echo "Only one of these files should have existed."
        fi
      fi
      if((num==1)); then
        break; 
      fi
      echo "Will sleep then try one more time"
      sleep ${LONG_SLEEP}
    done
    
    if((num!=1)); then
      echo "Could not resolve multiple x for finished_x problem. Exiting"
      touch ./DO_NOT_RUN
      exit
    fi
    
    ###############################################
    # Initializations:
    
    gromppProblemsInARow=0
    reverted=0
    MAX_CONSECUTIVE_GROMPP_ERRORS=2
    MAX_CONSECUTIVE_MDRUN_ERRORS=2
    
    function waitForExistNotEmpty {
      # First arg controls usage: 
      #   0  for existence test
      #   1  for not empty test 
      #  -1  for must not exist test
      #   12 for not empty test plus require single value to equal third arg
      # Second arg is name of file/directory
      # Third arg is the expected single value in the file if First arg is = 12 
      #
      # Note: This overly complicated procedure is required for proper usage of a
      #       cluster where NFS delay can be significant and simple -s tests
      #       routinely fail to detect the fact that the file is empty as far as 
      #       val=`cat file` is concerned
      notEmpty="$1"
      case "$notEmpty" in
      1*)
        eneFlag="-s $2"
        ;;
      -1)
        eneFlag="! -e $2"
        ;;
      0)
        eneFlag="-e $2"
        ;;
      *)
        echo "ERROR error: incorrect argument to waitForExistNotEmpty = $notEmpty"
        exit
        ;;
      esac
    
      for((length=1;length<100;length++)); do
        if [ ${eneFlag} ]; then 
          break
        fi
        sleep ${length}
      done
      if((length==100)); then
        echo "ERROR error: Have slept for about 80 minutes while waiting for the file $2 to meet conditions [ ${eneFlag} ] ... is there a problem in your script or is the NFS delay very very large?"
      fi
    
      # for all uses other than First Arg = 12 this function is over
      if((notEmpty!=12)); then
        return
      fi
      expectedVal="$3"
      for((length=1;length<100;length++)); do
        currentVal=`cat $2`
        if [ -n "$currentVal" ]; then
          # make sure variable is non-empty before making the comparison
          case "$currentVal" in
          $expectedVal)
            echo "NOTE: breaking from loop since currentVal($currentVal)=expectedVal($expectedVal)"
            break
            ;;
          esac
        fi
        sleep ${length}
      done
      if((length==100)); then
        echo "ERROR error: Have slept for about 80 minutes while waiting for the file $2 to contain the expected value (found '$currentVal', expected '$expectedVal') ... is there a problem in your script or is the NFS delay very very large?"
      fi
    }
    
    ###############################################
    # The main loop:
    
    for ((njump=0;njump<NJUMPS;njump++)); do
    
      # GROMPP (WITH SORTING)
      if [ -e finished_test ]; then
        NIN=`cat finished_test`
        TINIT=`cat finished_next_start_time`
        let "NOUT=$NIN+1"
        DIR=md${NOUT}_running
        PREV=md${NIN}_success
        if [ ! -e ${PREV} ]; then
          echo "There was some problem. Expected ${PREV} to exist, but it does not"
          touch ./DO_NOT_RUN
          exit
        fi
        if [ ! -e ${NEW}/${DIR} ]; then
          mkdir ${NEW}/${DIR}
        fi
    
        nsteps=`echo "$TJUMP/0.002" | bc -l | awk -F '.' '{print $1}'`
        if((NOUT<=NJUMPS_POSRE)); then
          sed "s/TINIT/${TINIT}/" ${MYMOL}_posre.mdp | sed "s/NSTEPS/${nsteps}/" | sed "s/SAVE_FREQUENCY/${SAVE_FREQUENCY}/" | sed "s/CPP/${CPP}/" > ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp
          posreFlag="-r md0_success/${MYMOL}_md0_deshuffleddesorted.gro"
        else
          sed "s/TINIT/${TINIT}/" ${MYMOL}.mdp | sed "s/NSTEPS/${nsteps}/" | sed "s/SAVE_FREQUENCY/${SAVE_FREQUENCY}/" | sed "s/CPP/${CPP}/" > ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp
          posreFlag=""
        fi
        waitForExistNotEmpty 1 ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp
        if [ ! `tail -1 ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp | awk '{print $1}'` = ";EOF" ]; then sleep ${TINY_SLEEP}; fi
        if [ ! `tail -1 ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp | awk '{print $1}'` = ";EOF" ]; then sleep ${SHORT_SLEEP}; fi
        if [ ! `tail -1 ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp | awk '{print $1}'` = ";EOF" ]; then sleep ${LONG_SLEEP}; fi
    
        if((NIN!=0)); then
          if((NOUT<=NJUMPS_POSRE || NEVER_USE_SORT_SHUFFLE==1 || MYNP==1)); then
            # Can not do shuffle/sort with posre
            ${ED}/grompp -np ${MYNP} -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp ${posreFlag} -c ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro -t ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.trr -p ${MYMOL}.top -n ${MYMOL}.ndx -e ${PREV}/${MYMOL}_md${NIN}.edr -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr
            rm mdout.mdp &
          else
            # Shuffle the .trr input file correctly. Assume that it is not currently shuffled
            ${ED}/grompp -np ${MYNP} -shuffle -sort -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp ${posreFlag} -c ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro -p ${MYMOL}.top -n ${MYMOL}.ndx -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}_a.tpr -deshuf ${NEW}/${DIR}/deshuffle_md${NOUT}_a.ndx 
            rm -f ${NEW}/${DIR}/deshuffle_md${NOUT}_a.ndx mdout.mdp &
            echo System | ${ED}/editconf -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}_a.tpr -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit_a.gro
            # g_desort -f original shuffled will unshuffle, therefore g_desort -f shuffled original will REshuffle
            ${DD}/g_desort -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit_a.gro ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro -o ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}_a.ndx -n 6
            ${ED}/trjconv -f ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.trr -o ${PREV}/${MYMOL}_md${NIN}_reshuffleresort.trr -n ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}_a.ndx
            # Create the run input file
            ${ED}/grompp -np ${MYNP} -shuffle -sort -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp ${posreFlag} -c ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro -t ${PREV}/${MYMOL}_md${NIN}_reshuffleresort.trr -p ${MYMOL}.top -n ${MYMOL}.ndx -deshuf ${NEW}/${DIR}/deshuffle_md${NOUT}.ndx -e ${PREV}/${MYMOL}_md${NIN}.edr -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr
            rm -f ${NEW}/${DIR}/deshuffle_md${NOUT}.ndx mdout.mdp &
    
            # In the future: implement this check. Note that this will require rethinking the _a postfixes
            #                since the files without the _a postfixes are the ones that I actually use
            #if((DOUBLE_CHECK_DESORT!=0)); then
    
              # If a new reshuffle.ndx file differs then the run is invalid.
              echo System | ${ED}/editconf -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit.gro
              ${DD}/g_desort -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit.gro ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro -o ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}.ndx -n 6
              case "$CLUSTER" in
              *)
                look=`diff -q ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}_a.ndx ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}.ndx`
                ;;
              esac
              if [ -n "$look" ]; then
                echo "There was a big problem. ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}_a.ndx and ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}.ndx are different."
                mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr ${NEW}/${DIR}/${MYMOL}_md${NOUT}_notValid.tpr
              fi
    
            # End of the new test
            #fi
    
            #Create the deshuffle file to properly handle the next run
            ${DD}/g_desort -f ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit.gro -o ${NEW}/${DIR}/deshuffledesort${MYMOL}_md${NOUT}.ndx -n 6
            rm -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}_a.tpr ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit_a.gro ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}_a.ndx ${PREV}/${MYMOL}_md${NIN}_reshuffleresort.trr ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit.gro ${NEW}/${DIR}/reshuffleresort${MYMOL}_md${NOUT}.ndx mdout.mdp &
          fi  
        else
          if((NOUT<=NJUMPS_POSRE || NEVER_USE_SORT_SHUFFLE==1 || MYNP==1)); then
            # Can not do shuffle/sort with posre
            ${ED}/grompp -np ${MYNP} -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp ${posreFlag} -c ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro -p ${MYMOL}.top -n ${MYMOL}.ndx -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr
            rm mdout.mdp &
          else
            ${ED}/grompp -np ${MYNP} -shuffle -sort -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.mdp ${posreFlag} -c ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro -p ${MYMOL}.top -n ${MYMOL}.ndx -deshuf ${NEW}/${DIR}/deshuffle_md${NOUT}.ndx -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr
            rm -f ${NEW}/${DIR}/deshuffle_md${NOUT}.ndx mdout.mdp &
            echo System | ${ED}/editconf -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit.gro
            ${DD}/g_desort -f ${PREV}/${MYMOL}_md${NIN}_deshuffleddesorted.gro ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit.gro -o ${NEW}/${DIR}/deshuffledesort${MYMOL}_md${NOUT}.ndx -n 6
            rm -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}_shuffledsortedInit.gro &
          fi
        fi
    
        if [ -e ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr ]; then
          echo ${NOUT} > finished_grompp
          waitForExistNotEmpty 12 finished_grompp ${NOUT}
          rm -f finished_test
          gromppProblemsInARow=0
        else
          # can not revert
          let "gromppProblemsInARow=$gromppProblemsInARow+1"
          if((gromppProblemsInARow>MAX_CONSECUTIVE_GROMPP_ERRORS)); then
            touch ./DO_NOT_RUN
            exit
          fi
        fi
    
      fi
    
      # MDRUN
      if [ -e finished_grompp ]; then
        NOUT=`cat finished_grompp`
        TINIT=`cat finished_next_start_time`
        DIR=md${NOUT}_running
    
        # Reversion is important in cases where a crash or time overrun leads to loss of data in ${TMPDIR}
        if [ ! -e ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr ]; then
          sleep ${SHORT_SLEEP}
          if [ ! -e ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr ]; then
            rm -f finished_grompp
            echo "ERROR error: finished_grompp existed for NOUT=$NOUT but ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr did not exist"
            if((reverted==0)); then
              echo "       reverting to finished_test"
              reverted=1
              let "NIN=$NOUT-1"
              echo "$NIN" > finished_test
              if [ ! -s finished_test ]; then sleep ${SHORT_SLEEP}; fi
              continue
            else
              touch ./DO_NOT_RUN
              exit
            fi
          else
            reverted=0;
          fi
        fi
    
        # Record the exit status ($?) of mdrun rather than its standard output
        if((MYNP==1)); then
          ${ED}/mdrun -deffnm ${NEW}/${DIR}/${MYMOL}_md${NOUT}
          returnValue=$?
        else
          case "$CLUSTER" in
          *)
            ${mpirunProg} ${PED}/${mdrun_mpiProg} -np ${MYNP} -deffnm ${NEW}/${DIR}/${MYMOL}_md${NOUT}
            returnValue=$?
            ;;
          esac
        fi
        if((returnValue!=0)); then
          echo "ERROR error: mdrun returned non-zero (${returnValue}). Exiting"
          exit
        fi
        echo ${NOUT} > finished_mdrun
        waitForExistNotEmpty 12 finished_mdrun ${NOUT}
        rm -f finished_grompp
      fi
    
      # DESORT
      if [ -e finished_mdrun ]; then
        NOUT=`cat finished_mdrun`
        TINIT=`cat finished_next_start_time`
        DIR=md${NOUT}_running
    
        # Reversion is important in cases where a crash or time overrun leads to loss of data in ${TMPDIR}
        if [ ! -e ${NEW}/${DIR}/${MYMOL}_md${NOUT}.xtc ]; then
          rm -f finished_mdrun
          echo "ERROR error: finished_mdrun existed for NOUT=$NOUT but ${NEW}/${DIR}/${MYMOL}_md${NOUT}.xtc did not exist"
          if((reverted==0)); then
            echo "       reverting to finished_test"
            reverted=1
            let "NIN=$NOUT-1"
            echo "$NIN" > finished_test
            waitForExistNotEmpty 12 finished_test ${NIN}
            continue
          else
            touch ./DO_NOT_RUN
            exit
          fi
        else
          reverted=0;
        fi
    
        if((NOUT<=NJUMPS_POSRE || NEVER_USE_SORT_SHUFFLE==1 || MYNP==1)); then
          # Can not do shuffle/sort with posre
          mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}.xtc ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.xtc
          mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}.trr ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.trr
          mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}.gro ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.gro
        else
          echo System | ${ED}/trjconv -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.xtc -s ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr -n ${NEW}/${DIR}/deshuffledesort${MYMOL}_md${NOUT}.ndx -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.xtc &
          echo System | ${ED}/trjconv -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.trr -s ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr -n ${NEW}/${DIR}/deshuffledesort${MYMOL}_md${NOUT}.ndx -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.trr &
          echo System | ${ED}/trjconv -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}.gro -s ${NEW}/${DIR}/${MYMOL}_md${NOUT}.tpr -n ${NEW}/${DIR}/deshuffledesort${MYMOL}_md${NOUT}.ndx -o ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.gro &
          # Wait for all 3 desorts to finish
          wait
        fi
    
        echo ${NOUT} > finished_desort
        waitForExistNotEmpty 12 finished_desort ${NOUT}
        rm -f finished_mdrun
      fi
    
      # TEST
      if [ -e finished_desort ]; then
        TINIT=`cat finished_next_start_time`
        runHasNoErrors=1
        NOUT=`cat finished_desort`
        DIR=md${NOUT}_running
    
        # Reversion is important in cases where a crash or time overrun leads to loss of data in ${TMPDIR}
        if [ ! -e ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.xtc ]; then
          rm -f finished_desort
          echo "ERROR error: finished_desort existed for NOUT=$NOUT but ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.xtc did not exist"
          if((reverted==0)); then
            echo "       reverting to finished_test"
            reverted=1
            let "NIN=$NOUT-1"
            echo "$NIN" > finished_test
            waitForExistNotEmpty 12 finished_test ${NIN}
            continue
          else
            touch ./DO_NOT_RUN
            exit
          fi
        else
          reverted=0;
        fi
    
        ${ED}/gmxcheck -f ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.xtc 2> ${NEW}/${DIR}/checkXTC
        # Ensure no magic number error
        magicNumberError=`grep Error ${NEW}/${DIR}/checkXTC | wc -l`
        if((magicNumberError==1));then
          mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.xtc ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted_magicNumberError.xtc
          mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.trr ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted_magicNumberError.trr
          mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.gro ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted_magicNumberError.gro
          runHasNoErrors=0
        fi
        ## Ensure expected number of frames is reached -- this is system and mdp file specific
        numFramesXTC=`grep "^Step" ${NEW}/${DIR}/checkXTC | awk '{print $2}'`
        if((numFramesXTC!=NFRAMES_IN_XTC)); then
          mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.xtc ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted_incompleteFrames.xtc
          mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.trr ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted_incompleteFrames.trr
          mv ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted.gro ${NEW}/${DIR}/${MYMOL}_md${NOUT}_deshuffleddesorted_incompleteFrames.gro
          runHasNoErrors=0
        fi
        if((runHasNoErrors)); then
          mv ${NEW}/${DIR} md${NOUT}_success
          waitForExistNotEmpty -1 ${NEW}/${DIR}
          while [ -e ${NEW}/${DIR} ]; do
            mv ${NEW}/${DIR} md${NOUT}_success
            sleep ${LONG_SLEEP}
          done
          
          echo ${NOUT} > finished_test
          nextTime=`echo "${TINIT}+${TJUMP}" | bc -l`
          echo ${nextTime} > finished_next_start_time
          waitForExistNotEmpty 12 finished_test ${NOUT}
          waitForExistNotEmpty 12 finished_next_start_time ${nextTime} 
    
          # Now do some clean up operations
          let "NCLEAN=$NOUT-2"
          if [ -e md${NCLEAN}_success ]; then
            if [ ! -e DATA ]; then
              mkdir DATA
            fi
            if((NCLEAN!=0)); then
              # Don't remove or modify the starting directory
              mkdir DATA/md${NCLEAN}_success
              mv md${NCLEAN}_success/${MYMOL}_md${NCLEAN}_deshuffleddesorted.xtc DATA/md${NCLEAN}_success
              waitForExistNotEmpty -1 md${NCLEAN}_success/${MYMOL}_md${NCLEAN}_deshuffleddesorted.xtc
              while [ -e md${NCLEAN}_success/${MYMOL}_md${NCLEAN}_deshuffleddesorted.xtc ]; do
                mv md${NCLEAN}_success/${MYMOL}_md${NCLEAN}_deshuffleddesorted.xtc DATA/md${NCLEAN}_success
                sleep ${LONG_SLEEP}
              done
              mv md${NCLEAN}_success/${MYMOL}_md${NCLEAN}.edr DATA/md${NCLEAN}_success
              waitForExistNotEmpty -1 md${NCLEAN}_success/${MYMOL}_md${NCLEAN}.edr
              while [ -e md${NCLEAN}_success/${MYMOL}_md${NCLEAN}.edr ]; do
                mv md${NCLEAN}_success/${MYMOL}_md${NCLEAN}.edr DATA/md${NCLEAN}_success
                sleep ${LONG_SLEEP}
              done
              rm -rf md${NCLEAN}_success &
            fi
          fi
        else
          # Send the run back to do the mdrun
          for((i=1;i<=MAX_CONSECUTIVE_MDRUN_ERRORS;i++)); do
            if [ ! -e md${NOUT}_failure${i} ]; then
              break
            fi
          done
          mv ${NEW}/${DIR} md${NOUT}_failure${i}
          if((i>MAX_CONSECUTIVE_MDRUN_ERRORS)); then
            echo "Too many failures for run $NOUT"
            touch ./DO_NOT_RUN
            exit
          fi
          # Send it back by setting as if grompp just finished
          mkdir ${NEW}/${DIR} 
          mv md${NOUT}_failure${i}/${MYMOL}_md${NOUT}.tpr md${NOUT}_failure${i}/deshuffledesort${MYMOL}_md${NOUT}.ndx ${NEW}/${DIR}
          echo ${NOUT} > finished_grompp
          waitForExistNotEmpty 12 finished_grompp ${NOUT}
        fi
        rm -f finished_desort
      fi
    
      wait
    done 
    
    wait
    
    
    echo -n "TIMING TEST (end): "
    date
    
    
    Page last modified 14:16, 21 Sep 2010 by hess