Gromacs

Performance checklist

    There are many aspects that affect the performance of simulations in Gromacs. Most simulations require a lot of computational resources, so it can be worthwhile to optimize the use of those resources. Several of the issues in the list below can each make a difference of a factor of 2 in performance, so it is worth going through the checklist. The points about compilers/CUDA and the Verlet scheme only apply to Gromacs version 4.6 or later; the other points apply to all versions.

    Gromacs configuration:

    • Don't use double precision unless you're absolutely sure you need it.
    • Compile the FFTW library (yourself) with SSE support on x86.
    • On x86, use icc or gcc as compiler (not pgi or the Cray compiler).
    • Use a new compiler version, especially for gcc (4.6 and 4.7 improved a lot!).
    • MPI library: OpenMPI usually has good performance and causes little trouble.
    • From GROMACS version 4.6: make sure your compiler supports OpenMP.
    • With GPUs:
      • configure with -DGMX_GPU=ON;
      • use CUDA 5.0 or newer (7.0 for Maxwell);
        • the fastest CUDA versions with GROMACS are 5.5, 6.5 and 7.5.
      • use a recent CUDA driver.
    • If compiling on a cluster head node, make sure that GMX_CPU_ACCELERATION is appropriate for the compute nodes (see the example build commands after this list).
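
    As a rough illustration of the points above, building on an x86 Linux machine with an NVIDIA GPU might look like the following. The version numbers, directory paths and the AVX_256 acceleration level are examples only; adjust them to your compiler, FFTW/GROMACS versions and compute nodes.

        # Build single-precision FFTW with SSE/AVX support
        # (or let GROMACS build it for you with -DGMX_BUILD_OWN_FFTW=ON)
        cd fftw-3.3.4
        ./configure --enable-float --enable-sse2 --enable-avx --prefix=$HOME/fftw
        make -j 8 && make install

        # Configure GROMACS (4.6 or later) with CMake: single precision, OpenMP,
        # GPU support, and acceleration matching the compute nodes
        cd ../gromacs-4.6.7 && mkdir build && cd build
        cmake .. \
            -DCMAKE_C_COMPILER=gcc \
            -DGMX_DOUBLE=OFF \
            -DGMX_OPENMP=ON \
            -DGMX_GPU=ON \
            -DGMX_CPU_ACCELERATION=AVX_256 \
            -DCMAKE_PREFIX_PATH=$HOME/fftw
        make -j 8 && make install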

    Run setup:

    • For an approximately spherical solute, use a rhombic dodecahedron unit cell.
    • You can increase the time-step to 4 or 5 fs when using virtual interaction sites (pdb2gmx -vsite h).
    • At moderate to high parallelization, use the Verlet cut-off scheme (mdp option: cutoff-scheme = Verlet) for better performance, due to less load imbalance (see the example mdp snippet after this list).
    • To be able to use GPUs, you have to use the Verlet cut-off scheme.
    • To quickly test the performance with GPUs/the Verlet scheme, you can use "mdrun -testverlet" with an old tpr file. This should not be used for production work.
    • For massively parallel runs with PME, you might need to try different numbers of PME nodes (mdrun -npme ???) to achieve the best performance. g_tune_pme can help automate this search (see the sketch after this list).
    • For massively parallel runs (also mdrun -multi), or with a slow network, global communication can become a bottleneck and you can reduce it with mdrun -gcom (note that this does affect the frequency of temperature and pressure coupling).
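
    For example, the Verlet cut-off scheme and a 4 fs time step with virtual hydrogen sites correspond to mdp settings along these lines; the values shown are illustrative and should be adapted to your system.

        ; Verlet cut-off scheme (required for GPU runs)
        cutoff-scheme   = Verlet
        nstlist         = 20          ; mdrun may increase this automatically
        ; 4 fs time step, made possible by virtual sites (pdb2gmx -vsite h)
        dt              = 0.004
        constraints     = all-bonds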
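
    A sketch of tuning the number of PME nodes for a 64-rank run; the MPI launcher and the mdrun binary name (here mdrun_mpi) depend on your installation, so treat these commands as examples only.

        # Let g_tune_pme benchmark several PME node counts and launch the best run
        g_tune_pme -np 64 -s topol.tpr -launch

        # Or set the number of PME-only nodes by hand
        mpirun -np 64 mdrun_mpi -s topol.tpr -npme 16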

    Checking and improving performance:

    • Look at the end of the md.log file for the performance summary: cycle counts and wall-clock time for the different parts of the MD calculation. The PP/PME load ratio is also printed, with a warning when a lot of performance is lost due to imbalance.
    • Adjust the number of PME nodes and/or the cut-off and PME grid-spacing when there is a large PP/PME imbalance. Note that even with a small reported imbalance, the automated PME-tuning might have reduced the initial imbalance. You could still gain performance by changing the mdp parameters or increasing the number of PME nodes.
    • If neighbor searching takes a lot of time, increase nstlist (with the Verlet cut-off scheme, this automatically increases the neighbor-list buffer to keep the energy drift constant, at the cost of more non-bonded computation).
    • If "Comm. energies" takes a lot of time (a note will be printed in the log file), increase nstcalcenergy or use mdrun -gcom.
    • If all communication takes a lot of time, you might be running on too many cores, or you could try combined MPI/OpenMP parallelization with 2 or 4 OpenMP threads per MPI process (see the example command below).
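
    As an illustration of the last point, a hybrid run on 32 cores could use 8 MPI processes with 4 OpenMP threads each. The launcher, the mdrun binary name and the -gcom interval are assumptions; adjust them to your installation and system.

        # 8 MPI ranks x 4 OpenMP threads = 32 cores
        export OMP_NUM_THREADS=4
        mpirun -np 8 mdrun_mpi -ntomp 4 -s topol.tpr -gcom 100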