Gromacs

GPU acceleration


    Gromacs version 4.6 and later include brand-new native GPU acceleration, developed in Stockholm under a grant from the European Research Council (#209825), with heroic efforts in particular by Szilard Pall. This replaces all previous trial GPU code and comes with a number of important features:

     

    • The new GPU code is fast, and we mean it. Rather than quoting relative speedups for a few special cases, this code is typically 3-5x faster even when compared to Gromacs running on all cores of a typical desktop. Putting two GPUs in a high-end cluster node also gives a significant acceleration.
    • We have designed a new architecture where we use both CPUs and GPUs for the calculation.
      • This means we support a much wider range of settings with GPUs - pretty much any interaction setup based on reaction-field or PME works.
      • It also means we can use multiple GPUs efficiently, and the GPU acceleration works in combination with Gromacs' domain decomposition and load balancing code too, for arbitrary triclinic cells.
      • By using the CPU for part of the calculation, we retain full support for virtual interaction sites and other speedup techniques - you can use GPUs with very long timesteps and maintain accuracy (we're not simply making the hydrogens heavy, but properly removing selected internal degrees of freedom).
    • GPU acceleration is now a core part of Gromacs - as long as you have the CUDA development libraries installed, it will be enabled automatically during Gromacs configuration (see the configuration sketch below).
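
    As a minimal sketch (paths are placeholders, adjust for your system), a typical out-of-source CMake configuration with GPU support explicitly enabled looks roughly like this:

        # illustrative only; GMX_GPU controls building the CUDA kernels
        cmake .. -DGMX_GPU=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda
        make
        make install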

     

    The new GPU acceleration is built around the core MD loop, illustrated below. Many things are computed every step, but the most compute-intensive part is the nonbonded force calculation. The biggest asset of Gromacs - and our biggest challenge - is that this loop is already very fast on CPUs, sometimes taking on the order of half a millisecond per step. Incidentally, this is why it has been so difficult to get a substantial GPU speedup in Gromacs - if each step took 100 ms it would be trivial to accelerate (which is why some codes show amazing relative speedups).

     

    [Figure: the core MD loop, with the nonbonded force calculation as the dominant cost (gmx_4.6_gpu_acceleration.png)]

     

    Thus, the idea in Gromacs-GPU (starting with release 4.6) is that we offload the heavy nonbonded force calculation to an accelerator (either a GPU or something else), while the CPU computes bonded forces and lattice summation (PME) in the meantime. While this might sound straightforward, it posed a number of major challenges, since we needed new algorithms for efficient force calculation. This in turn required a new way to generate structures describing the proximity of particles, rather than simply using old-fashioned neighborlists - in particular since we also wanted this to work in parallel. The end result is a fairly advanced flowchart in which the main MD loop runs on both the CPU and the GPU:

     

    [Figure: flowchart of the parallel MD loop running on both CPU and GPU (gmx_4.6_gpu_acceleration_flow_parallel.png)]
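
    To make the division of labor in the flowchart concrete, here is a minimal CUDA sketch of the overlap idea - not Gromacs source code, and with placeholder force routines - showing a nonbonded kernel launched asynchronously while the CPU works on bonded forces and PME, followed by a reduction of the two force contributions:

        #include <cstdio>
        #include <vector>
        #include <cuda_runtime.h>

        // Placeholder for the GPU nonbonded kernel (the real one evaluates
        // pair interactions from cluster pair lists built from x).
        __global__ void nonbonded_forces(const float3 *x, float3 *f, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) {
                f[i].x = 0.f; f[i].y = 0.f; f[i].z = 0.f;
            }
        }

        // Placeholder for the CPU-side bonded-force and PME work.
        static void bonded_and_pme_forces(std::vector<float3> &f)
        {
            for (auto &fi : f) { fi.x = fi.y = fi.z = 0.f; }
        }

        int main()
        {
            const int n = 1024;
            std::vector<float3> f_cpu(n);
            float3 *x_h, *f_gpu_h;          // pinned host buffers so copies can overlap
            cudaMallocHost(&x_h, n * sizeof(float3));
            cudaMallocHost(&f_gpu_h, n * sizeof(float3));
            float3 *x_d, *f_d;
            cudaMalloc(&x_d, n * sizeof(float3));
            cudaMalloc(&f_d, n * sizeof(float3));
            cudaStream_t s;
            cudaStreamCreate(&s);

            // One MD step: ship coordinates and launch the nonbonded work asynchronously...
            cudaMemcpyAsync(x_d, x_h, n * sizeof(float3), cudaMemcpyHostToDevice, s);
            nonbonded_forces<<<(n + 127) / 128, 128, 0, s>>>(x_d, f_d, n);
            cudaMemcpyAsync(f_gpu_h, f_d, n * sizeof(float3), cudaMemcpyDeviceToHost, s);

            // ...while the CPU computes bonded forces and lattice summation concurrently.
            bonded_and_pme_forces(f_cpu);

            // Wait for the GPU, then combine the force contributions before integration.
            cudaStreamSynchronize(s);
            for (int i = 0; i < n; ++i) {
                f_cpu[i].x += f_gpu_h[i].x;
                f_cpu[i].y += f_gpu_h[i].y;
                f_cpu[i].z += f_gpu_h[i].z;
            }
            printf("step done\n");

            cudaFreeHost(x_h); cudaFreeHost(f_gpu_h);
            cudaFree(x_d); cudaFree(f_d);
            cudaStreamDestroy(s);
            return 0;
        }

    In the real code, the pair search, domain decomposition and dynamic load balancing between CPU and GPU wrap around this basic pattern.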

     

    The good news is that all this is fairly easy to use! The new neighbor structure required us to introduce a new variable called "cutoff-scheme" in the mdp file. The old group-based scheme is still available and remains the default, but the new Verlet scheme is required to use the native GPU acceleration.
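
    As an illustration (the values are placeholders, not recommendations), the relevant part of an mdp file using the new scheme might look like:

        ; the Verlet cutoff scheme is required for the native GPU kernels
        cutoff-scheme    = Verlet
        coulombtype      = PME
        rcoulomb         = 1.0
        rvdw             = 1.0
        nstlist          = 20   ; treated as a minimum; mdrun may increase it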
