GPU acceleration


    Version as of 05:45, 27 Nov 2020


    Gromacs version 4.6 and later include brand-new native GPU acceleration developed in Stockholm under the framework of a grant from the European Research Council (#209825), with heroic efforts in particular by Szilard Pall. This replaces all previous trial GPU code and comes with a number of important features:


    • The new GPU code is fast, and we mean it. We are not talking about relative speed, or about speedup for a few special cases: this code is typically much faster (3-5x) even when compared to Gromacs running on all cores of a typical desktop. If you put two GPUs in a high-end cluster node, this too will result in a significant acceleration.
    • We have designed a new architecture where we use both CPUs and GPUs for the calculation.
      • This means we support a much wider range of settings with GPUs - pretty much any interactions based on reaction-field or PME work.
      • It also means we can use multiple GPUs efficiently, and the GPU acceleration works in combination with Gromacs' domain decomposition and load balancing code too, for arbitrary triclinic cells.
      • By using the CPU for part of the calculation, we retain full support for virtual interaction sites and other speedup techniques - you can use GPUs with very long timesteps and maintain accuracy (we're not simply making the hydrogens heavy, but properly removing selected internal degrees of freedom).
    • GPU acceleration is now a core part of Gromacs - as long as you have the CUDA development libraries installed, it will be enabled automatically during Gromacs configuration.


    The new GPU acceleration is built around the core MD loop, as illustrated below. Many things are computed every step, but the most compute-intensive part is the nonbonded force calculation. The biggest asset of Gromacs - and our biggest challenge - is that this iteration is already very fast when running on CPUs in Gromacs, sometimes on the order of half a millisecond. Incidentally, this is why it has been so difficult to get a substantial GPU speedup in Gromacs - if each step took 100 ms it would be trivial to speed up (which is why some codes show amazing relative speed).
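    The step-time argument above can be made concrete with a little arithmetic. The following is a minimal sketch, not a measurement: it assumes that a fixed fraction of each step is offloaded, that the offloaded part runs some factor faster, and that every step pays a fixed overhead for launching work and moving data. The function name and all numbers (80% offloaded, 10x kernel speedup, 0.2 ms overhead) are hypothetical and chosen only for illustration.

```python
def overall_speedup(step_ms, offload_fraction, kernel_speedup, overhead_ms):
    """Whole-step speedup when part of the step is accelerated.

    step_ms          -- time per MD step on the CPU alone
    offload_fraction -- share of that time spent in the offloaded part
    kernel_speedup   -- how much faster the offloaded part runs
    overhead_ms      -- fixed per-step cost of using the accelerator
    All values are illustrative assumptions, not Gromacs measurements.
    """
    accelerated = step_ms * offload_fraction / kernel_speedup
    remaining = step_ms * (1.0 - offload_fraction)
    return step_ms / (remaining + accelerated + overhead_ms)


# Same hypothetical accelerator, two very different starting points.
fast_code = overall_speedup(0.5, 0.8, 10.0, 0.2)    # step already takes 0.5 ms
slow_code = overall_speedup(100.0, 0.8, 10.0, 0.2)  # step takes 100 ms

print(f"0.5 ms step: {fast_code:.2f}x, 100 ms step: {slow_code:.2f}x")
```

    With the same accelerator and the same overhead, the code that starts from a 100 ms step reports a much larger relative speedup than the code that starts from a 0.5 ms step, because the fixed overhead is negligible in one case and a large share of the step in the other.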




    Thus, the idea in Gromacs-GPU (starting with release 4.6) is that we offload the heavy nonbonded force calculation to an accelerator (either a GPU or something else), while the CPU does bonded forces and lattice summation (PME) in the meantime. While this might sound straightforward, it raised a number of major challenges, since we needed to create new algorithms for efficient force calculation. This in turn required us to come up with a new way to generate structures that describe the proximity of particles, rather than simply using old-fashioned neighbor lists - in particular since we also wanted this to work in parallel. The end result is a fairly advanced new flowchart in which the main MD loop runs on both the CPU and the GPU:
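    The control flow of this offload scheme can be sketched in a few lines. This is a toy illustration and not the Gromacs implementation: a worker thread stands in for the accelerator, the "nonbonded" term is a simple 1D pairwise repulsion, and the "bonded" term is a harmonic spring. All function names, parameters and coordinates are hypothetical. Python threads also do not give real parallel speedup for this kind of work, so the sketch shows only the ordering of launch, overlap and wait.

```python
from concurrent.futures import ThreadPoolExecutor


def nonbonded_forces(x, cutoff=1.0):
    """Toy 1D pairwise repulsion (magnitude 1/r^2) inside a cutoff.
    Stands in for the heavy part that is offloaded."""
    forces = [0.0] * len(x)
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            d = x[i] - x[j]
            r = abs(d)
            if 0.0 < r < cutoff:
                fij = d / r**3      # pushes i away from j
                forces[i] += fij
                forces[j] -= fij    # Newton's third law
    return forces


def bonded_forces(x, bonds, k=100.0, r0=0.5):
    """Harmonic bonds between listed pairs. Stands in for the CPU-side work."""
    forces = [0.0] * len(x)
    for i, j in bonds:
        d = x[j] - x[i]
        r = abs(d)
        fi = k * (r - r0) * d / r   # pulls i toward j when stretched
        forces[i] += fi
        forces[j] -= fi
    return forces


def compute_forces_overlapped(x, bonds, accelerator):
    # 1. Launch the nonbonded work; submit() returns immediately.
    pending = accelerator.submit(nonbonded_forces, x)
    # 2. Do the bonded work on the calling thread in the meantime.
    f_bonded = bonded_forces(x, bonds)
    # 3. Wait for the offloaded result and combine both contributions.
    f_nonbonded = pending.result()
    return [a + b for a, b in zip(f_bonded, f_nonbonded)]


x = [0.0, 0.6, 1.1, 1.9]        # hypothetical particle positions
bonds = [(0, 1), (2, 3)]        # hypothetical bonded pairs
with ThreadPoolExecutor(max_workers=1) as accelerator:
    forces = compute_forces_overlapped(x, bonds, accelerator)
print(forces)
```

    The step can only finish when both parts are done, so in this picture the time per step is set by whichever side is slower.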




    The good news is that all this is fairly easy to use! The new neighbor structure required us to introduce a new variable called "cutoff-scheme" in the mdp file. The old Gromacs settings correspond to the value "group", and you must switch this to "verlet" to use GPU acceleration. You can also do this at the mdrun level for an old TPR file by using the command-line option "-testverlet". This new cutoff scheme is significantly more accurate than the old one; you can set the neighbor-searching radius rlist to any value you want, and Gromacs will automatically expand buffers to ensure that the energy drift stays low.
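    Switching an existing mdp file to the new scheme is a one-line change. The following is a minimal sketch of a helper that makes that change in mdp text; the helper name and the sample file contents are hypothetical, and the only Gromacs-specific facts it relies on are the "cutoff-scheme" option named above, ";" as the mdp comment character, and the mdp convention that dashes and underscores in option names are treated alike.

```python
def set_mdp_option(mdp_text, key, value):
    """Return mdp_text with `key = value` set.

    An existing assignment of the option is replaced; otherwise the
    option is appended at the end. Other lines are kept unchanged.
    """
    def normalize(name):
        # Treat 'cutoff_scheme' and 'cutoff-scheme' as the same option.
        return name.strip().replace("_", "-")

    lines, found = [], False
    for line in mdp_text.splitlines():
        body = line.split(";", 1)[0]          # ignore trailing comment
        if "=" in body and normalize(body.split("=", 1)[0]) == normalize(key):
            lines.append(f"{key} = {value}")  # replace the old assignment
            found = True
        else:
            lines.append(line)
    if not found:
        lines.append(f"{key} = {value}")
    return "\n".join(lines) + "\n"


# Hypothetical mdp contents using the old scheme.
old_mdp = """\
; illustrative run parameters
integrator    = md
cutoff_scheme = group   ; old scheme
coulombtype   = PME
"""

new_mdp = set_mdp_option(old_mdp, "cutoff-scheme", "verlet")
print(new_mdp)
```

    The edited text would then be processed in the usual way to produce a new TPR file.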

    The top part of any Gromacs log file describes the configuration, and in particular whether your version has GPU support compiled in. If it does, Gromacs will automatically use any GPUs it finds. However, since we use both CPUs and GPUs, we rely on a reasonable balance between CPU and GPU performance (although we can shift load between them to some extent).


    Example performance

    One of the critical features of Gromacs-4.6 is that GPU acceleration works in combination with all the other advanced features to improve simulation performance, so we have tried to showcase systems that make use of triclinic unit cells, virtual sites, etc., and also to show what improvements these features lead to.








    Page last modified 08:44, 4 Nov 2013 by lindahl