GPU acceleration


    Gromacs version 4.6 and later include brand-new native GPU acceleration, developed in Stockholm under the framework of a grant from the European Research Council (#209825), with heroic efforts in particular by Szilard Pall. It replaces all previous trial GPU code and comes with a number of important features:


    • The new GPU code is fast, and we mean it. Rather than speaking about relative speed, or speedup for a few special cases, this code is typically much faster (3-5x) even when compared to Gromacs running on all cores of a typical desktop. If you put two GPUs in a high-end cluster node, this too will result in a significant acceleration.
    • We have designed a new architecture where we use both CPUs and GPUs for the calculation.
      • This means we support a much wider range of settings with GPUs - pretty much any interaction scheme based on reaction-field or PME works.
      • It also means we can use multiple GPUs efficiently, and the GPU acceleration works in combination with Gromacs' domain decomposition and load balancing code too, for arbitrary triclinic cells.
      • By using the CPU for part of the calculation, we retain full support for virtual interaction sites and other speedup techniques - you can use GPUs with very long timesteps and maintain accuracy (we're not simply making the hydrogens heavy, but properly removing selected internal degrees of freedom).
    • GPU acceleration is now a core part of Gromacs - as long as you have the CUDA development libraries installed, it will be enabled automatically during Gromacs configuration.
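As a sketch of what the automatic configuration looks like in practice (the source path is a placeholder; GMX_GPU is the real CMake option, normally switched on automatically when the CUDA toolkit is detected, but shown explicitly here):

```shell
# configure a CUDA-accelerated build; GMX_GPU defaults to ON when
# the CUDA toolkit is found, so passing it explicitly is optional
cmake /path/to/gromacs-4.6 -DGMX_GPU=ON
make
```

The top of any resulting log file will confirm whether GPU support ended up compiled in.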


    The underlying idea of the new GPU acceleration is the core MD loop, as illustrated below. Lots of things are computed every step, but the most compute-intensive part is the nonbonded force calculation. The biggest asset of Gromacs - and our biggest challenge - is that this iteration is already very fast on CPUs, sometimes on the order of half a millisecond per step. Incidentally, this is why it has been so difficult to get a substantial GPU speedup in Gromacs - if each step took 100 ms it would be trivial to speed up (which is why some codes show amazing relative speedups). 




    Thus, the idea in Gromacs-GPU (starting with release 4.6) is that we offload the heavy nonbonded force calculation to an accelerator (either a GPU or something else), while the CPU computes bonded forces and lattice summation (PME) in the meantime. While this might sound straightforward, it raised a number of major challenges, since we needed to create new algorithms for efficient force calculation. This in turn required us to come up with a new way to generate structures that describe the proximity of particles, rather than simply using old-fashioned neighborlists - in particular since we also wanted this to work in parallel. In the end, it turns out to be a pretty advanced new flowchart where the main MD loop runs both on the CPU and GPU:
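The overlap at the heart of this scheme can be sketched in a few lines of illustrative Python (this is not GROMACS code - the force functions are toy stand-ins, and the thread pool merely plays the role of the asynchronous GPU stream):

```python
# Illustrative sketch of the offload scheme: launch the nonbonded kernel
# asynchronously (the "GPU" task), do bonded + PME work on the CPU in the
# meantime, then wait and combine the forces for the integration step.
from concurrent.futures import ThreadPoolExecutor

def nonbonded_forces(positions):
    # stand-in for the offloaded nonbonded force kernel
    return [0.1 * x for x in positions]

def bonded_and_pme_forces(positions):
    # stand-in for the CPU-side bonded forces and lattice summation
    return [0.2 * x for x in positions]

def md_step(positions):
    with ThreadPoolExecutor(max_workers=1) as gpu:
        # enqueue the "GPU" work without blocking...
        future = gpu.submit(nonbonded_forces, positions)
        # ...while the CPU computes the remaining force terms
        cpu_forces = bonded_and_pme_forces(positions)
        gpu_forces = future.result()  # wait for the offloaded kernel
    return [a + b for a, b in zip(cpu_forces, gpu_forces)]

total = md_step([1.0, 2.0])  # total forces for a toy two-particle system
```

The point of the design is that neither processor sits idle: as long as the CPU and GPU workloads are reasonably balanced, the nonbonded calculation is effectively free.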




    The good news is that all this is fairly easy to use! The new neighbor structure required us to introduce a new variable called "cutoff-scheme" in the mdp file. The old Gromacs settings correspond to the value "group", while you must switch this to "verlet" to use GPU acceleration. You can also do this at the mdrun level for an old TPR file by using the command-line option "-testverlet". This new cutoff scheme is significantly more accurate than the old one; you can set the neighbor-searching radius rlist to any value you want, and Gromacs will automatically expand buffers to ensure that the energy drift stays low. 
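As a minimal example, the relevant mdp settings might look like this (cutoff-scheme and rlist are the options discussed above; the PME lines are just placeholders for a typical electrostatics setup):

```
; switch from the old group scheme to the new Verlet scheme
cutoff-scheme  = Verlet
; rlist can be set freely; mdrun buffers it automatically to keep drift low
rlist          = 1.0
; typical PME electrostatics setup (illustrative values)
coulombtype    = PME
rcoulomb       = 1.0
```

For an existing TPR file, running mdrun with "-testverlet" applies the same switch at run time without regenerating the input.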

    The top part of any log file in Gromacs will describe the configuration, and in particular whether your version has GPU support compiled in. If this is the case, Gromacs will automatically use any GPUs it finds. However, since we use both CPUs and GPUs, we rely on a reasonable balance between CPU and GPU performance (although we can balance the load between them to some extent).


    Example performance

    One of the critical features of Gromacs-4.6 is that GPU acceleration works in combination with all the other advanced features to improve simulation performance, so we have tried to showcase systems that make use of triclinic unit cells, virtual sites, etc - and also show what improvements these features lead to.


    Influence of box geometry and virtual interaction sites

    This is a simulation of the protein RNAse, which contains roughly 24,000 atoms in a cubic box. This type of box adds a lot of unnecessary water in the corners; a rhombic dodecahedron box is more spherical and only requires 17,000 atoms to achieve the same separation between periodic copies, and as evident below it leads to substantial performance improvements. The 6-core CPU in this case is an Intel Core i7-3930, while the dual 8-core system has Intel Xeon E5-2690 processors. Note that Gromacs' pure CPU performance is close to (or in some cases even better than) that of some other GPU-accelerated MD implementations.




    The smaller the simulated systems are, the harder they are to accelerate since the amount of computation per step is even more limited. To showcase an example of the performance levels attainable for a very small system we have used the Villin headpiece, with full PME electrostatics, in a small rhombic dodecahedron box. This system can reach a microsecond per day on a low-end desktop that costs around $1000, with much more accurate settings (i.e., PME and long cutoffs) than used in the first simulation by Duan & Kollman in 1998 (which required several months on a supercomputer):


    The work that went into the GPU acceleration actually also improved CPU performance significantly when comparing release 4.6 with release 4.5. Here's an example of a membrane protein, the ligand-gated ion channel GLIC, embedded in a lipid bilayer and surrounded by water and ions, totaling about 150,000 atoms. This system uses 2 fs time steps, and got almost 40% better performance in Gromacs-4.6, even when staying completely on the CPU. Nevertheless, the new GPU performance leaves this far behind - it is roughly five-fold faster than Gromacs-4.5:



    The offloading architecture also enables Gromacs to employ GPU acceleration in parallel simulations using multiple nodes. This part of Gromacs is still undergoing heavy development, so it will improve further in future versions, but already the present release is very usable. As an example, we use the alcohol dehydrogenase protein (134,000 atoms) simulated with 2 fs steps on a Cray XK7, using either CPU-only or CPU+GPU nodes. Note that this is still a relatively small simulation; the GPU nodes provide much better performance at the low end (typically 3.5x faster than the CPU version), although the CPU-only version catches up when we are entirely limited by the communication. On the right side of the plot, the simulation has scaled to the point where we only have ~130 atoms for each CPU core. All simulations use PME; the "PME tasks" designation indicates whether separate PME tasks are used in the setup:
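A run with separate PME tasks might be launched along these lines (the node count and the adh.tpr filename are illustrative; -npme is the mdrun option that reserves ranks for the long-range PME work):

```
# run across 8 MPI ranks; dedicate 2 of them to long-range PME,
# leaving the rest to the GPU-accelerated nonbonded and bonded forces
mpirun -np 8 mdrun_mpi -npme 2 -s adh.tpr
```

Whether dedicated PME tasks pay off depends on the node count and the CPU/GPU balance, which is exactly the trade-off the plot explores.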




    In summary, the new Gromacs GPU implementation in release 4.6 is very much ready for general consumption, and going forward it will be the basis for all our acceleration work. You do not have to maintain a separate GPU-enabled version or really think about it at all; just enable GPU support when you are compiling Gromacs, remember to set "cutoff-scheme = verlet" in your mdp file, and it should just work. The log files will describe your performance, and also whether there are imbalances in usage between your CPU and GPU resources. There are some extra tricks to improve performance on nodes with lots of cores (e.g. Opteron machines with 32-64 cores and just 1-2 GPUs) - post to the mailing list if you are having problems!

    Page last modified 09:24, 4 Nov 2013 by lindahl