GPU acceleration


    Starting with version 4.6, GROMACS includes a brand-new, native GPU acceleration developed in Stockholm under the framework of a grant from the European Research Council (#209825), with heroic efforts in particular by Szilárd Páll and Berk Hess. This replaces the previous GPU code and comes with a number of important features:

    • The new GPU code is fast, and we mean it. Rather than speaking about relative speed, or speedup for a few special cases, this code is typically much faster (3-5x) even when compared to GROMACS running on all cores of a typical desktop. If you put two GPUs in a high-end cluster node, this too will result in a significant acceleration.
    • We have designed new algorithms targeting SIMD/streaming architectures as well as new parallelization architecture where we use both CPUs and GPUs for the calculation. This means:
      • Our accelerated (non-bonded) algorithms are not just "ports" or "tweaked" versions of the well-known, decades-old standard algorithms (i.e., Verlet list + linked cell list), but a complete algorithmic redesign providing inherently efficient execution on modern processors (both CPUs and GPUs).
      • We support a much wider range of settings with GPUs - pretty much any interaction based on reaction-field or PME electrostatics works.
      • We can use multiple GPUs efficiently, and the GPU acceleration works in combination with GROMACS' domain decomposition and load balancing code too, for arbitrary triclinic cells.
      • By using the CPU for part of the calculation, we retain full support for virtual interaction sites and other speedup techniques - you can use GPUs with very long timesteps and maintain accuracy (we're not simply making the hydrogens heavy, but properly removing selected internal degrees of freedom).
    • GPU acceleration is now a core part of GROMACS - as long as you have the CUDA development libraries installed, it will be enabled automatically during GROMACS configuration.
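To make the last point concrete, here is a hedged sketch of a typical GROMACS 4.6 build with GPU support. The CMake variables `GMX_GPU` and `CUDA_TOOLKIT_ROOT_DIR` are the standard ones; the directory paths and core count are placeholders for your own system:

```shell
# Build GROMACS 4.6 with native GPU acceleration (paths are illustrative)
tar xzf gromacs-4.6.tar.gz
cd gromacs-4.6
mkdir build && cd build
# GMX_GPU enables the CUDA non-bonded kernels; point CMake at your CUDA install
cmake .. -DGMX_GPU=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda
make -j 8
sudo make install
```

If the CUDA toolkit is on a standard path, CMake will normally detect it without the explicit `CUDA_TOOLKIT_ROOT_DIR` hint.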


    As we move to Gromacs-5.0 in a few weeks, the GPU setup will become even easier. The settings that were previously somewhat GPU-specific are becoming the new standard, simply because they are much more accurate, they scale much better, and even on a single core they are almost as fast as the old setup - and once you run on 8 cores or so on a workstation, the scaling benefits kick in. This also makes the algorithms used for the GPU code path the main way forward for the project - don't think of it as a vendor-specific acceleration branch!


    If you are new to GPU acceleration, you might be hesitant to shell out funds before you've seen it in action. Luckily, NVIDIA provides a free test-drive cluster where you can test their very latest high-end accelerators, such as the Tesla K40.


    The other alternative is to simply get a normal desktop GeForce card for testing. NVIDIA is generous enough that they have not handicapped GPU computing capabilities (much) even on consumer cards, which makes this a great way to get started without investing very large funds. You get the full performance if you buy a good gaming card (but don't bother with the extreme cards that really contain two GPUs; they are problematic to use efficiently). There are some things you lose compared to the professional hardware, though: the Tesla cards are extensively quality tested, and you get support directly from NVIDIA, while the GeForce cards are consumer-class and _occasionally_ break, in which case you need to take it up with the place you got them from. Speaking from experience, it is difficult to use GeForce cards in rackmount nodes. Since they have built-in fans (rather than relying on the fans in the rackmount node), they need more space (typically a 4U box), and even then we've had lots of problems with insufficient room for power connectors, etc. Basically, if you want the best possible cost efficiency and don't mind a hack now and then, GeForce might serve you quite well, but if you operate a supercomputing center you will likely be happier with Tesla cards in the long run (and no, NVIDIA does not pay us to say this - we use both types of cards ourselves for different servers).


    The underlying idea of the new GPU acceleration is illustrated by the core MD loop below. Lots of things are computed every step, but the most compute-intensive part is the nonbonded force calculation. The biggest asset of GROMACS - and our biggest challenge - is that this iteration is already very fast when running on CPUs, sometimes on the order of half a millisecond per step. Incidentally, this is why it has been so difficult to get a substantial GPU speedup in GROMACS - if each step took 100 ms it would be trivial to speed up (which is why some codes show amazing relative speedups).




    Thus, the idea behind the native GPU acceleration in GROMACS is that we offload the heavy nonbonded force calculation to an accelerator (either a GPU or something else), while the CPU computes bonded forces and lattice summation (PME) in the meantime. While this might sound straightforward, it posed a number of major challenges, since we needed to create new algorithms for efficient force calculation. This in turn required a new way to generate structures that describe the proximity of particles, rather than simply using old-fashioned neighbor lists - in particular since we also wanted this to work in parallel. In the end, it turned into a fairly advanced new flowchart where the main MD loop runs on both the CPU and GPU:




    The good news is that all this is fairly easy to use! The new neighbor structure required us to introduce a new variable called "cutoff-scheme" in the mdp file. The old GROMACS settings correspond to the value "group", while you must switch this to "verlet" to use GPU acceleration. You can also do this at the mdrun level for an old TPR file by using the command-line option "-testverlet". This new cutoff scheme is significantly more accurate than the old one; you can set the neighbor-searching radius rlist to any value you want, and GROMACS will automatically expand buffers to ensure that the energy drift stays low.
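As a sketch, there are two ways to enable the new scheme; the option names are those from the GROMACS 4.6 documentation, while the file names below are placeholders:

```shell
# Option 1 (preferred): set the scheme in the .mdp file before running grompp:
#
#     cutoff-scheme = verlet
#
# Option 2: switch at run time for an existing group-scheme .tpr file:
mdrun -testverlet -s topol.tpr -deffnm verlet_test
```

The `-testverlet` route is convenient for a quick comparison against an existing run, but for production work it is cleaner to regenerate the TPR file with the mdp option set.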

    The top part of any log file in GROMACS describes the configuration, and in particular whether your version has GPU support compiled in. If so, GROMACS will automatically use any GPUs it finds. However, since we use both CPUs and GPUs, we rely on a reasonable balance between CPU and GPU performance (although we can shift load between them to some extent).


    Example performance

    One of the critical features of GROMACS 4.6 is that GPU acceleration works in combination with all the other advanced features to improve simulation performance, so we have tried to showcase systems that make use of triclinic unit cells, virtual sites, etc - and also show what improvements these features lead to.


    Influence of box geometry and virtual interaction sites

    This is a simulation of the protein RNAse, which contains roughly 24,000 atoms in a cubic box. This type of box adds a lot of unnecessary water in the corners; a rhombic dodecahedron box is more spherical and only requires 17,000 atoms to achieve the same separation between periodic copies, and as is evident below, it leads to substantial performance improvements. The 6-core CPU in this case is an Intel Core i7-3930, while the dual 8-core system has Intel Xeon E5-2690 processors. Note that GROMACS' pure CPU performance is close to (or in some cases even faster than) that of some other GPU-accelerated MD implementations.
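For reference, switching a system to a rhombic dodecahedron box is a one-liner with the standard GROMACS 4.6 tools. This is a hedged sketch; the file names and the 1.0 nm solute-box distance are illustrative, not taken from the benchmark above:

```shell
# Put the solute in a rhombic dodecahedron box with >= 1.0 nm to the box edge
editconf -f protein.gro -o boxed.gro -bt dodecahedron -d 1.0
# Solvate; the dodecahedron needs ~29% less water than a cube at the same -d
genbox -cp boxed.gro -cs spc216.gro -o solvated.gro
```

(In GROMACS 5.0 and later, `genbox` became `gmx solvate`.)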




    The smaller the simulated systems are, the harder they are to accelerate since the amount of computation per step is even more limited. To showcase an example of the performance levels attainable for a very small system we have used the Villin headpiece, with full PME electrostatics, in a small rhombic dodecahedron box. This system can reach a microsecond per day on a low-end desktop that costs around $1000, with much more accurate settings (i.e., PME and long cutoffs) than used in the first simulation by Duan & Kollman in 1998 (which required several months on a supercomputer):


    The work that went into the GPU acceleration actually also improved CPU performance significantly when comparing release 4.6 with release 4.5. Here's an example of a membrane protein, the ligand-gated ion channel GLIC, embedded in a lipid bilayer and surrounded by water and ions, totaling about 150,000 atoms. This system uses 2 fs time steps, and got almost 40% better performance in Gromacs-4.6 even when staying completely on the CPU. Nevertheless, the new GPU performance leaves this far behind - it is roughly five-fold faster than Gromacs-4.5:



    The offloading architecture also enables Gromacs to employ GPU acceleration in parallel simulations using multiple nodes. This part of Gromacs is still undergoing heavy development, so it will improve further in future versions, but already the present release is very usable. As an example, we use the alcohol dehydrogenase protein (ADH) solvated and set up in a rectangular box (134,000 atoms), simulated with 2 fs steps on a Cray XK7, using either CPU-only or CPU+GPU nodes. Note that this is still a relatively small simulation; the GPU nodes provide much better performance in the low end (typically 3.5x faster than the CPU version), although the CPU-only version catches up when we are entirely limited by communication. On the right side of the plot, the simulation has scaled to the point where we only have ~130 atoms for each CPU core. All simulations use PME; the designation about PME tasks (or not) describes whether we use separate, dedicated PME ranks in the setup:



    In summary, the new Gromacs GPU implementation in release-4.6 is very much ready for general consumption, and going forward it will be the basis for all our acceleration work. You do not have to maintain a separate GPU-enabled version or really think about it at all; just try to enable GPU support when you are compiling Gromacs, remember to say "cutoff-scheme=verlet" in your mdp file, and it should just work. The log files will describe your performance, and also whether there are imbalances in usage between your CPU and GPU resources. There are some extra tricks to improve performance on nodes with lots of cores (e.g. Opteron machines with 32-64 cores and just 1-2 GPUs) - post to the mailing list if you are having problems!
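One such trick for nodes with many cores and few GPUs is to start several ranks per GPU. The sketch below uses standard GROMACS 4.6 mdrun options; the rank/thread counts and file name are illustrative for a hypothetical 32-core node with two GPUs, so adjust them to your hardware:

```shell
# Start 4 thread-MPI ranks with 8 OpenMP threads each (4 x 8 = 32 cores).
# The -gpu_id string lists one GPU id per rank, so "0011" shares
# GPU 0 between ranks 0-1 and GPU 1 between ranks 2-3.
mdrun -ntmpi 4 -ntomp 8 -gpu_id 0011 -s topol.tpr
```

The performance section at the end of the log file will tell you whether the CPU/GPU load balance is reasonable for a given split.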


    The input systems used above are available for download:


    Vendors providing systems with GROMACS-specific benchmarks

    The GPU landscape develops very rapidly, and GROMACS typically works great with the latest professional as well as consumer cards from all major vendors (both NVIDIA and AMD). Currently (November 2015), the best price/performance ratio is obtained with NVIDIA consumer GPUs such as GeForce Titan X or GeForce 980Ti, but AMD cards are also very good value for money. Professional cards from the Tesla/FirePro series mainly provide features such as error-correcting memory and double precision which GROMACS does not rely extensively on, but for dense cluster installations with high requirements for uptime and support this can still be the best option.

    We typically build simple nodes ourselves (since we prioritize price/performance), but if you prefer turn-key solutions, here is a list of vendors we have found to be both friendly and MD-aware: they provide GROMACS-specific benchmarks for the machines they sell, and they guarantee that you can reproduce those numbers with the GROMACS version on their hardware:



    (If you are a vendor: we are happy to include more links in this list if you provide nodes that come with GROMACS preinstalled and you guarantee that the benchmarks you cite are reproducible.)

    Page last modified 22:15, 19 Nov 2015 by lindahl