Parallelization schemes and GPU acceleration: Szilárd Páll, Session 2B


    This session might end up being a mix of lecture/discussion and hands-on tutorial, depending on whether internet access is available and whether the participants have laptops with accelerator hardware (read: Nvidia cards). Laptop GPUs are really not fast enough for acceleration anyway; while the model numbers might sound very similar to the top-of-the-line desktop GPUs, there is usually a small "m" appended, and that mobile designation means it is a much slower card.

    However, we have prepared ~20 guest accounts on our system in Stockholm, and if internet access is available we will help you connect to five nodes, each equipped with dual Xeon processors and dual high-end Nvidia GPUs. The accounts will only be available during the tutorial, and you will get login information when we meet.

    Over the last few years, hardware has become more complex, with different types of accelerators and several levels of parallelization. GROMACS is pretty good at using this hardware, and GROMACS 4.6 is typically clearly faster than GROMACS 4.5 even if you don't do anything at all. If you are willing to go a bit further and understand how modern parallelization and hardware really work, you can get considerably better performance still.


    For this session, I will spend roughly half an hour going through a number of slides prepared by Szilárd Páll, who is our big GROMACS GPU guru. Understanding the limitations (and possibilities) of the implementation will hopefully help you both design and set up your simulations in better ways. It also turns out that many of the optimizations we did for GPUs are really nice for CPUs too, in particular when we run in parallel.


    After the presentation, we will work with two different simulation systems, and try a number of different settings (there is a separate directory with mdp parameter files):


    The first is a simulation of the protein RNAse, which consists of just over 24,000 atoms. We will start by simulating this on a single thread, then enable parallelization, then show what we can achieve with advanced GROMACS features such as rhombic dodecahedron boxes and virtual interaction sites for hydrogens, and finally run all of this on accelerators too. At least in our group, it is really important that all these performance-enhancing options work at the same time, rather than having to choose one of them.
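
    To give a feeling for why the box shape matters, here is a minimal sketch that compares the volume of a cubic box with that of a rhombic dodecahedron for the same distance between periodic images. The geometric factor (sqrt(2)/2, roughly 0.71) is standard; the image distance of 7 nm is a made-up value for illustration and is not taken from the tutorial files.

```python
import math


def box_volume(image_distance_nm, box_type):
    """Volume (nm^3) of a periodic box with the given periodic image distance."""
    d3 = image_distance_nm ** 3
    if box_type == "cubic":
        return d3
    if box_type == "dodecahedron":
        # Rhombic dodecahedron: V = (sqrt(2) / 2) * d^3
        return math.sqrt(2) / 2 * d3
    raise ValueError(f"unknown box type: {box_type}")


d = 7.0  # hypothetical image distance in nm, not from the tutorial
cubic = box_volume(d, "cubic")
dodeca = box_volume(d, "dodecahedron")
saving = 1 - dodeca / cubic

print(f"cubic box:           {cubic:.1f} nm^3")
print(f"rhombic dodecahedron: {dodeca:.1f} nm^3")
print(f"volume saved:         {saving:.0%}")
```

    Since the saved volume is almost entirely water, fewer solvent molecules have to be simulated for the same protein and the same separation between periodic copies.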


    You will see that it is difficult to push the parallelization of a small system very far (there simply isn't enough data to compute on), so we also have a second, larger system: an alcohol dehydrogenase with roughly 135,000 atoms. You will also perform some practical studies on this protein during the workshop reception Friday evening :-)
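
    The "not enough data" argument can be made concrete with some simple arithmetic. The sketch below divides the (rounded) atom counts quoted above by a few core counts; the core counts are arbitrary illustrations and do not describe the tutorial nodes.

```python
# Rounded atom counts quoted in the text above.
SYSTEMS = {
    "RNAse": 24_000,
    "alcohol dehydrogenase": 135_000,
}


def atoms_per_core(n_atoms, n_cores):
    """Average number of atoms each core is responsible for."""
    return n_atoms / n_cores


# Illustrative core counts only; pick whatever matches your hardware.
for n_cores in (1, 4, 16, 64):
    for name, n_atoms in SYSTEMS.items():
        share = atoms_per_core(n_atoms, n_cores)
        print(f"{name:>22} on {n_cores:>2} cores: {share:>9.1f} atoms/core")
```

    The fewer atoms each core has to handle, the larger the share of each step that goes to communication and synchronization rather than computation, which is why the larger system should keep scaling for longer.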


    For both systems, the idea is to have you play around with the settings and see how they affect performance. We will also try a number of hardware features such as pinning threads to CPU cores. If you have time, it might be a good idea to also see what happens when you try your own system!
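
    As a rough sketch of what "playing around with settings" can look like, the snippet below assembles a couple of mdrun command lines without executing them. The options used (-deffnm, -ntmpi, -ntomp, -pin, -gpu_id) are the GROMACS 4.6 mdrun options as I recall them, so please check them against "mdrun -h" on the system you use. The run name "rnase" and the thread counts are hypothetical and should be adapted to the actual files and core counts.

```python
def mdrun_command(deffnm, ntmpi, ntomp, pin=True, gpu_id=None):
    """Assemble an mdrun argument list; nothing is executed here."""
    cmd = [
        "mdrun",
        "-deffnm", deffnm,       # default name for all input/output files
        "-ntmpi", str(ntmpi),    # number of thread-MPI ranks
        "-ntomp", str(ntomp),    # OpenMP threads per rank
        "-pin", "on" if pin else "off",  # pin threads to CPU cores
    ]
    if gpu_id is not None:
        cmd += ["-gpu_id", gpu_id]  # one GPU id digit per rank, e.g. "01"
    return cmd


# Hypothetical settings: a single-thread baseline and a pinned two-GPU run.
single = mdrun_command("rnase", ntmpi=1, ntomp=1, pin=False)
pinned = mdrun_command("rnase", ntmpi=2, ntomp=4, pin=True, gpu_id="01")

print(" ".join(single))
print(" ".join(pinned))
```

    Building the argument list in one place makes it easy to loop over thread counts and pinning choices and then compare the performance figures reported at the end of each log file.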


    Page last modified 10:31, 13 Sep 2013 by lindahl