GROMACS: Fast, Free and Flexible MD
 
 
 
Scaling Benchmarks
Friday, 09 September 2005
It is easy to run GROMACS on multiple processors using MPI, and the performance can be very high if your system can be balanced over the different processors using the shuffling option of grompp.

When you compare scaling numbers, however, it is important to realize that a high-performance code is much more sensitive to communication overhead than a slower alternative. Since GROMACS can be an order of magnitude faster than other programs, it might not show as good relative scaling in all cases, but the raw speed, measured as picoseconds of simulation time carried out per day, usually outperforms the competition, and for real work that's the only interesting figure to compare.
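The trade-off between relative scaling and raw speed can be illustrated with a toy model. This is only a sketch: it assumes the wall time per step on N processors is the single-processor compute time divided by N plus a fixed communication cost, and all timings below are invented for illustration rather than measured with GROMACS.

```python
def step_time(compute_1cpu, comm, nproc):
    """Wall time per step: evenly divided compute plus a fixed communication cost."""
    if nproc == 1:
        return compute_1cpu  # no communication on a single processor
    return compute_1cpu / nproc + comm


def relative_scaling(compute_1cpu, comm, nproc):
    """Speed on nproc processors divided by nproc times the single-processor speed."""
    return compute_1cpu / (nproc * step_time(compute_1cpu, comm, nproc))


# Hypothetical numbers in arbitrary time units: the "fast" code needs ten
# times less compute time per step, and both pay the same communication cost.
FAST, SLOW, COMM, NPROC = 1.0, 10.0, 0.05, 8

for name, compute in (("fast", FAST), ("slow", SLOW)):
    eff = relative_scaling(compute, COMM, NPROC)
    speed = 1.0 / step_time(compute, COMM, NPROC)
    print(f"{name} code: scaling {eff:.0%}, speed {speed:.2f} steps per time unit")
```

In this model the slow code shows the better scaling percentage, while the fast code still completes several times more steps per unit of wall time.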

To optimize your parallel simulations, remember that the processors need to communicate relatively small amounts of data quite often. This means that low latency is more important than high bandwidth to get good scaling on a multi-processor machine.
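A common first-order way to reason about this is to model the time for one message as a fixed startup latency plus the message size divided by the bandwidth. The sketch below uses that model with two hypothetical interconnects; the latency, bandwidth, and message-size values are assumptions chosen for illustration, not properties of any network mentioned on this page.

```python
def message_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Simple cost model: startup latency plus transfer time."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


# Hypothetical interconnects (illustrative values only).
HIGH_BANDWIDTH = {"latency_s": 100e-6, "bandwidth_bytes_per_s": 1e9}
LOW_LATENCY = {"latency_s": 10e-6, "bandwidth_bytes_per_s": 1e8}

SMALL_MESSAGE = 1_000       # bytes, sent very often
LARGE_MESSAGE = 10_000_000  # bytes, for comparison

for label, size in (("small", SMALL_MESSAGE), ("large", LARGE_MESSAGE)):
    for name, net in (("high bandwidth", HIGH_BANDWIDTH), ("low latency", LOW_LATENCY)):
        t = message_time(size, **net)
        print(f"{label} message, {name} network: {t * 1e6:.0f} microseconds")
```

For the small message the latency term dominates, so the low-latency network wins despite its lower bandwidth; only for the large message does the high-bandwidth network come out ahead.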

GROMACS 3.0 parallel benchmarks

We have used the large DPPC membrane system to show the parallelization and scaling capabilities of GROMACS. As described on the single CPU benchmark page, this consists of 1024 DPPC lipids with 23 water molecules per lipid, totalling 121856 atoms. A twin-range group-based cut-off is used: 1.8 nm for electrostatics and 1.0 nm for Lennard-Jones interactions. The long-range contribution to electrostatics is updated every 10 steps.
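The atom count can be checked with a little arithmetic. The sketch below assumes a united-atom DPPC model with 50 atoms per lipid and a three-site water model; neither figure is stated on this page, but together they reproduce the quoted total.

```python
N_LIPIDS = 1024
WATERS_PER_LIPID = 23
ATOMS_PER_LIPID = 50  # assumption: united-atom DPPC model
ATOMS_PER_WATER = 3   # assumption: three-site water model

lipid_atoms = N_LIPIDS * ATOMS_PER_LIPID
water_atoms = N_LIPIDS * WATERS_PER_LIPID * ATOMS_PER_WATER
total_atoms = lipid_atoms + water_atoms

print(f"lipid atoms: {lipid_atoms}")
print(f"water atoms: {water_atoms}")
print(f"total atoms: {total_atoms}")
```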

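#Processors for Machine:

| Machine | CPU | Clock (MHz) | #1 | #2 | #4 | #8 | #12 | #16 | #20 | #24 | #32 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| IBM SP2 | Power 3 | 395 | 49 | 105 | 205 | 387 | 537 | 694 | 796 | 884 | 1019 |
| Scaling (IBM SP Switch) | | | 100% | 107% | 104% | 99% | 92% | 89% | 81% | 75% | 65% |
| Linux | Pentium III | 800 | 46 | 106 | 174 | 274 | 327 | 394 | - | - | - |
| Scaling (100Mb Ethernet) | | | 100% | 114% | 94% | 76% | 60% | 53% | - | - | - |
| Linux | Pentium III | 800 | 34 | 75 | 147 | 283 | 399 | 508 | 607 | 695 | 777 |
| Scaling (Scali network) | | | 100% | 111% | 108% | 104% | 99% | 94% | 89% | 86% | 72% |
| Linux | Xeon | 2800 | 128 | 358 | 447 | 583 | - | - | - | - | - |
| Scaling (100Mbit Ethernet) | | | 100% | 140% | 87% | 57% | - | - | - | - | - |
| Linux | Xeon | 2800 | 128 | 358 | 637 | 1109 | 1323 | 1563 | - | 1810 | - |
| Scaling (1000Mbit Ethernet) | | | 100% | 140% | 124% | 108% | 86% | 76% | - | 59% | - |

The IBM machine is configured with four processors on each node and uses the IBM Switch2 communication hardware. The Linux cluster has two processors on each node with normal 100 Mbit Ethernet networking. The Scali cluster has special low-latency, high-throughput interconnections. Although the Scali processors are as fast as those in the other Linux cluster, the assembly loops could not be used there (due to the Linux configuration). With the assembly loops the scaling could be slightly worse, since faster computation makes the communication overhead relatively larger.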
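The scaling percentages appear to be the measured performance divided by the processor count times the single-processor performance. That definition is inferred from the numbers rather than stated in the text, so treat the sketch below as an assumption; it recomputes the scaling row for the Xeon cluster with 1000 Mbit Ethernet from its performance figures.

```python
def scaling(perf_n, perf_1, nproc):
    """Parallel efficiency: performance relative to ideal linear speed-up."""
    return perf_n / (nproc * perf_1)


# Performance figures for the Xeon / 1000 Mbit Ethernet row, keyed by
# processor count (entries marked "-" in the table are omitted).
XEON_GIGABIT = {1: 128, 2: 358, 4: 637, 8: 1109, 12: 1323, 16: 1563, 24: 1810}

for nproc, perf in XEON_GIGABIT.items():
    eff = scaling(perf, XEON_GIGABIT[1], nproc)
    print(f"{nproc:2d} processors: {eff:.0%}")
```

Values above 100% mean that the run was more than N times faster than on a single processor.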
 

GROMACS 2.0 parallel benchmarks

These numbers concern the same system as above, simulated with the now obsolete 2.0 version of GROMACS. The latest version is about 40% faster on all platforms (except for Pentium III and Athlon processors, where the increase is almost 100%), but since these numbers cover a broader range of hardware they might still be of interest. Performance is quoted as ps/day.

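#Processors for Machine:

| Machine | CPU | Clock (MHz) | #1 | #2 | #4 | #8 | #16 | #32 | #64 |
|---|---|---|---|---|---|---|---|---|---|
| Linux | Pentium II | | 11 | 20 | 37 | 64 | 108 | - | - |
| Scaling | | | 100% | 91% | 85% | 74% | 62% | - | - |
| IBM SP2 | Power2 | 160 | 10 | - | 38 | 73 | 135 | 230 | 346 |
| Scaling | | | 100% | - | 90% | 87% | 81% | 69% | 52% |
| IBM SP2 | PPC 604 | 332 | 12 | - | 39 | 76 | 131 | 224 | - |
| Scaling | | | 100% | - | 84% | 82% | 70% | 60% | - |
| SGI PC | R10000 | 195 | 13 | 23 | 41 | 76 | 105 | - | - |
| Scaling | | | 100% | 90% | 80% | 74% | 52% | - | - |
| Cray O2000 | R10000 | 250 | 19 | 33 | 53 | 88 | 169 | - | - |
| Scaling | | | 100% | 88% | 71% | 59% | 57% | - | - |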

So, which machine do you recommend?

Well, it depends. If you don't have to pay for it yourself and absolutely want a "real" supercomputer, the IBM SP2 machines are very nice, and we will probably provide shared-memory parallelization (threads) for those in the future.

On the other hand, if cost is an issue, there is really no alternative to a cluster of dual Pentium or Athlon processors running Linux. That's what most of us are using in our labs.


 