| Scaling Benchmarks |
| Friday, 09 September 2005 |
|
It is easy to run GROMACS on multiple processors using MPI, and
the performance can be very high if your system can be balanced over
the different processors using the shuffling option of grompp. When
you compare scaling numbers it is, however, important to realize that a
high-performance code is much more sensitive to communication
overhead than a slower alternative. Since GROMACS can be an order of
magnitude faster than other programs, it might not show as good relative scaling
in all cases, but the raw speed, measured as picoseconds of simulation
time carried out per day, usually outperforms the competition, and for
real work that is the only interesting figure to compare.

To optimize your parallel simulations, remember that the processors need
to communicate relatively small amounts of data quite often. This means
that low latency is more important than high bandwidth to get good
scaling on a multi-processor machine.

GROMACS 3.0 parallel benchmarks

We have used the large DPPC membrane system to show the parallelization and scaling capabilities of GROMACS. As described on the single CPU benchmark page,
this consists of 1024 DPPC lipids with 23 water molecules per lipid,
totalling 121856 atoms. A twin-range, group-based cut-off is used:
1.8 nm for electrostatics and 1.0 nm for Lennard-Jones interactions.
The long-range contribution to electrostatics is updated every 10 steps.

#Processors for Machine:
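
The system size quoted above can be checked with a little arithmetic. The sketch below assumes a united-atom DPPC model with 50 atoms per lipid and a three-site water model. Neither figure is stated on this page; they are inferred because together they reproduce the quoted total.

```python
# Sanity check of the atom count quoted for the DPPC benchmark system.
# ASSUMPTIONS (not stated on this page): 50 atoms per united-atom DPPC
# lipid and 3 atoms per water molecule.
N_LIPIDS = 1024
WATERS_PER_LIPID = 23
ATOMS_PER_LIPID = 50   # assumed
ATOMS_PER_WATER = 3    # assumed


def total_atoms(n_lipids, waters_per_lipid, atoms_per_lipid, atoms_per_water):
    """Return the total number of atoms in a hydrated lipid system."""
    atoms_per_hydrated_lipid = atoms_per_lipid + waters_per_lipid * atoms_per_water
    return n_lipids * atoms_per_hydrated_lipid


if __name__ == "__main__":
    print(total_atoms(N_LIPIDS, WATERS_PER_LIPID, ATOMS_PER_LIPID, ATOMS_PER_WATER))
```

Under these assumptions each hydrated lipid contributes 119 atoms, which gives the 121856 atoms quoted for the benchmark.
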

The IBM machine is configured with four processors on each node and uses
the IBM Switch2 communication hardware. The Linux cluster has two
processors on each node with normal 100 Mbit Ethernet networking. The
Scali cluster has special low-latency, high-throughput interconnects.
Although the Scali processors are as fast as those in the other Linux
cluster, the assembly loops could not be used there (due to the Linux
configuration). With the assembly loops enabled, the scaling on that
cluster could therefore be slightly worse than reported here.

GROMACS 2.0 parallel benchmarks

These
numbers concern the same system as above, simulated with the now
obsolete 2.0 version of GROMACS. The latest version is about 40% faster
on all platforms (except for Pentium III and Athlon processors, where
the increase is almost 100%), but since these numbers cover a broader range of
hardware they might still be of interest. Performance is quoted as ps/day.

#Processors for Machine:
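
Because performance on this page is quoted as ps/day, it may help to see how that figure relates to relative scaling. The sketch below is a toy model, not the GROMACS implementation: it assumes the compute work divides evenly over the processors while the communication cost per step stays fixed, and it models that cost as a latency term plus a bandwidth term. All function names and all numbers in the demo are hypothetical and chosen only to illustrate the argument made at the top of this page.

```python
# Toy scaling model. Every number in the demo is hypothetical and chosen
# only to illustrate the argument; none is a measured GROMACS result.
SECONDS_PER_DAY = 86400.0


def ps_per_day(n_steps, timestep_fs, wall_seconds):
    """Simulated picoseconds per day of wall-clock time."""
    simulated_ps = n_steps * timestep_fs / 1000.0  # fs -> ps
    return simulated_ps * SECONDS_PER_DAY / wall_seconds


def comm_time(n_messages, latency_s, n_bytes, bandwidth_bytes_per_s):
    """Communication cost per step: a latency term plus a bandwidth term."""
    return n_messages * latency_s + n_bytes / bandwidth_bytes_per_s


def step_time(compute_s, n_procs, comm_s):
    """Wall time per step: compute divides evenly, communication does not."""
    if n_procs == 1:
        return compute_s  # a single processor has nothing to exchange
    return compute_s / n_procs + comm_s


def efficiency(compute_s, n_procs, comm_s):
    """Relative scaling: speedup over one processor divided by n_procs."""
    t1 = step_time(compute_s, 1, comm_s)
    tn = step_time(compute_s, n_procs, comm_s)
    return t1 / (n_procs * tn)


if __name__ == "__main__":
    comm = comm_time(n_messages=100, latency_s=1e-4,
                     n_bytes=1e5, bandwidth_bytes_per_s=1e7)
    for name, compute in (("fast code", 1.0), ("slow code", 10.0)):
        t8 = step_time(compute, 8, comm)
        rate = ps_per_day(n_steps=1, timestep_fs=2.0, wall_seconds=t8)
        print(f"{name}: efficiency on 8 procs = {efficiency(compute, 8, comm):.2f}, "
              f"{rate:.0f} ps/day")
```

In this model the slower code shows the better relative scaling, because the same communication cost is a smaller fraction of its step time, yet the faster code still delivers more ps/day. The latency term does not depend on message size, so when messages are small and frequent it is the part of the cost that a faster network link does little to reduce.
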

So, which machine do you recommend?

Well, it depends. If you don't have to pay for it yourself and absolutely
want a "real" supercomputer, the IBM SP2 machines are very nice, and we
will probably provide shared-memory parallelization (threads) for those
in the future. On the other hand, if cost is an issue,
there is really no alternative to a cluster of dual Pentium or Athlon
processors running Linux. That's what most of us are using in our labs. |






