To give an overview of GROMACS molecular dynamics performance, and to compare the simulation speed attainable on some of the most common hardware platforms, we have constructed a benchmark of a few typical systems. The benchmarks all represent "real-life" examples, i.e. they have been taken from ongoing research projects in our labs or from published articles.
The performance in the table below is quoted as picoseconds of simulated time per day of wall-clock time, i.e. higher is better. For practical reasons we have also included the results when running on two processors with a dual Pentium III; more extensive parallel results are available on the scaling benchmark page.
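To make the ps/day unit concrete, here is a minimal sketch that converts a performance figure to ns/day and estimates the wall-clock time needed to reach a given simulated time. The function names are our own illustration and not part of GROMACS; the 2412 ps/day figure is the Villin result for the first table entry.

```python
# Unit conversions for the ps/day performance figures.
# The helper names below are illustrative, not part of GROMACS.
PS_PER_NS = 1_000.0
PS_PER_US = 1_000_000.0


def ps_per_day_to_ns_per_day(ps_per_day):
    """Convert a performance figure from ps/day to ns/day."""
    return ps_per_day / PS_PER_NS


def days_to_simulate(target_ps, ps_per_day):
    """Wall-clock days needed to simulate target_ps picoseconds."""
    if ps_per_day <= 0:
        raise ValueError("ps_per_day must be positive")
    return target_ps / ps_per_day


if __name__ == "__main__":
    villin = 2412.0  # ps/day, Villin on the 800 MHz Athlon (first table entry)
    print(f"{ps_per_day_to_ns_per_day(villin):.3f} ns/day")
    print(f"{days_to_simulate(PS_PER_US, villin):.0f} days for one microsecond")
```

At that speed, a microsecond of simulated time would take well over a year of wall-clock time on a single processor.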
The benchmark systems are:
- Villin: The villin headpiece, a 35-residue peptide whose structure has been determined by NMR (McKnight et al, 1997). This was used in a microsecond peptide folding simulation by Duan and Kollman (1998), and we chose an essentially identical setup for this benchmark. The system was simulated with 3000 water molecules in a truncated octahedron unit cell (slightly less than 10,000 atoms in total), using a group-based cut-off at 0.8 nm for both electrostatic and Lennard-Jones interactions. A time step of 2 fs was employed; LINCS was used to constrain protein bonds involving hydrogens, while SETTLE was used to maintain the water geometry. Neighborlists were used and updated every 10 fs. Berendsen temperature coupling was applied with a time constant of 0.1 ps, and pressure coupling with a time constant of 20 ps. For the peptide, the GROMOS96 force field was used, while the water was described with the TIP3P model.
- Lys/Cut: This is the larger lysozyme protein (PDB entry 2LZM), simulated using the GROMOS96 force field in SPC water. The total number of atoms was 23,207. All hydrogens on the protein were treated as dummy particles to remove the fast bond and angle vibrations, allowing a time step of 4 fs (see the dummy section in the GROMACS manual). A rhombic dodecahedron simulation box was used, which is the most spherical alternative and hence the most suited to globular proteins. A twin-range group-based cut-off scheme was used for both Coulomb and Lennard-Jones interactions. Interactions within 0.9 nm were calculated every step, and long-range forces out to 1.4 nm were updated every 5 time steps during neighborlist generation.
- Lys/PME: To assess the performance of the lattice summation code, the lysozyme system was also run with the cut-off for Coulomb interactions replaced by the smooth Particle Mesh Ewald algorithm. The full direct and reciprocal space parts were calculated every step, and a lattice spacing of 0.12 nm was used.
- DPPC: A phospholipid membrane consisting of 1024 dipalmitoylphosphatidylcholine (DPPC) lipids in a bilayer configuration with 23 water molecules per lipid, for a total of 121,856 atoms. It was simulated with a twin-range group-based cut-off of 1.8 nm for electrostatics and 1.0 nm for Van der Waals interactions. The long-range Coulomb forces between 1.0 nm and 1.8 nm were updated every tenth integration step during neighborlist generation. The force field described by Berger et al (1997) was used for the lipids, while the water was simulated with the SPC model.
- Poly-CH2: A 6000-unit polyethylene molecule modeled with anisotropic united atoms (see Toxvaerd, 1990). This means the bonded forces act on the position of the carbon, while the Lennard-Jones interaction site is a united atom displaced to the center of the CH2 group. The force field of Boyd and coworkers was used (Pant et al, 1993), with flexible bonds and a 1 fs time step. A 0.9 nm cut-off was applied to the Lennard-Jones interactions, together with dispersion corrections. Although the anisotropic interaction sites increase the particle number to 12,000, they are easily implemented as dummy atoms, keeping the computational cost very close to that of a 6000-particle system.
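The quoted system sizes can be cross-checked with simple arithmetic. The sketch below does this for the DPPC membrane and the Villin water box. It assumes three sites per water molecule (SPC and TIP3P are both three-site models) and 50 sites per united-atom DPPC lipid. The latter number is our assumption and is not stated in the text above; the function names are likewise illustrative.

```python
# Cross-check of the quoted atom counts.
SITES_PER_WATER = 3        # SPC and TIP3P are three-site water models
SITES_PER_DPPC_LIPID = 50  # ASSUMPTION: united-atom DPPC, not stated in the text


def water_atoms(n_waters, sites_per_water=SITES_PER_WATER):
    """Number of atoms contributed by n_waters water molecules."""
    return n_waters * sites_per_water


def membrane_atoms(n_lipids, waters_per_lipid,
                   sites_per_lipid=SITES_PER_DPPC_LIPID):
    """Total atoms in a hydrated bilayer: lipid sites plus water sites."""
    n_waters = n_lipids * waters_per_lipid
    return n_lipids * sites_per_lipid + water_atoms(n_waters)


if __name__ == "__main__":
    # DPPC: 1024 lipids with 23 waters each.
    print("DPPC total atoms:", membrane_atoms(1024, 23))
    # Villin: 3000 waters, leaving the remainder of the ~10,000 atoms
    # for the peptide itself.
    print("Villin water atoms:", water_atoms(3000))
```

Under these assumptions the DPPC count reproduces the 121,856 atoms quoted above, and the 9000 water atoms in the Villin system are consistent with a total of slightly less than 10,000.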
The table lists the results for a few common architectures. The Average column gives the results in percent relative to the first entry, which is a reasonable indicator of overall machine performance. The Rate column gives the results normalized to the clock speed of the first entry, and per processor. This is useful for seeing how efficient a particular platform is when the clock speed increases or multiple CPUs are used, but it is not meaningful to compare the rate of different architectures. N is the number of processor cores. Older processors usually have one core per processor, whereas newer chips (e.g. Opteron-DC and similar chips from Intel or IBM) may have more than one. If you read N = 4 for Opteron-DC, it means two chips with two cores each.
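As a rough illustration of how the Average and Rate columns relate to the raw numbers, here is a small sketch. The formulas are inferred from the values in the table rather than taken from the original benchmark scripts: we assume Average is the arithmetic mean of the five per-benchmark ratios to the first entry (800 MHz Athlon), and Rate is that average scaled by the clock ratio and divided by N. Because the table entries are rounded, a recomputed Average can differ from the published one by about one unit.

```python
# Sketch of the Average and Rate columns (formulas inferred from the table).
REF_CLOCK_MHZ = 800
# Reference entry: Villin, Lys/Cut, Lys/PME, DPPC, Poly-CH2 in ps/day.
REF_PERF = (2412, 622, 456, 41, 1001)


def average_percent(perf, ref_perf=REF_PERF):
    """Mean of the per-benchmark ratios to the reference, in percent."""
    ratios = [p / r for p, r in zip(perf, ref_perf)]
    return 100.0 * sum(ratios) / len(ratios)


def rate(average_pct, clock_mhz, n_cores, ref_clock_mhz=REF_CLOCK_MHZ):
    """Average normalized to the reference clock speed and per core."""
    return (average_pct / 100.0) * (ref_clock_mhz / clock_mhz) / n_cores


if __name__ == "__main__":
    # Single 1533 MHz Athlon from the table.
    athlon_1533 = (5526, 1432, 930, 82, 2579)
    avg = average_percent(athlon_1533)
    print(f"Average: {avg:.1f} %")
    print(f"Rate:    {rate(avg, 1533, 1):.2f}")
```

For the single 1533 MHz Athlon this gives an average close to the published 224 and a rate close to 1.17.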
| Machine | CPU/Core type | N | Compiler | Clock (MHz) | Cache (kB) | Villin | Lys/Cut | Lys/PME | DPPC | Poly-CH2 | Average | Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Linux | Athlon | 1 | gcc | 800 | 512 | 2412 | 622 | 456 | 41 | 1001 | 100 | 1.00 |
| Linux | Athlon | 1 | gcc | 1533 | 256 | 5526 | 1432 | 930 | 82 | 2579 | 224 | 1.17 |
| Linux | Athlon | 2 | gcc | 1533 | 256 | 9869 | 2923 | 1721 | 223 | 3838 | 436 | 1.14 |
| Linux | Duron | 1 | gcc | 1800 | 64 | 5399 | 929 | 725 | 66 | 2470 | 187 | 0.84 |
| Linux | Athlon 64 | 1 | Intel 8 | 2200 | 512 | 9446 | 2592 | 1641 | 190 | 4092 | 408 | 1.48 |
| Linux | Opteron | 1 | Intel 8 | 2000 | 1024 | 9192 | 2568 | 1622 | 183 | 3696 | 393 | 1.57 |
| Linux | Opteron | 2 | Intel 8 | 2000 | 1024 | 18180 | 5082 | 3085 | 414 | 6856 | 788 | 1.58 |
| Linux | Opteron | 4 | Intel 8 | 2000 | 1024 | 26570 | 7848 | 4800 | 782 | 12000 | 1304 | 1.30 |
| Linux | Opteron-DC | 1 | gcc 3.4.3 | 2000 | 1024 | 9956 | 2880 | 1841 | 210 | 4630 | 450 | 1.80 |
| Linux | Opteron-DC | 2 | gcc 3.4.3 | 2000 | 1024 | 19196 | 5952 | 3600 | 466 | 8470 | 904 | 1.81 |
| Linux | Opteron-DC | 4 | gcc 3.4.3 | 2000 | 1024 | 31993 | 9331 | 6171 | 870 | 16000 | 1580 | 1.58 |
| Linux | Opteron-DC | 4 | pathscale | 2000 | 1024 | 31993 | 9590 | 6171 | 880 | 16000 | 1593 | 1.59 |
| Linux | Pentium 3 | 1 | gcc | 800 | 256 | 2330 | 662 | 455 | 45 | 960 | 101 | 1.02 |
| Linux | Pentium 3 | 2 | gcc | 800 | 256 | 4080 | 1115 | 608 | 103 | 1385 | 174 | 0.87 |
| Linux | Pentium 3 | 1 | gcc | 1133 | 512 | 4041 | 1186 | 802 | 74 | 1872 | 180 | 1.27 |
| Linux | Celeron | 1 | gcc | 1400 | 256 | 3560 | 1075 | 673 | 64 | 1455 | 153 | 0.88 |
| Linux | Pentium 4 | 1 | gcc | 1700 | 256 | 5244 | 1375 | 849 | 84 | 2130 | 208 | 0.98 |
| Linux | Pentium 4 | 1 | gcc | 2000 | 256 | 6192 | 1494 | 938 | 92 | 2518 | 235 | 0.94 |
| Linux | Pentium 4 | 1 | Intel 8 | 3000 | 1024 | 10176 | 2280 | 1353 | 164 | 3045 | 357 | 0.95 |
| Linux | Xeon P4 | 1 | gcc | 2200 | 512 | 7068 | 2023 | 1183 | 109 | 2766 | 283 | 1.03 |
| Linux | Xeon P4 | 2 | gcc | 2200 | 512 | 12336 | 3756 | 2057 | 286 | 4548 | 543 | 0.99 |
| Linux | Xeon P4 | 1 | gcc | 2800 | 512 | 8550 | 2513 | 1340 | 128 | 3370 | 340 | 0.97 |
| Linux | Xeon P4 | 2 | gcc | 2800 | 512 | 15425 | 4210 | 2271 | 358 | 5999 | 657 | 0.94 |
| Linux | Itanium | 1 | Intel 5 [1] | 800 | 4096 | 2542 | 1338 | 716 | 59 | 1097 | 146 | 1.46 |
| SGI O200 | R12000 | 1 | SGI | 270 | 2048 | 1129 | 326 | 311 | 27 | 910 | 64 | 1.92 |
| Sun | Ultrasparc IIi | 1 | SUN | 440 | 2048 | 1115 | 310 | 250 | 25 | 760 | 57 | 1.05 |
| Compaq ES40 | Alpha 21264 | 1 | Compaq | 667 | 8192 | 2102 | 652 | 571 | 48 | 1362 | 114 | 1.37 |
| Compaq ES45 | Alpha 21264 | 1 | Compaq | 1000 | 8192 | 4376 | 1239 | 1066 | 92 | 2716 | 222 | 1.78 |
| IBM SP2 | Power 3 | 1 | IBM | 395 | 4096 | 2109 | 607 | 393 | 49 | 1163 | 101 | 2.05 |
| Apple G4 | PPC 7455 | 1 | gcc | 1250 | 2048 | 6279 | 1712 | 855 | 92 | 1881 | 227 | 1.45 |
| Apple G4 | PPC 7455 [2] | 2 | gcc | 1250 | 2048 | 9195 | 2784 | 1365 | 144 | 2699 | 349 | 1.12 |
| Apple G5 | PPC 970 | 1 | gcc | 1600 | 512 | 9443 | 2679 | 1260 | 123 | 3234 | 344 | 1.72 |
| Apple G5 | PPC 970 | 1 | IBM | 1600 | 512 | 10320 | 2696 | 1512 | 127 | 3584 | 372 | 1.86 |
| Apple G5 | PPC 970 | 1 | IBM | 2500 | 512 | 15309 | 4177 | 2213 | 175 | 5069 | 544 | 1.74 |
| Apple MacPro | Xeon Woodcrest | 4 | gcc | 2666 | 6000 | 48000 | 15026 | 8597 | 970 | 20571 | | |
| IBM JS20 | PPC 970 | 1 | IBM | 1600 | 512 | 10258 | 2709 | 1533 | 126 | 3512 | 371 | 1.86 |
| IBM JS20 | PPC 970 | 2 | IBM | 1600 | 512 | 17802 | 4950 | 2921 | 374 | 6005 | 737 | 1.84 |
Notes:
[1] The Itanium does not have assembly inner loops; instead we used Fortran inner loops with version 5.0.1 of the Intel C and Fortran compilers. This requires some tweaking and is not enabled by default, since the Intel compiler linking stage is quite buggy.
[2] Shared-memory communication did not work on OS X 10.2, but probably does now. Scaling should be much better on G5 machines.
It isn't entirely fair to compare all architectures against each other directly from the table; some of the machines are older and some are newer, but it will give you a general idea of the performance.
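For the multi-processor rows, one way to read the numbers is in terms of parallel speedup and efficiency. The sketch below uses the usual textbook definitions (speedup is the N-core performance divided by the single-core performance, and efficiency is the speedup divided by N). These definitions and the function names are our own illustration, applied here to the Villin results for the Opteron-DC with gcc 3.4.3.

```python
# Parallel speedup and efficiency from the ps/day figures in the table.
def speedup(perf_n, perf_1):
    """Ratio of N-core performance to single-core performance."""
    return perf_n / perf_1


def parallel_efficiency(perf_n, perf_1, n_cores):
    """Speedup divided by the number of cores (1.0 means ideal scaling)."""
    return speedup(perf_n, perf_1) / n_cores


if __name__ == "__main__":
    # Villin, Opteron-DC 2000 MHz, gcc 3.4.3: cores -> ps/day
    villin_opteron_dc = {1: 9956, 2: 19196, 4: 31993}
    base = villin_opteron_dc[1]
    for n, perf in sorted(villin_opteron_dc.items()):
        s = speedup(perf, base)
        e = parallel_efficiency(perf, base, n)
        print(f"N={n}: speedup {s:.2f}, efficiency {e:.2f}")
```

For this small system the efficiency is close to ideal on two cores and drops to roughly 80 percent on four.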
Which machine do you recommend?
Opteron, G5, or Xeon. Note that we have found and worked around a bug in Athlon CPUs; upgrade to GROMACS 3.2.1 or later if you experience strange crashes on AMD hardware.
GROMACS runs x86 CPUs hotter than any other program we know of, including dedicated testing programs like cpu-burnin. Download the CPU stress test program from the contributions page and test it yourself with the LM-sensors package :-)