GROMACS: Fast, Free and Flexible MD
 
 
 
Single-processor Benchmarks
Friday, 09 September 2005

To present an overview of GROMACS molecular dynamics performance and to compare the simulation speed attainable on some of the most common hardware platforms, we have constructed a benchmark set of a few typical systems. The benchmarks all represent "real-life" examples, i.e. they have been taken from ongoing research projects in our labs or from published articles.

The performance in the table below is quoted as picoseconds of simulation time per day, i.e. higher is better. For practical reasons we have also included the results when running on two processors with a dual Pentium III; more extensive parallel results are available on the scaling benchmark page.
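To get a feel for what a figure in picoseconds per day means in practice, it can be converted into integration steps per second, or into the wall-clock time needed for a trajectory of a given length. The sketch below is only an illustration: the function names are ours, and the example takes the Villin result of the first table entry (2412 ps/day) together with the 2 fs time step of that benchmark.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def steps_per_second(ps_per_day, timestep_fs):
    """Convert a throughput in ps/day into integration steps per second."""
    steps_per_day = ps_per_day * 1000.0 / timestep_fs  # 1 ps = 1000 fs
    return steps_per_day / SECONDS_PER_DAY


def days_for_simulation(target_ns, ps_per_day):
    """Wall-clock days needed to simulate target_ns nanoseconds."""
    return target_ns * 1000.0 / ps_per_day  # 1 ns = 1000 ps


# Villin on the 800 MHz Athlon: 2412 ps/day with a 2 fs time step
print(f"{steps_per_second(2412, 2):.1f} steps/s")                    # 14.0 steps/s
print(f"{days_for_simulation(1000, 2412):.1f} days for 1 microsecond")  # 414.6 days
```

Seen this way, a microsecond of Villin on the reference machine would take more than a year of wall-clock time, which is why the differences between platforms matter.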

The benchmark systems are:

  • Villin: The Villin headpiece, a 35-residue peptide whose structure has been determined by NMR (McKnight et al, 1997). This was used in a microsecond peptide folding simulation by Duan and Kollman (1998), and we chose an essentially identical setup for this benchmark. The system was simulated with 3000 water molecules in a truncated octahedron unit cell (slightly fewer than 10,000 atoms in total), using a group-based cut-off of 0.8 nm for both electrostatic and Lennard-Jones interactions. A timestep of 2 fs was employed; LINCS was used to constrain protein bonds involving hydrogens, while SETTLE was used to maintain the water geometry. Neighborlists were used and updated every 10 fs. Berendsen temperature coupling was applied with a time constant of 0.1 ps, and pressure coupling with a time constant of 20 ps. For the peptide, the GROMOS96 force field was used, while the water was described with the TIP3P model.

  • Lys/Cut: This is the larger lysozyme protein (PDB entry 2LZM), simulated using the GROMOS96 force field in SPC water. The total number of atoms was 23,207. All hydrogens on the protein were treated as dummy particles to remove the fast bond and angle vibrations, allowing a time step of 4 fs (see the dummy section in the GROMACS manual). A rhombic dodecahedron simulation box was used, which is the most spherical alternative and hence the best suited to globular proteins. A twin-range group-based cut-off scheme was used for both Coulomb and Lennard-Jones interactions. Interactions within 0.9 nm were calculated every step, and long-range forces out to 1.4 nm were updated every 5 time steps during neighborlist generation.

  • Lys/PME: To assess the performance of the lattice summation code, the lysozyme system was also run with the cut-off for Coulomb interactions replaced by the smooth Particle Mesh Ewald algorithm. The full direct and reciprocal space parts were calculated every step, and a lattice spacing of 0.12 nm was used.

  • DPPC: A phospholipid membrane, consisting of 1024 dipalmitoylphosphatidylcholine (DPPC) lipids in a bilayer configuration with 23 water molecules per lipid, for a total of 121,856 atoms. It was simulated with a twin-range group-based cut-off of 1.8 nm for electrostatics and 1.0 nm for Van der Waals interactions. The long-range Coulomb forces between 1.0 nm and 1.8 nm were updated every tenth integration step during neighborlist generation. The force field described by Berger et al (1997) was used for the lipids, while the water was simulated with the SPC model.

  • Poly-CH2: A 6000-unit polyethylene molecule modeled with anisotropic united atoms (see Toxvaerd, 1990). This means the bonded forces act on the position of the carbon, while the Lennard-Jones interaction site is a united atom displaced to the center of the CH2 group. The force field of Boyd and coworkers was used (Pant et al, 1993), with flexible bonds and a 1 fs timestep. A 0.9 nm cut-off was applied to the Lennard-Jones interactions, and dispersion corrections were applied. Although the anisotropic interaction sites increase the particle number to 12,000, they are easily implemented as dummy atoms, keeping the computational cost very close to that of a 6000-particle system.
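The system sizes and update intervals quoted for the benchmarks can be cross-checked with a little arithmetic. The sketch below assumes three sites per water molecule (true for both SPC and TIP3P) and 50 united atoms per DPPC lipid. The latter number is not stated in the text and is our assumption, although it does reproduce the quoted total of 121,856 atoms. All variable and function names are ours.

```python
SITES_PER_WATER = 3   # SPC and TIP3P are three-site water models
ATOMS_PER_DPPC = 50   # assumption: united-atom DPPC, not stated in the text


def steps_between_updates(update_interval_fs, timestep_fs):
    """Number of integration steps between two neighborlist updates."""
    return update_interval_fs // timestep_fs


# Villin: 3000 waters; neighborlist updated every 10 fs with a 2 fs time step
villin_water_atoms = 3000 * SITES_PER_WATER
villin_update_steps = steps_between_updates(10, 2)

# DPPC: 1024 lipids with 23 water molecules per lipid
dppc_atoms = 1024 * ATOMS_PER_DPPC + 1024 * 23 * SITES_PER_WATER

# Poly-CH2: 6000 carbons plus one displaced interaction site per CH2 unit
poly_particles = 6000 + 6000

print(villin_water_atoms)   # 9000, leaving fewer than 1000 atoms for the peptide
print(villin_update_steps)  # 5
print(dppc_atoms)           # 121856
print(poly_particles)       # 12000
```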


The results for a few common architectures are given in the table below. The Average column gives the results in percent relative to the first entry, which is a reasonable indicator of overall machine performance. The Rate column gives the results normalized to the clock speed of the first entry, and per processor. This is useful for seeing how efficient a particular platform is when the clock speed increases or multiple CPUs are used, but it is not meaningful to compare the rate of different architectures. N is the number of processor cores. Older processors usually have one core per processor, while newer chips (e.g. Opteron-DC and similar chips from Intel or IBM) may have more than one. An entry of N = 4 for Opteron-DC therefore means two chips with two cores each.

Machine        CPU/Core type    N  Compiler      Clock  Cache  Villin  Lys/Cut  Lys/PME  DPPC  Poly-CH2  Average  Rate
                                                 (MHz)   (kB)
Linux          Athlon           1  gcc             800    512    2412      622      456    41      1001      100  1.00
Linux          Athlon           1  gcc            1533    256    5526     1432      930    82      2579      224  1.17
Linux          Athlon           2  gcc            1533    256    9869     2923     1721   223      3838      436  1.14
Linux          Duron            1  gcc            1800     64    5399      929      725    66      2470      187  0.84
Linux          Athlon 64        1  Intel 8        2200    512    9446     2592     1641   190      4092      408  1.48
Linux          Opteron          1  Intel 8        2000   1024    9192     2568     1622   183      3696      393  1.57
Linux          Opteron          2  Intel 8        2000   1024   18180     5082     3085   414      6856      788  1.58
Linux          Opteron          4  Intel 8        2000   1024   26570     7848     4800   782     12000     1304  1.30
Linux          Opteron-DC       1  gcc 3.4.3      2000   1024    9956     2880     1841   210      4630      450  1.80
Linux          Opteron-DC       2  gcc 3.4.3      2000   1024   19196     5952     3600   466      8470      904  1.81
Linux          Opteron-DC       4  gcc 3.4.3      2000   1024   31993     9331     6171   870     16000     1580  1.58
Linux          Opteron-DC       4  pathscale      2000   1024   31993     9590     6171   880     16000     1593  1.59
Linux          Pentium 3        1  gcc             800    256    2330      662      455    45       960      101  1.02
Linux          Pentium 3        2  gcc             800    256    4080     1115      608   103      1385      174  0.87
Linux          Pentium 3        1  gcc            1133    512    4041     1186      802    74      1872      180  1.27
Linux          Celeron          1  gcc            1400    256    3560     1075      673    64      1455      153  0.88
Linux          Pentium 4        1  gcc            1700    256    5244     1375      849    84      2130      208  0.98
Linux          Pentium 4        1  gcc            2000    256    6192     1494      938    92      2518      235  0.94
Linux          Pentium 4        1  Intel 8        3000   1024   10176     2280     1353   164      3045      357  0.95
Linux          Xeon P4          1  gcc            2200    512    7068     2023     1183   109      2766      283  1.03
Linux          Xeon P4          2  gcc            2200    512   12336     3756     2057   286      4548      543  0.99
Linux          Xeon P4          1  gcc            2800    512    8550     2513     1340   128      3370      340  0.97
Linux          Xeon P4          2  gcc            2800    512   15425     4210     2271   358      5999      657  0.94
Linux          Itanium          1  Intel 5 [1]     800   4096    2542     1338      716    59      1097      146  1.46
SGI O200       R12000           1  SGI             270   2048    1129      326      311    27       910       64  1.92
Sun            Ultrasparc IIi   1  SUN             440   2048    1115      310      250    25       760       57  1.05
Compaq ES40    Alpha 21264      1  Compaq          667   8192    2102      652      571    48      1362      114  1.37
Compaq ES45    Alpha 21264      1  Compaq         1000   8192    4376     1239     1066    92      2716      222  1.78
IBM SP2        Power 3          1  IBM             395   4096    2109      607      393    49      1163      101  2.05
Apple G4       PPC 7455         1  gcc            1250   2048    6279     1712      855    92      1881      227  1.45
Apple G4       PPC 7455 [2]     2  gcc            1250   2048    9195     2784     1365   144      2699      349  1.12
Apple G5       PPC 970          1  gcc            1600    512    9443     2679     1260   123      3234      344  1.72
Apple G5       PPC 970          1  IBM            1600    512   10320     2696     1512   127      3584      372  1.86
Apple G5       PPC 970          1  IBM            2500    512   15309     4177     2213   175      5069      544  1.74
Apple MacPro   Xeon Woodcrest   4  gcc            2666   6000   48000    15026     8597   970     20571        -     -
IBM JS20       PPC 970          1  IBM            1600    512   10258     2709     1533   126      3512      371  1.86
IBM JS20       PPC 970          2  IBM            1600    512   17802     4950     2921   374      6005      737  1.84
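The Average and Rate columns can be reproduced from the raw benchmark numbers. The text does not spell out the exact formula, so the sketch below rests on an assumption: Average is taken as the arithmetic mean of the five per-benchmark ratios against the first (800 MHz Athlon) entry, expressed in percent, and Rate as that ratio scaled by the clock speed of the first entry and divided by the number of cores. The function names are ours.

```python
# First table entry (Linux, Athlon, 800 MHz), used as the reference machine.
# Benchmark order: Villin, Lys/Cut, Lys/PME, DPPC, Poly-CH2 (all in ps/day).
REFERENCE = {"clock_mhz": 800, "ps_per_day": (2412, 622, 456, 41, 1001)}


def average_percent(ps_per_day, reference=REFERENCE):
    """Mean performance relative to the reference machine, in percent.

    Assumption: arithmetic mean of the per-benchmark ratios.
    """
    ratios = [value / ref for value, ref in zip(ps_per_day, reference["ps_per_day"])]
    return 100.0 * sum(ratios) / len(ratios)


def rate(average, clock_mhz, n_cores, reference=REFERENCE):
    """Relative performance normalized to the reference clock speed, per core."""
    return (average / 100.0) * (reference["clock_mhz"] / clock_mhz) / n_cores


# Second table entry: single Athlon at 1533 MHz
athlon_1533 = (5526, 1432, 930, 82, 2579)
avg = average_percent(athlon_1533)
print(f"Average: {avg:.0f}")              # Average: 224
print(f"Rate: {rate(avg, 1533, 1):.2f}")  # Rate: 1.17
```

For this row the assumed formula matches the table. For some other rows it lands within one unit of the listed Average, which may simply reflect a different rounding convention in the original table.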

Notes:

[1] There are no assembly inner loops for the Itanium, so we used the Fortran inner loops instead. This requires some tweaking and is not enabled by default, since the linking stage of the Intel compiler is quite buggy. Version 5.0.1 of the Intel C and Fortran compilers was used.

[2] Shared-memory communication did not work on OS X 10.2, but probably does now. Scaling should be much better on G5 machines.
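Scaling, as mentioned in note [2], can be quantified as parallel efficiency: the speedup over the single-processor run divided by the number of processors. A minimal sketch follows, using the Average column as the performance measure (our choice; any single benchmark column would work just as well) and a hypothetical helper name.

```python
def parallel_efficiency(perf_n, perf_1, n_cores):
    """Speedup relative to one core, divided by the number of cores."""
    speedup = perf_n / perf_1
    return speedup / n_cores


# Average column values taken from the table
print(f"Dual G4:      {parallel_efficiency(349, 227, 2):.2f}")   # 0.77
print(f"Dual Opteron: {parallel_efficiency(788, 393, 2):.2f}")   # 1.00
print(f"Quad Opteron: {parallel_efficiency(1304, 393, 4):.2f}")  # 0.83
```

An efficiency close to 1 means the extra processors are used almost perfectly; the dual G4 figure illustrates the communication problem described in the note.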

It isn't entirely fair to compare all architectures against each other directly from the table; some of the machines are older and some are newer, but it will give you a general idea of the performance.


Which machine do you recommend?

Opteron, G5, or Xeon. Note that we have found and worked around a bug in Athlon CPUs; upgrade to GROMACS 3.2.1 or later if you experience strange crashes on AMD hardware.

GROMACS runs x86 CPUs hotter than any other program we know of, including dedicated testing programs like cpu-burnin. Download the CPU stress test program from the contributions page and test it yourself using the LM-sensors package :-)
