The Verlet SIMD non-bonded kernels and search code calculate interactions for all particle pairs of an "i-cluster" of size 4 with a "j-cluster" of size N. There are two flavors of kernels:
- 4xN: SIMD width N, replicates every SIMD instruction 4 times, once for each i-particle
- 2x(N+N): SIMD width 2*N, puts 2 i-particles in one SIMD register, currently only used with single precision AVX-256 to run 4x4 cluster pairs on 8-wide SIMD
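To make the 4xN layout concrete, here is a minimal scalar sketch. It is not the actual kernel code: the function name, the separate x/y/z arrays and the choice N = 4 are assumptions made for illustration. The inner loop over j stands in for one SIMD instruction acting on N lanes, and the outer loop over i stands in for the 4-fold replication of that instruction.

```c
#include <assert.h>
#include <stddef.h>

#define EX_I_SIZE 4 /* i-cluster size, fixed at 4 as in the text */
#define EX_J_SIZE 4 /* j-cluster size N; 4 is an assumed example value */

/* Hypothetical helper: squared distances for all 4 x N particle pairs
 * of one cluster pair. rsq[i][j] holds the value for pair (i, j). */
void ex_cluster_pair_rsq(const float ix[EX_I_SIZE], const float iy[EX_I_SIZE],
                         const float iz[EX_I_SIZE],
                         const float jx[EX_J_SIZE], const float jy[EX_J_SIZE],
                         const float jz[EX_J_SIZE],
                         float rsq[EX_I_SIZE][EX_J_SIZE])
{
    assert(rsq != NULL);

    /* Outer loop: one copy of the "instruction stream" per i-particle. */
    for (int i = 0; i < EX_I_SIZE; i++)
    {
        /* Inner loop: what a single N-wide SIMD instruction would do. */
        for (int j = 0; j < EX_J_SIZE; j++)
        {
            float dx = ix[i] - jx[j];
            float dy = iy[i] - jy[j];
            float dz = iz[i] - jz[j];
            rsq[i][j] = dx*dx + dy*dy + dz*dz;
        }
    }
}
```

In a 2x(N+N) kernel the same 4 x N pairs would be covered by two passes instead of four, each pass handling two i-particles in one register of width 2*N.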
The kernels currently support reaction-field (and thus also plain cut-off) electrostatics, as well as Ewald/PME, where the correction is computed either through tables or analytically. By default the potential is shifted such that it is zero at the cut-off; this shift can be turned off in grompp.
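As an illustration of the shift, the sketch below evaluates a reaction-field pair potential in the commonly used form V(r) = qq (1/r + k_rf r^2 - c_rf), with c_rf chosen so that V is zero at the cut-off. The function names, the unit electric conversion factor and the relative permittivity of 1 inside the cut-off are assumptions for this example; it is not taken from the kernel source. Setting eps_rf to 1 gives k_rf = 0, which reduces to a plain shifted cut-off.

```c
#include <assert.h>

/* Reaction-field constant k_rf, assuming relative permittivity 1
 * inside the cut-off rc and eps_rf outside it. */
double ex_rf_k(double eps_rf, double rc)
{
    assert(rc > 0.0);
    return (eps_rf - 1.0) / ((2.0*eps_rf + 1.0) * rc*rc*rc);
}

/* Shifted reaction-field pair potential; qq is the charge product.
 * Units are chosen such that the electric conversion factor is 1. */
double ex_rf_potential_shifted(double qq, double r, double rc, double eps_rf)
{
    assert(r > 0.0 && rc > 0.0);

    if (r >= rc)
    {
        return 0.0; /* no interaction beyond the cut-off */
    }

    double k_rf = ex_rf_k(eps_rf, rc);
    double c_rf = 1.0/rc + k_rf*rc*rc; /* the shift: makes V(rc) = 0 */

    return qq * (1.0/r + k_rf*r*r - c_rf);
}
```

Dropping the c_rf term would correspond to the unshifted potential.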
The SIMD search and kernel code is nearly fully implemented through macros. Currently all combinations of x86 128-bit and 256-bit SIMD instructions (SSE and AVX) in single and double precision are implemented. Adding support for a new SIMD instruction set and/or a new architecture is nearly as simple as implementing a few macros.
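The macro approach can be sketched as follows. All names here (ex_simd_t, ex_load_pr and so on) are hypothetical and are not the macros defined in gmx_simd_macros.h. A plain C array stands in for a SIMD register so that the example compiles anywhere; on x86 with SSE, the same macros could instead map to intrinsics such as _mm_load_ps, _mm_mul_ps, _mm_add_ps and _mm_store_ps.

```c
#include <assert.h>

/* Hypothetical "instruction set": a 4-wide register emulated in plain C. */
#define EX_SIMD_WIDTH 4
typedef struct { float v[EX_SIMD_WIDTH]; } ex_simd_t;

static ex_simd_t ex_load(const float *p)
{
    ex_simd_t r;
    for (int k = 0; k < EX_SIMD_WIDTH; k++) { r.v[k] = p[k]; }
    return r;
}

static void ex_store(float *p, ex_simd_t a)
{
    for (int k = 0; k < EX_SIMD_WIDTH; k++) { p[k] = a.v[k]; }
}

static ex_simd_t ex_add(ex_simd_t a, ex_simd_t b)
{
    for (int k = 0; k < EX_SIMD_WIDTH; k++) { a.v[k] += b.v[k]; }
    return a;
}

static ex_simd_t ex_mul(ex_simd_t a, ex_simd_t b)
{
    for (int k = 0; k < EX_SIMD_WIDTH; k++) { a.v[k] *= b.v[k]; }
    return a;
}

/* The macro layer: the only part that changes per instruction set. */
#define ex_simd_pr        ex_simd_t
#define ex_load_pr(p)     ex_load(p)
#define ex_store_pr(p, a) ex_store(p, a)
#define ex_add_pr(a, b)   ex_add(a, b)
#define ex_mul_pr(a, b)   ex_mul(a, b)

/* Kernel-style code written purely in terms of the macros:
 * out[i] = a[i]*b[i] + c[i], for n a multiple of the SIMD width. */
void ex_mul_add_array(const float *a, const float *b, const float *c,
                      float *out, int n)
{
    assert(n % EX_SIMD_WIDTH == 0);

    for (int i = 0; i < n; i += EX_SIMD_WIDTH)
    {
        ex_simd_pr va = ex_load_pr(a + i);
        ex_simd_pr vb = ex_load_pr(b + i);
        ex_simd_pr vc = ex_load_pr(c + i);
        ex_store_pr(out + i, ex_add_pr(ex_mul_pr(va, vb), vc));
    }
}
```

Porting such code to another instruction set then only requires redefining the macro layer; ex_mul_add_array itself stays untouched.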
The SIMD/kernel combinations that will give reasonable performance (and that don't require much new code) are:
- 2-way SIMD: 4xN = 4x2 pairs
- 4-way SIMD: 4xN = 4x4 pairs
- 8-way SIMD: 4xN = 4x8 pairs or 2x(N+N) = 4x4 pairs
- 16-way SIMD: 2x(N+N) = 4x8 pairs
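The mapping in the list above follows directly from the two kernel flavors: a 4xN kernel uses a j-cluster as wide as the SIMD register, whereas a 2x(N+N) kernel uses a j-cluster of half the SIMD width. A small sketch of that arithmetic, with hypothetical type and function names:

```c
#include <assert.h>

/* Hypothetical enum naming the two kernel flavors from the text. */
typedef enum { EX_KERNEL_4XN, EX_KERNEL_2XNN } ex_kernel_flavor_t;

/* j-cluster size N for a given SIMD width (in elements) and flavor. */
int ex_j_cluster_size(int simd_width, ex_kernel_flavor_t flavor)
{
    assert(simd_width > 0);

    if (flavor == EX_KERNEL_2XNN)
    {
        /* Two i-particles share one register, so N is half the width. */
        assert(simd_width % 2 == 0);
        return simd_width / 2;
    }

    /* 4xN: one i-particle per register, so N equals the width. */
    return simd_width;
}

/* Number of particle pairs in one cluster pair; the i-cluster size is 4. */
int ex_pairs_per_cluster_pair(int simd_width, ex_kernel_flavor_t flavor)
{
    return 4 * ex_j_cluster_size(simd_width, flavor);
}
```

For example, ex_j_cluster_size(8, EX_KERNEL_2XNN) gives 4, matching the "2x(N+N) = 4x4 pairs" entry for 8-way SIMD.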
Implementing support for a new SIMD instruction set
Implementing support for a new SIMD instruction set for the Verlet kernels and search should be straightforward and relatively little work. Most of the work is probably in implementing the data shuffling macros in src/mdlib/nbnxn_kernels/nbnxn_kernel_simd_utils.h, but most of that can be skipped if you don't need all the combinations of functionality. Unaligned loads and stores are never used, except for one unaligned load in the double precision Ewald table lookup; this can be avoided by using the analytical correction instead (or an aligned table can be implemented without much effort).
- In include/types/nbnxn_verlet.h:
  - define GMX_NBNXN_SIMD
  - set GMX_NBNXN_SIMD_BITWIDTH to the desired SIMD width in bits
  - define GMX_NBNXN_SIMD_4XN and/or GMX_NBNXN_SIMD_2XNN to get the kernels you want
- In include/gmx_simd_macros.h:
  - define the 20 or so SIMD instruction macros; some might not even be required
- In src/mdlib/nbnxn_kernels/nbnxn_kernel_simd_???:
  - you need to edit, and possibly disable, the kernels for 4xN and 2xNN, depending on what you want
  - most of the kernel code is taken care of by include/gmx_simd_macros.h
  - the diagonal and exclusion mask implementation is usually instruction-set specific; implement these in ...outer.h and ...inner.h (the masking will soon be generalized, after which this will no longer be necessary)
- In src/mdlib/nbnxn_kernels/nbnxn_kernel_simd_utils.h:
  - implement some data shuffling macros for LJ parameter and Ewald table loading; if you don't want to bother with the Ewald tables, you can support only the analytical Ewald correction
- The analytical Ewald correction code currently does not use macros. It needs to be copied from one of the x86 math files in include, or you can skip it if you only need the Ewald table correction.
- In pick_nbnxn_kernel_cpu in src/mdlib/forcerec.c you can set whether 4xN or 2x(N+N) kernels, and Ewald table or analytical kernels, are used by default. If you implemented multiple kernel options, environment variables can switch between them at runtime.
- The pair search cluster pair pruning in src/mdlib/nbnxn_search.c is automatically accelerated through gmx_simd_macros.h
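As a rough sketch of the first step in the list above, the defines for a new instruction set might look like the fragment below. The guard EX_HAVE_NEW_SIMD_ISA, the 128-bit width and the helper function are assumptions made up for this example; only the GMX_NBNXN_* names come from the text, and the real header may organize these definitions differently.

```c
#include <assert.h>

/* Hypothetical: pretend the build system defined this for a new ISA. */
#define EX_HAVE_NEW_SIMD_ISA

#ifdef EX_HAVE_NEW_SIMD_ISA
#define GMX_NBNXN_SIMD                /* enable the SIMD kernels */
#define GMX_NBNXN_SIMD_BITWIDTH 128   /* assumed register width in bits */
#define GMX_NBNXN_SIMD_4XN            /* request the 4xN kernels */
/* #define GMX_NBNXN_SIMD_2XNN */     /* optionally also request 2x(N+N) */
#endif

/* Hypothetical helper: SIMD width in single precision elements. */
int ex_simd_width_single(void)
{
#ifdef GMX_NBNXN_SIMD
    assert(GMX_NBNXN_SIMD_BITWIDTH % (8 * (int)sizeof(float)) == 0);
    return GMX_NBNXN_SIMD_BITWIDTH / (8 * (int)sizeof(float));
#else
    return 1; /* no SIMD: plain scalar code */
#endif
}
```

With these settings and a 4-byte float, ex_simd_width_single() evaluates to 4, which corresponds to the "4-way SIMD: 4xN = 4x4 pairs" case.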