KONWIHR Project: Stencils on staggered hierarchical meshes


Implementation and optimization of stencil operations on staggered hierarchical meshes

Project summary

We optimized and parallelized a framework that compiles stencil operations, defined by abstract operators, into code that performs the corresponding stencil update. As a result, we can now formulate solvers for a wide range of application problems (flow simulation, image analysis, ...) in an abstract way and then solve these problems efficiently on structured meshes.

In collaboration with the High Performance Computing research group of Prof. Dr. Gerhard Wellein, we improved our code and data structures in several important ways:

  • Instead of keeping the data for each component (velocity, pressure, ...) in a separate array, we store all data of a vector-valued function in a single array, which considerably reduces the number of address registers needed. The index order in this large array is (Z,Y,U,X), where U denotes the component index and X, Y, Z denote the space indices. X is the unit-stride dimension, which allows SIMD vectorization in the X-direction.
  • The Y-loop is blocked according to the cache size so that the so-called 3-level condition is fulfilled for the (Y,U,X)-hyperplanes traversed by the outer Z-loop. This ensures that, during one stencil update, the data has to be loaded from main memory into the cache only once.
  • For parallelization, we established a thread pool that is used both for initializing the data fields and for working on them, which ensures that the threads access our data in a NUMA-friendly way.
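
The three points above can be illustrated with a minimal Python sketch. It is not the framework's generated code. All function names are hypothetical, the cache criterion in max_y_block is only one plausible reading of the 3-level condition (three Y-blocked hyperplanes of 8-byte values fit into the cache), and the choice of splitting the index range into contiguous slabs per thread is an assumption.

```python
def flat_index(z, y, u, x, ny, nu, nx):
    """Offset of component u at grid point (x, y, z) in a (Z,Y,U,X)-ordered array.

    X is the fastest-running index, so neighbours in X are adjacent in memory.
    """
    return ((z * ny + y) * nu + u) * nx + x


def max_y_block(cache_bytes, nu, nx, layers=3, bytes_per_value=8):
    """Largest Y-block for which `layers` blocked (Y,U,X)-planes fit in the cache.

    Assumption: this simple capacity bound stands in for the 3-level condition.
    """
    return max(1, cache_bytes // (layers * nu * nx * bytes_per_value))


def blocked_traversal(nz, ny, nx, y_block):
    """Yield (z, y, x) with Y-blocks outermost, then the Z-sweep, then Y and X."""
    for y0 in range(0, ny, y_block):
        y1 = min(y0 + y_block, ny)
        for z in range(nz):              # sweep over hyperplanes
            for y in range(y0, y1):      # rows of the current Y-block
                for x in range(nx):      # unit-stride (vectorizable) loop
                    yield z, y, x


def static_slabs(n, workers):
    """Split range(n) into contiguous slabs, one per worker.

    Using the same slabs for initialization and for the stencil update keeps
    each worker on the data it touched first (assumed NUMA strategy).
    """
    base, extra = divmod(n, workers)
    slabs, start = [], 0
    for w in range(workers):
        stop = start + base + (1 if w < extra else 0)
        slabs.append(range(start, stop))
        start = stop
    return slabs
```

In this layout, moving one step in X changes the offset by 1, one step in U by nx, one step in Y by nu*nx, and one step in Z by ny*nu*nx, so all components of a row lie in one contiguous stretch of memory.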

The performance of the resulting code is close to the theoretical optimum estimated with the roofline model. For example, for the test case of defect computation for a Stokes operator on a staggered mesh in 3D, our code reaches about 250 MLUPS (million lattice updates per second) on one Lima node (Westmere, 12 x 2.66 GHz). This performance corresponds to a memory bandwidth of at least 32.7 GB/s, whereas the limit measured by the STREAM benchmark for this architecture lies at 40 GB/s.
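
The relation between these numbers can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above and assumes that 1 GB means 10^9 bytes and that the kernel is purely memory bound. The function names are hypothetical, and the derived values are back-of-the-envelope estimates, not measurements.

```python
def bytes_per_update(bandwidth_gb_s, mlups):
    """Memory traffic per lattice update implied by a bandwidth and an update rate."""
    return bandwidth_gb_s * 1e9 / (mlups * 1e6)


def bandwidth_fraction(bandwidth_gb_s, stream_gb_s):
    """Share of the STREAM bandwidth that the kernel actually uses."""
    return bandwidth_gb_s / stream_gb_s


def bandwidth_bound_mlups(stream_gb_s, traffic_bytes):
    """Update rate (in MLUPS) if the kernel ran at the full STREAM bandwidth."""
    return stream_gb_s * 1e9 / traffic_bytes / 1e6


traffic = bytes_per_update(32.7, 250.0)         # about 130.8 bytes per update
share = bandwidth_fraction(32.7, 40.0)          # about 0.82 of the STREAM limit
ceiling = bandwidth_bound_mlups(40.0, traffic)  # about 306 MLUPS
```

Under these assumptions, the measured 250 MLUPS corresponds to roughly 82 % of the bandwidth-limited ceiling of about 306 MLUPS.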

KONWIHR funding

  • KONWIHR funding: two months during Multicore-Software-Initiative 2013/2014


  • Prof. Dr. E. Bänsch, Priv.-Doz. Dr. N. Neuß, Angewandte Mathematik 3, FAU