UTBEST3D: numerical solver for regional and coastal ocean flow
We develop UTBEST3D, a numerical solver for regional and coastal ocean flow based on the 3D shallow water equations. The model allows for simulations with different levels of physical accuracy (and computational complexity), ranging from barotropic quasi-2D representations to fully baroclinic setups with turbulence closure, resulting in varying numbers of equations and unknowns. Space discretization is carried out on a vertically aligned prismatic mesh using the local discontinuous Galerkin finite element method and solutions are advanced in time with a semi-implicit method based on TVB (total variation bounded) Runge-Kutta methods. UTBEST3D is written in C++ using an object-oriented design and parallelized with MPI and OpenMP.
Hardware counter measurements of the program using LIKWID showed an insufficient exploitation of current computer architectures: on Intel processors with Ivy Bridge architecture less than 50% of the available memory bandwidth was used; the share of floating point operations in the total instruction count was comparatively small, indicating unnecessary overhead; and, almost no operations were vectorized by the compiler.
A major part of the computation time is spent in assembly routines, consisting of a loop over all mesh entities of a type (prism, horizontal face, vertical face) and all quadrature points on each entity. Inside, variables are evaluated (short scalar products), combined and the resulting values are used to update the right hand side vectors, requiring small matrix-vector- and axpy-operations. This excessive amount of short loops and the nested nature of their implementation are liable for the large number of branching and overhead instructions.
Consequently, the primary optimization concerns aimed at reducing this instruction count. Three approaches were used:
- Dissolving the encapsulated nature of the routines by combining everything into a long loop.
- Replacing custom iterators by direct memory access.
- Creating static code variants to convert loop lengths into compile time constants.
The last point was carried out using templating techniques. To maintain the flexible structure of the code (e.g., support different approximation orders for variables) template meta programming techniques were employed: a compile time linked list data structure providing the varying numbers of variables and local degrees of freedom.
Conclusion and outlook
After implementing the described optimizations, the total instruction count of a representative kernel dropped by more than 30%. The initial share of more than 20% of branching and other instructions was reduced to less than 6%, and the walltime of the kernel was reduced by almost 50%. Unfortunately, we were not able to exploit auto vectorization features of the compiler in this kernel. All efforts to leverage SIMD were either not successful or did at best provide the same runtime.
We are in the process of applying the same transformations to the remaining integration kernels of the explicit time integration step and work on accomodating and exploiting the static information in other code parts as well. A future task is to conduct similar investigations for some of the diffusion schemes in the implicit time integration step.
Two months stay of Balthasar Reuter at RRZE during Multicore-Software-Initiative 2014.
- Balthasar Reuter, Applied Mathematics 1, FAU Erlangen-Nürnberg
- Dr. Vadym Aizinger, Applied Mathematics 1, FAU Erlangen-Nürnberg