Towards Peak Performance of HF and DFT Codes on x86-64 Computer Architectures
Writing efficient HF (Hartree-Fock) and DFT (Density Functional Theory) programs for current computer architectures such as the established x86-64 has always been a challenge in quantum chemistry. In this project, efforts were made to improve the performance of the TURBOMOLE integral code with assembly-level techniques and AVX instructions. A number of standard benchmark molecules, such as taxol and large SiO2 clusters, were used during the optimization process.
Conclusions and outlook:
- The compiler does not generate SIMD code on Intel Sandy Bridge, which is no surprise. Even straightforward linear algebra codes often need hand-optimized libraries (like MKL). Optimization was, and still is, in the hands of skilled programmers.
- In the handwritten SIMD version, the integral kernel code is dominated by the execution speed of the square root (sqrt) and inverse (inv) operations. These operations should therefore be avoided where possible to reach peak performance.
- As a proof of concept, the code sped up the integrals by a factor of 14 compared to the compiler-generated code, using SIMD vectorization among other techniques for single-CPU optimization.
- There are other limits that will become more important in the future than the hunt for speed through SIMD vectorization or for memory savings: the simple all-double-precision approach limits further performance gains. The codes reach the limit of double precision, and larger molecules cannot be calculated because the energy is limited to 12 decimal digits. A serious discussion of precision is required in the future.
- There is much more optimization to be done. For a speedup of the total code, DIIS, I/O and diagonalization have further potential for improvement.
- A much faster implementation is within reach, but only in combination with a state-of-the-art discussion of precision.
- KONWIHR funding: two months during Multicore-Software-Initiative 2012/2013
- Thorsten Wölfle, Theoretische Chemie, Uni-Erlangen
- Prof. Görling, Theoretische Chemie, Uni-Erlangen
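
The double-precision limit named in the conclusions can be illustrated with a small sketch. It is not taken from the TURBOMOLE code: the numbers `large`, `tiny` and `n` are hypothetical and only chosen so that the effect is visible. The sketch shows one way digits can be lost, namely when many small contributions are accumulated onto a large running total in plain double precision. `math.fsum` from the Python standard library serves as the correctly rounded reference.

```python
import math

def naive_sum(values):
    """Accumulate left to right in plain double precision."""
    total = 0.0
    for v in values:
        total += v
    return total

# Hypothetical numbers, chosen only for illustration:
large = 1.0e4       # stands in for a large total value (arbitrary units)
tiny = 1.0e-13      # stands in for one small contribution
n = 1_000_000       # number of small contributions

values = [large] + [tiny] * n

naive = naive_sum(values)    # each tiny term is below half the spacing
                             # of doubles near 1e4, so it is rounded away
exact = math.fsum(values)    # correctly rounded sum of the same values

print(f"naive sum : {naive:.12f}")
print(f"fsum      : {exact:.12f}")
print(f"lost      : {exact - naive:.3e}")
```

Doubles near 1e4 are spaced about 1.8e-12 apart, so every single contribution of 1e-13 vanishes in the naive loop even though all contributions together amount to about 1e-7. How closely this mirrors the actual limitation in the integral code is an assumption; the sketch only shows why an all-double-precision approach has a hard floor once the total becomes large.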