Highly Scalable Natural Language Parsing with High-Performance and Grid Computing Methods
The Java Speech Toolkit (JSTK, http://code.google.com/p/jstk) is developed and maintained by the Speech Group at the University of Erlangen-Nuremberg. It is designed to provide both an API and stand-alone applications for the most popular speech processing tasks, such as speech recognition, speaker verification, manual speech transcription and annotation, and the evaluation of human rater tasks.
The JSTK is implemented in Java for several reasons, among them high portability, ``developer-friendly'' features such as meaningful compile-time and run-time error messages, and the high reproducibility of numeric computations guaranteed by the VM implementations. However, using Java comes at the cost of lower performance for numerical computations: the VMs, in their current implementations, do not exploit hardware acceleration such as SIMD instructions (e.g., SSE or AVX), which can greatly increase the performance of the computations. In a previous project, parts of the JSTK training and evaluation algorithms were parallelized, leading to an approximately linear decrease in computation time with an increasing number of cores/threads. In this project, the numerical core components are implemented in native code that uses both vectorization of simple operations and specialized implementations of the logarithm and exponential functions to speed up the computations; the methods are implemented in C++, built on the target machine, and called from Java via the Java Native Interface (JNI).
Gaussian mixture models (GMMs, weighted sums of Gaussian probability density functions) are the centerpiece of most statistical speech processing systems, such as speech, speaker, or speaker state recognition. In a typical scenario, the system faces millions of samples during training and several thousand for each test utterance (one second of speech typically yields about 100 samples), with the dimension of the sample vectors ranging between 40 and 90. The main computational effort is thus computing the argument of the exponential function (subtract the mean from the data point, multiply by the inverse covariance matrix) and evaluating the exponential itself.
In total, the number of samples processed per second could be improved by a factor of about 10 for diagonal covariance matrices, and about 2.5 for full covariance matrices, by applying the following changes to the code: removal of divisions, compiler-friendly indexing, vectorization, loop unrolling, and the restriction to single precision. Further minor improvements could be achieved by replacing the standard implementations of the exponential and logarithm functions with highly optimized versions such as the open-source fmath library or the Intel MKL.
The three main take-home messages were that the basic optimization rules also apply to Java programming, that special compiler directives can be used to mark sections and loops that should be vectorized or unrolled, and that with JNI, the invocation overhead needs to be taken into account.
- KONWIHR funding: two months during Multicore-Software-Initiative 2012
- Korbinian Riedhammer, LS Informatik 5 (Mustererkennung), FAU