KONWIHR Project: Speech Recognition


A Parallel Implementation of the Acoustic Model Training for Automatic Speech Recognition


One of the main issues with automatic speech recognition (ASR) software is its performance. To some extent, this can be characterized as a trade-off between time and accuracy:

In the training phase, more data and model parameters will lead to a more accurate speech model at the cost of complexity (and compute time). In the recognition phase, faster processors and more memory allow a more detailed search for the best word hypothesis.

In general, ASR software heavily relies on statistical automata to model phonetic, lexical and language context, and Gaussian mixture models to characterize the features extracted from the acoustic signal. These models are described by a very large number of parameters, typically more than 500.000, that need to be estimated using machine learning techniques. The amount of speech data varies from 5-10 hours (about 500M-1G data at 16kHz, 16bit), for small vocabulary tasks (1.000-5.000 words), up to 500-1000 hours (50-100G) for large vocabulary (10.000-50.000 words) ASR systems.

Project summary

The existing serial model training algorithms were re-implemented to provide a scalable parallel algorithm. We implemented a governor that distributes the data as small packages of speech data and related models to each (available) node while minimizing the amount of model I/O necessary. Additionally, the governor keeps track of the iterative model changes and finalizes the model training at each iteration of the training process.

The training time of the ASR system could be reduced significantly. As the workload is distributed on demand, the speed-up was about linear with the number of nodes, with some limitations to I/O.

The current status of the project can be viewed at http://code.google.com/p/jstk/

KONWIHR funding

  • KONWIHR funding: two months during Multicore-Software-Initiative 2009/2010


  • Dipl.-Inf. Korbinian Riedhammer, Lehrstuhl f. Informatik 5 (Mustererkennung), Univ. Erlangen-Nuremberg

Short report on the project result: Run-time Improvements of Speech Processing Algorithms using Java and C++ via the Java Native Interface (JNI)

The Java Speech Tooklit (JSTK) is developed and maintained by the Speech Group at the University of Erlangen-Nuremberg. It is designed to provide both an API and stand-alone applications for the most popular speech processing tasks such as speech recognition, speaker verification, manual speech transcription and annotation, and evaluation of human rater tasks.

The JSTK is implemented in Java because of several reasons, among them high portability, ``developer-friendly'' features such as meaningful compile and run-time errors, and the high reproducibility of numeric computations due to the VM implementations. However, using Java comes at the expense of less performance for numerical computations: The VMs, in their current implementation, do not support hardware accelerations such as SIMD or AVX, which can greatly increase the performance of the computations. In a previous project, parts of the JSTK training and evaluation algorithms were parallelized, leading to an approximately linear decrease of the computation time with an increasing number of cores/threads. In this project, the numerical core components are implemented in native code that utilizes both vectorization of simple operations and specialized implementations of the logarithm and exponential function to speed up the computations; the methods are implemented in C++, built on the target machine, and called from Java using the Java Native Interface (JNI).

Gaussian mixture models (GMM, a sum of Gaussian probability distribution functions) are the center piece in most statistical speech processing algorithms such as speech, speaker or speaker state recognition. In a typical scenario, the system faces millions of samples in training, and several thousand for each test utterance (a second of speech typically results in about 100 samples), where the dimension of the sample vectors ranges in between 40 and 90. It is obvious that the main computational effort is to compute the argument to the exponential function (subtract mean from data point, multiply with inverse covariance matrix) and the argument to it.

In total, the number of samples processed per second could be improved by a factor of 10 for diagonal covariance matrices, and about 2.5 for full covariance matrices, by applying the following improvements to the code: Removal of divisions, compiler-friendly indexing, vectorization, loop unrolling, and the restriction to single precision. Further minor improvements could be achieved by replacing the standard implementations of the exponential and log functions by highly optimized versions such as the open source fmath or the Intel MKL.

The tree main take-home messages were that the basic optimization rules also apply to Java programming, special directives can be used to point the compiler to sections and loops that can should be vectorized or unrolled, and in case of JNI, the invocation overhead needs to be considered.