Available at: https://digitalcommons.calpoly.edu/theses/1382
Date of Award
MS in Electrical Engineering
Xiaozheng (Jane) Zhang
Audio-visual automatic speech recognition (AVASR) is a speech recognition technique integrating audio and video signals as input. Traditional audio-only speech recognition system only uses acoustic information from an audio source. However the recognition performance degrades significantly in acoustically noisy environments. It has been shown that visual information also can be used to identify speech. To improve the speech recognition performance, audio-visual automatic speech recognition has been studied. In this paper, we focus on the design of the visual front end of an AVASR system, which mainly consists of face detection and lip localization. The front end is built upon the AVICAR database that was recorded in moving vehicles. Therefore, diverse lighting conditions and poor quality of imagery are the problems we must overcome.
We first propose the use of the Viola-Jones face detection algorithm that can process images rapidly with high detection accuracy. When the algorithm is applied to the AVICAR database, we reach an accuracy of 89% face detection rate. By separately detecting and integrating the detection results from all different color channels, we further improve the detection accuracy to 95%. To reliably localize the lips, three algorithms are studied and compared: the Gabor filter algorithm, the lip enhancement algorithm, and the modified Viola-Jones algorithm for lip features. Finally, to increase detection rate, a modified Viola-Jones algorithm and lip enhancement algorithms are cascaded based on the results of three lip localization methods. Overall, the front end achieves an accuracy of 90% for lip localization.