Deep Learning in Visual Speech Recognition: A Review of Recent Developments and Performance Analysis
Abstract
Visual Speech Recognition (VSR) is especially important in situations where acoustic signals are distorted, for example, in noisy environments or for people with hearing loss. This review focuses on a central difficulty of VSR: many acoustically distinct phonemes look nearly identical on the lips. A viseme is a visual speech unit onto which several such phonemes map, so phonemes that share a viseme cannot be reliably distinguished from visual information alone. We discuss phoneme-to-viseme mapping and the effect of these visual ambiguities on VSR performance when acoustic information is degraded or absent. We then survey approaches for improving VSR accuracy, including data-driven machine learning and deep learning methods, fusion of the visual stream with other sensory inputs, and recognition systems that exploit linguistic context. We also review existing VSR systems and benchmarks, including LipNet and Lip Reading in the Wild (LRW), and their limitations in practical scenarios. Future directions include jointly exploiting visual and degraded acoustic signals, novel neural network architectures, speaker-adaptive (personalized) VSR systems, and improved real-time processing. The purpose of this review is to give a clear picture of the existing literature on the challenges and opportunities for improving VSR accuracy in low-quality acoustic environments so that better communication technologies can be developed.
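To illustrate why visemes are problematic, the sketch below shows a deliberately simplified many-to-one phoneme-to-viseme mapping. The viseme groupings and labels here are a common textbook-style example chosen for illustration, not the specific mapping analyzed in this review; real mappings cover the full phoneme inventory and vary across studies.

```python
# Illustrative (simplified) many-to-one phoneme-to-viseme mapping.
# The groupings below are assumptions for demonstration only.
PHONEME_TO_VISEME = {
    # Bilabials /p/, /b/, /m/ look nearly identical on the lips.
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    # Labiodentals /f/, /v/ share the same lip-teeth configuration.
    "f": "labiodental", "v": "labiodental",
    # Alveolars /t/, /d/, /n/ are articulated largely out of view.
    "t": "alveolar", "d": "alveolar", "n": "alveolar",
}

def visual_transcription(phonemes):
    """Collapse a phoneme sequence to the viseme sequence a lipreader observes."""
    return [PHONEME_TO_VISEME.get(p, p) for p in phonemes]

# "bat" and "mat" become visually indistinguishable:
print(visual_transcription(["b", "ae", "t"]))  # ['bilabial', 'ae', 'alveolar']
print(visual_transcription(["m", "ae", "t"]))  # ['bilabial', 'ae', 'alveolar']
```

Because distinct words collapse to the same viseme sequence, a purely visual recognizer must rely on linguistic context or additional modalities to resolve the ambiguity, which is the motivation for several of the methods surveyed in this review.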