Multi-modal Speech Recognition in Noisy Environments

  • Subject: Automatic Speech Recognition
  • Type: Master's thesis (Masterarbeit)
  • Supervisor:

    Zhaolin Li

  • Description:

    Recognizing speech in complex, noisy environments where multiple conversations occur simultaneously—known as the cocktail-party challenge—remains a formidable task for machines. While humans effortlessly focus on a single speaker by leveraging a combination of auditory, visual, and semantic cues, current ASR systems struggle in these multi-speaker settings.

    In this thesis, you are expected to build an ASR system that simulates human-like attention by integrating multiple modalities, such as facial movements, lip activity, speaker identity, and conversational semantics, to isolate and accurately transcribe individual speech streams.
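
    To make the integration idea concrete, the sketch below shows one possible fusion mechanism in PyTorch: audio frames attend to lip-movement embeddings via cross-attention, so that visual cues can help disambiguate overlapping speakers. This is a minimal illustration, not a prescribed architecture; all module names, feature dimensions, and frame rates are assumptions for the example.

    ```python
    import torch
    import torch.nn as nn

    class AudioVisualFusion(nn.Module):
        """Fuses audio frames with lip-movement embeddings via cross-attention.

        Dimensions are illustrative: audio_dim could be a log-mel encoder
        output, visual_dim the embedding size of a lip-reading front-end.
        """

        def __init__(self, audio_dim=80, visual_dim=512, d_model=256, n_heads=4):
            super().__init__()
            self.audio_proj = nn.Linear(audio_dim, d_model)
            self.visual_proj = nn.Linear(visual_dim, d_model)
            # Audio queries attend to visual keys/values, letting lip
            # activity steer attention toward the target speaker.
            self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, audio, visual):
            # audio:  (batch, T_audio, audio_dim)
            # visual: (batch, T_video, visual_dim)
            q = self.audio_proj(audio)
            kv = self.visual_proj(visual)
            fused, _ = self.cross_attn(q, kv, kv)
            # Residual connection keeps the audio stream dominant.
            return self.norm(q + fused)

    # Toy usage: 100 audio frames vs. 25 video frames (a typical ratio
    # between acoustic frame rate and 25 fps video).
    fusion = AudioVisualFusion()
    audio = torch.randn(2, 100, 80)
    visual = torch.randn(2, 25, 512)
    out = fusion(audio, visual)  # (2, 100, 256), ready for an ASR decoder
    ```

    The fused representation would then feed a standard ASR decoder; in the thesis, a block like this could be extended with speaker-identity and semantic conditioning.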


    Requirements:

    • Strong programming and debugging skills
    • Knowledge of Python and PyTorch
    • Knowledge of machine learning


    Literature:

    Abouelenin, Abdelrahman, et al. "Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs." arXiv preprint arXiv:2503.01743 (2025).

    Wu, Shilong, et al. "The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction." ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024.