Multi-modal Speech Recognition in Noisy Environments
- Subject: Automatic Speech Recognition
- Type: Master's Thesis
- Supervisor: Zhaolin Li
- Add on:
Recognizing speech in complex, noisy environments where multiple conversations occur simultaneously (the cocktail-party problem) remains a formidable task for machines. While humans effortlessly focus on a single speaker by combining auditory, visual, and semantic cues, current ASR systems struggle in such multi-speaker settings.
In this thesis, you are expected to build an ASR system that simulates human-like attention by integrating multiple modalities, such as facial movements, lip activity, speaker identity, and conversational semantics, to isolate and accurately transcribe individual speech streams. A minimal sketch of one possible fusion pattern follows.
Requirements:
Strong programming and debugging skills
Knowledge of Python and PyTorch
Knowledge of machine learning
Literature:
Abouelenin, Abdelrahman, et al. "Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs." arXiv preprint arXiv:2503.01743 (2025).
Wu, Shilong, et al. "The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction." ICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024.