Multi-modal Speech Recognition in Noisy Environments

  • Subject: Automatic Speech Recognition
  • Type: Master's thesis (Masterarbeit)
  • Supervisor:

    Zhaolin Li

  • Description:

    Recognizing speech in complex, noisy environments where multiple conversations occur simultaneously—known as the cocktail-party challenge—remains a formidable task for machines. While humans effortlessly focus on a single speaker by leveraging a combination of auditory, visual, and semantic cues, current ASR systems struggle in these multi-speaker settings.

    In this thesis, you are expected to build an ASR system that simulates human-like attention by integrating multiple modalities, such as facial movements, lip activity, speaker identity, and conversational semantics, to isolate and accurately transcribe individual speech streams.
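
    To make the integration idea concrete, the sketch below shows one possible fusion mechanism in PyTorch: audio frames attend to lip-movement embeddings via cross-attention, so that visual cues can help disambiguate overlapping speakers. This is a minimal illustration, not a prescribed architecture; all module names, feature dimensions, and frame rates are assumptions for the example.

    ```python
    import torch
    import torch.nn as nn

    class AudioVisualFusion(nn.Module):
        """Fuses audio frames with lip-movement embeddings via cross-attention.

        Dimensions are illustrative: audio_dim could be a log-mel encoder
        output, visual_dim the embedding size of a lip-reading front-end.
        """

        def __init__(self, audio_dim=80, visual_dim=512, d_model=256, n_heads=4):
            super().__init__()
            self.audio_proj = nn.Linear(audio_dim, d_model)
            self.visual_proj = nn.Linear(visual_dim, d_model)
            # Audio queries attend to visual keys/values, letting lip
            # activity steer attention toward the target speaker.
            self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, audio, visual):
            # audio:  (batch, T_audio, audio_dim)
            # visual: (batch, T_video, visual_dim)
            q = self.audio_proj(audio)
            kv = self.visual_proj(visual)
            fused, _ = self.cross_attn(q, kv, kv)
            # Residual connection keeps the audio stream dominant.
            return self.norm(q + fused)

    # Toy usage: 100 audio frames vs. 25 video frames (a typical ratio
    # between acoustic frame rate and 25 fps video).
    fusion = AudioVisualFusion()
    audio = torch.randn(2, 100, 80)
    visual = torch.randn(2, 25, 512)
    out = fusion(audio, visual)  # (2, 100, 256), ready for an ASR decoder
    ```

    The fused representation would then feed a standard ASR decoder; in the thesis, a block like this could be extended with speaker-identity and semantic conditioning.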


    Requirements:

    • Strong programming and debugging skills
    • Knowledge of Python and PyTorch
    • Knowledge of machine learning


    Literature:

    Abouelenin, Abdelrahman, et al. "Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs." arXiv preprint arXiv:2503.01743 (2025).

    Wu, Shilong, et al. "The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction." ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024.