The Multimodal Information-based Speech Processing (MISP) Challenge is a long-running community effort to advance robust speech and language technologies by leveraging multimodal cues in realistic application scenarios. Since 2021, the MISP Challenge series has served as a public platform that releases datasets, establishes benchmark tasks, and maintains evaluation protocols and leaderboards for fair comparison.
The previous MISP Challenges in 2021, 2022, and 2023 focused on a realistic home conversational scenario, where multiple speakers chatted in Mandarin while watching TV in a living room. To support this setting, the organizing team released a large-scale audio-visual Chinese home-conversation corpus and established multiple benchmark tasks, including audio-visual wake-word spotting, audio-visual target-speaker extraction, audio-visual speaker diarization, and audio-visual speech recognition. These released resources attracted extensive participation from the global research community: over 150 teams downloaded the dataset, more than 60 teams actively submitted results, and 15 research papers were presented at ICASSP 2022, 2023, and 2024.
In 2025, the MISP Challenge expanded further to a new scenario: meetings. The organizing team released the MISP-Meeting dataset to support tasks such as audio-visual speaker diarization and audio-visual speech recognition. The 2025 edition attracted over 60 teams that downloaded the dataset and more than 10 teams that submitted results, and led to 4 research papers presented at Interspeech 2025. Collectively, these outcomes demonstrate not only the organizing team's capability and long-term commitment to maintaining the MISP Challenge series, but also the sustained engagement of the MISP research community, providing a strong foundation for future editions.
Building upon the success of previous audio-visual benchmarks, the MISP 2026 Challenge takes a significant step forward by explicitly introducing the text modality, moving beyond audio-visual settings toward a more realistic audio-visual-text tri-modal formulation. In practical speech-enabled systems, textual context (either manually provided or automatically generated) often interacts with acoustic and visual cues. Therefore, understanding how these three modalities complement each other is a key research question for next-generation multimodal speech processing.
As a first step toward this tri-modal paradigm, the MISP 2026 Challenge focuses on the audio-visual-text query-by-example keyword spotting (AVT-QEKS) task. In this task, participants are provided with an enrollment example for each keyword, which includes spoken audio, visual speech cues (e.g., lip movements), and auxiliary textual information. Given a query clip containing audio and corresponding visual cues, participants must determine whether the query contains the same keyword as the enrollment. Because keywords are specified by example rather than drawn from a fixed, closed vocabulary, AVT-QEKS emphasizes open-vocabulary matching under realistic acoustic conditions. Participants are encouraged to design robust cross-modal models that effectively integrate audio, visual, and textual cues to achieve reliable keyword matching, for instance along the lines of the embedding-matching sketch below.
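To make the task setup concrete, the following is a minimal, illustrative sketch of one possible approach: encode the tri-modal enrollment and the audio-visual query into a shared embedding space and score the pair by cosine similarity. All class names, feature dimensions, the mean-pooling encoders, the fusion layers, and the decision threshold are hypothetical assumptions for illustration only; they are not the official MISP 2026 baseline or evaluation pipeline.

```python
# Illustrative sketch of query-by-example keyword matching with an
# audio-visual-text enrollment. All names, dimensions, and the fusion
# strategy are hypothetical placeholders, not the official baseline.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityEncoder(nn.Module):
    """Maps a sequence of per-frame features to one fixed-size embedding."""

    def __init__(self, in_dim: int, emb_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(in_dim, emb_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, in_dim) -> project, then mean-pool over time
        return self.proj(x).mean(dim=1)


class EnrollmentQueryMatcher(nn.Module):
    """Scores whether a query clip contains the enrolled keyword."""

    def __init__(self, audio_dim=80, video_dim=512, text_dim=768, emb_dim=256):
        super().__init__()
        self.audio_enc = ModalityEncoder(audio_dim, emb_dim)
        self.video_enc = ModalityEncoder(video_dim, emb_dim)
        self.text_enc = nn.Linear(text_dim, emb_dim)  # one vector per keyword text
        # Fuse the three enrollment embeddings into one keyword embedding.
        self.enroll_fusion = nn.Linear(3 * emb_dim, emb_dim)
        # Fuse the two query embeddings (no text on the query side).
        self.query_fusion = nn.Linear(2 * emb_dim, emb_dim)

    def embed_enrollment(self, audio, video, text):
        e = torch.cat(
            [self.audio_enc(audio), self.video_enc(video), self.text_enc(text)], dim=-1
        )
        return F.normalize(self.enroll_fusion(e), dim=-1)

    def embed_query(self, audio, video):
        q = torch.cat([self.audio_enc(audio), self.video_enc(video)], dim=-1)
        return F.normalize(self.query_fusion(q), dim=-1)

    def forward(self, enroll, query):
        e = self.embed_enrollment(*enroll)
        q = self.embed_query(*query)
        return F.cosine_similarity(e, q, dim=-1)  # higher score -> likely a match


if __name__ == "__main__":
    matcher = EnrollmentQueryMatcher()
    # Dummy inputs: log-Mel audio frames, lip-region visual features, text embedding.
    enroll = (torch.randn(1, 120, 80), torch.randn(1, 30, 512), torch.randn(1, 768))
    query = (torch.randn(1, 200, 80), torch.randn(1, 50, 512))
    score = matcher(enroll, query)
    detected = score > 0.5  # threshold would be tuned on development data
    print(f"similarity={score.item():.3f}, keyword detected={bool(detected)}")
```

In such a design, the network would typically be trained with a metric-learning or contrastive objective so that enrollment and query embeddings of the same keyword are close and those of different keywords are far apart; the detection threshold is then chosen on development data.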
The following resources will be provided:
For additional information, please email us at mispchallenge@gmail.com.