Introduction

The Multimodal Information-based Speech Processing (MISP) Challenge is a long-running community effort to advance robust speech and language technologies by leveraging multimodal cues in realistic application scenarios. Since 2021, the MISP Challenge series has served as a public platform that releases datasets, establishes benchmark tasks, and maintains evaluation protocols and leaderboards for fair comparison.

The previous MISP Challenges in 2021, 2022, and 2023 focused on a realistic home conversational scenario, where multiple speakers chatted in Mandarin while watching TV in a living room. To support this setting, the organizing team released a large-scale audio-visual Chinese home-conversation corpus and established multiple benchmark tasks, including audio-visual wake-word spotting, audio-visual target-speaker extraction, audio-visual speaker diarization, and audio-visual speech recognition. These released resources attracted extensive participation from the global research community: over 150 teams downloaded the dataset, more than 60 teams actively submitted results, and 15 research papers were presented at ICASSP 2022, 2023, and 2024.

In 2025, the MISP Challenge expanded further to a new scenario: meetings. The organizing team released the MISP-Meeting dataset to support tasks such as audio-visual speaker diarization and audio-visual speech recognition. The 2025 edition attracted over 60 teams that downloaded the dataset and more than 10 teams that submitted results, and yielded 4 research papers presented at Interspeech 2025. Collectively, these outcomes demonstrate not only the organizing team's capability and long-term commitment to maintaining the MISP Challenge series, but also the sustained engagement of the MISP research community, providing a strong foundation for future editions.

Building upon the success of previous audio-visual benchmarks, the MISP 2026 Challenge takes a significant step forward by explicitly introducing the text modality, moving beyond audio-visual settings toward a more realistic audio-visual-text tri-modal formulation. In practical speech-enabled systems, textual context (either manually provided or automatically generated) often interacts with acoustic and visual cues. Therefore, understanding how these three modalities complement each other is a key research question for next-generation multimodal speech processing.

As a first step toward this tri-modal paradigm, the MISP 2026 Challenge focuses on the audio-visual-text query-by-example keyword spotting (AVT-QEKS) task. In this task, participants are provided with an enrollment example for each keyword, which includes spoken audio, visual speech cues (e.g., lip movements), and auxiliary textual information. Given a query clip containing audio and corresponding visual cues, participants are required to determine whether the query contains the same keyword as the enrollment. As a result, AVT-QEKS emphasizes open-vocabulary matching under realistic acoustic conditions. Participants are encouraged to design robust cross-modal models that effectively integrate audio, visual, and textual cues to achieve reliable keyword matching.
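To make the query-by-example matching protocol concrete, the following is a minimal illustrative sketch, not the official baseline: it assumes each modality has already been mapped to a fixed-dimensional embedding by some encoder (the encoders themselves, the late-fusion weights, and the decision threshold are all hypothetical), fuses the modalities, and compares enrollment and query by cosine similarity.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def fuse(embeddings, weights):
    """Late fusion: weighted sum of per-modality embeddings.

    The weights are illustrative, not taken from the challenge baseline.
    """
    dim = len(embeddings[0])
    return [sum(w * e[i] for w, e in zip(weights, embeddings))
            for i in range(dim)]

def detect_keyword(enroll_emb, query_emb, threshold=0.7):
    """Decide whether the query matches the enrolled keyword.

    The threshold is a hypothetical operating point; in practice it
    would be tuned on the development set.
    """
    return cosine_similarity(enroll_emb, query_emb) >= threshold

# Toy 2-dimensional "embeddings" standing in for real audio / visual /
# text encoder outputs (order: audio, visual, text).
weights = (0.5, 0.3, 0.2)
enroll = fuse([[1.0, 0.0], [0.9, 0.1], [1.0, 0.0]], weights)
query_match = fuse([[0.95, 0.05], [1.0, 0.0], [0.9, 0.1]], weights)
query_other = fuse([[0.0, 1.0], [0.1, 0.9], [0.0, 1.0]], weights)

print(detect_keyword(enroll, query_match))   # True
print(detect_keyword(enroll, query_other))   # False
```

A real system would replace the toy vectors with learned cross-modal representations, and the fixed threshold with one calibrated on the development set; the point here is only the shape of the enrollment-query decision.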

The following resources will be provided:

  • The MISP-QEKS dataset with official train/dev/eval splits and evaluation protocol.
  • Official baseline system, pretrained feature extractors, and an example checkpoint.
  • Public leaderboard for fair comparison across In-Vocabulary (IV) and Out-of-Vocabulary (OOV) settings.
  • A challenge session to foster communication and highlight effective techniques (subject to sufficient paper submissions).
  • A summary publication discussing findings and future directions.

Planned Schedule (AoE Time)

  • Registration opens and MISP-QEKS dataset release: March 18, 2026
  • Baseline release: April 1, 2026
  • Registration closes; evaluation set release and leaderboard update for the evaluation set: May 25, 2026
  • Leaderboard freeze: June 25, 2026
  • System report submission: July 1, 2026
  • Final paper submission: July 8, 2026

Organizers

  • Hang Chen, University of Science and Technology of China
  • Jun Du, University of Science and Technology of China
  • Chin-Hui Lee, Georgia Institute of Technology
  • Jingdong Chen, Northwestern Polytechnical University
  • Shinji Watanabe, Carnegie Mellon University
  • Sabato Marco Siniscalchi, University of Palermo
  • Odette Scharenborg, Delft University of Technology

Contact Us

For additional information, please email us at mispchallenge@gmail.com.