The Multimodal Information-based Speech Processing (MISP) Challenge is a long-running community effort to advance robust speech and language technologies by leveraging multimodal cues in realistic application scenarios. Since 2021, the MISP Challenge series has served as a public platform that releases datasets, establishes benchmark tasks, and maintains evaluation protocols and leaderboards for fair comparison.
The previous MISP Challenges in 2021, 2022, and 2023 focused on a realistic home conversational scenario, where multiple speakers chatted in Mandarin while watching TV in a living room. To support this setting, the organizing team released a large-scale audio-visual Chinese home-conversation corpus and established multiple benchmark tasks, including audio-visual wake-word spotting, audio-visual target-speaker extraction, audio-visual speaker diarization, and audio-visual speech recognition. These released resources attracted extensive participation from the global research community: over 150 teams downloaded the dataset, more than 60 teams actively submitted results, and 15 research papers were presented at ICASSP 2022, 2023, and 2024.
In 2025, the MISP Challenge expanded further to a new scenario: meetings. The organizing team released the MISP-Meeting dataset to support tasks such as audio-visual speaker diarization and audio-visual speech recognition. The 2025 edition attracted over 60 teams that downloaded the dataset and more than 10 teams that submitted results, and led to 4 research papers presented at Interspeech 2025. Collectively, these outcomes demonstrate not only the organizing team's capability and long-term commitment to maintaining the MISP Challenge series, but also the sustained engagement of the MISP research community, providing a strong foundation for future editions.
Building upon the success of previous audio-visual benchmarks, the MISP 2026 Challenge takes a significant step forward by explicitly introducing the text modality, moving beyond audio-visual settings toward a more realistic audio-visual-text tri-modal formulation. In practical speech-enabled systems, textual context (either manually provided or automatically generated) often interacts with acoustic and visual cues. Therefore, understanding how these three modalities complement each other is a key research question for next-generation multimodal speech processing.
As a first step toward this tri-modal paradigm, the MISP 2026 Challenge focuses on the audio-visual-text query-by-example keyword spotting (AVT-QEKS) task. In this task, participants are provided with an enrollment example for each keyword, which includes spoken audio, visual speech cues (e.g., lip movements), and auxiliary textual information. Given a query clip containing audio and corresponding visual cues, participants must determine whether the query contains the same keyword as the enrollment. Because keywords are specified by example rather than drawn from a fixed, closed vocabulary, AVT-QEKS emphasizes open-vocabulary matching under realistic acoustic conditions. Participants are encouraged to design robust cross-modal models that effectively integrate audio, visual, and textual cues to achieve reliable keyword matching, for instance along the lines of the embedding-matching sketch below.
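To make the task setup concrete, the following is a minimal, illustrative sketch of one possible approach: encode the tri-modal enrollment and the audio-visual query into a shared embedding space and score the pair by cosine similarity. All class names, feature dimensions, the mean-pooling encoders, the fusion layers, and the decision threshold are hypothetical assumptions for illustration only; they are not the official MISP 2026 baseline or evaluation pipeline.

```python
# Illustrative sketch of query-by-example keyword matching with an
# audio-visual-text enrollment. All names, dimensions, and the fusion
# strategy are hypothetical placeholders, not the official baseline.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityEncoder(nn.Module):
    """Maps a sequence of per-frame features to one fixed-size embedding."""

    def __init__(self, in_dim: int, emb_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(in_dim, emb_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, in_dim) -> project, then mean-pool over time
        return self.proj(x).mean(dim=1)


class EnrollmentQueryMatcher(nn.Module):
    """Scores whether a query clip contains the enrolled keyword."""

    def __init__(self, audio_dim=80, video_dim=512, text_dim=768, emb_dim=256):
        super().__init__()
        self.audio_enc = ModalityEncoder(audio_dim, emb_dim)
        self.video_enc = ModalityEncoder(video_dim, emb_dim)
        self.text_enc = nn.Linear(text_dim, emb_dim)  # one vector per keyword text
        # Fuse the three enrollment embeddings into one keyword embedding.
        self.enroll_fusion = nn.Linear(3 * emb_dim, emb_dim)
        # Fuse the two query embeddings (no text on the query side).
        self.query_fusion = nn.Linear(2 * emb_dim, emb_dim)

    def embed_enrollment(self, audio, video, text):
        e = torch.cat(
            [self.audio_enc(audio), self.video_enc(video), self.text_enc(text)], dim=-1
        )
        return F.normalize(self.enroll_fusion(e), dim=-1)

    def embed_query(self, audio, video):
        q = torch.cat([self.audio_enc(audio), self.video_enc(video)], dim=-1)
        return F.normalize(self.query_fusion(q), dim=-1)

    def forward(self, enroll, query):
        e = self.embed_enrollment(*enroll)
        q = self.embed_query(*query)
        return F.cosine_similarity(e, q, dim=-1)  # higher score -> likely a match


if __name__ == "__main__":
    matcher = EnrollmentQueryMatcher()
    # Dummy inputs: log-Mel audio frames, lip-region visual features, text embedding.
    enroll = (torch.randn(1, 120, 80), torch.randn(1, 30, 512), torch.randn(1, 768))
    query = (torch.randn(1, 200, 80), torch.randn(1, 50, 512))
    score = matcher(enroll, query)
    detected = score > 0.5  # threshold would be tuned on development data
    print(f"similarity={score.item():.3f}, keyword detected={bool(detected)}")
```

In such a design, the network would typically be trained with a metric-learning or contrastive objective so that enrollment and query embeddings of the same keyword are close and those of different keywords are far apart; the detection threshold is then chosen on development data.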
The following resources will be provided:
For additional information, please email us at mispchallenge@gmail.com.