MISP Challenge 2026 Data

MISP-QEKS Dataset

Large-scale Audio-Visual-Text Benchmark for Query-by-Example Keyword Spotting

The MISP 2026 Challenge uses the MISP-QEKS dataset, which is the first large-scale benchmark explicitly designed for audio–visual–text tri-modal query-by-example keyword spotting (QEKS). MISP-QEKS combines the linguistic richness and speaker diversity of existing open-source corpora with realistic noise conditions, producing a dataset that is both scalable and faithful to real deployment scenarios.

Creation Pipeline

As illustrated in Figure 2, MISP-QEKS is constructed using a hybrid strategy. We start with an automatic pipeline that extracts word-level audio-visual-text clips from sentence-level corpora that provide synchronized audio, video, and transcripts. We then record diverse background noises in everyday environments and mix them with the clean clips at controlled signal-to-noise ratios (SNRs). Finally, all samples are organized into enrollment–query pairs, resulting in the first large-scale corpus for tri-modal QEKS.

MISP-QEKS data processing pipeline: keyword clipping, noisy simulation, and sample-pair construction.

Figure 2: Overview of the MISP-QEKS data processing pipeline: (a) keyword clipping, (b) noisy simulation, (c) enrollment–query pair construction.

Statistical Information

MISP-QEKS is constructed from the Oxford-BBC LRS2 corpus. The dataset provides approximately 193.606 hours of tri-modal data, covering 9,830+ distinct keywords and 610,000 enrollment–query pairs (122,000 positive and 488,000 negative). The dataset is split under a speaker-independent criterion.

Split	Duration (h)	Keywords	Pairs	Positive	Negative
Train	157.756	8,357	500,000	100,000	400,000
Dev	3.245	2,247	10,000	2,000	8,000
Eval-seen (IV)	15.300	2,174	50,000	10,000	40,000
Eval-blind (OOV)	17.305	1,445	50,000	10,000	40,000

The development set partially overlaps with training keywords for hyperparameter tuning. For evaluation, Eval-seen is drawn from training keywords (in-vocabulary), while Eval-blind contains unseen keywords (out-of-vocabulary), enabling both IV and OOV assessment.

For the MISP 2026 Challenge, we will additionally release a new held-out evaluation set with both seen and unseen keywords to enable fair leaderboard ranking and to assess both in-vocabulary and out-of-vocabulary generalization.

Downloads

The MISP-QEKS dataset is hosted on HuggingFace under controlled access.

Before downloading, participants must:

Complete the Registration Form.
Use the same email address to request dataset access on HuggingFace.
After verification, download permission will be granted.

Dataset link:
https://huggingface.co/datasets/Igor97/MISP-QEKS

⚠ Access requests submitted with a different email address will not be approved.

Multimodal Information Based Speech Processing (MISP) 2026 Challenge

MISP-QEKS Dataset

Creation Pipeline

Statistical Information

Downloads