Large-scale Audio-Visual-Text Benchmark for Query-by-Example Keyword Spotting
The MISP 2026 Challenge uses the MISP-QEKS dataset, which is the first large-scale benchmark explicitly designed for audio–visual–text tri-modal query-by-example keyword spotting (QEKS). MISP-QEKS combines the linguistic richness and speaker diversity of existing open-source corpora with realistic noise conditions, producing a dataset that is both scalable and faithful to real deployment scenarios.
As illustrated in Figure 2, MISP-QEKS is constructed using a hybrid strategy. We start with an automatic pipeline that extracts word-level audio-visual-text clips from sentence-level corpora that provide synchronized audio, video, and transcripts. We then record diverse background noises in everyday environments and mix them with the clean clips at controlled signal-to-noise ratios (SNRs). Finally, all samples are organized into enrollment–query pairs, resulting in the first large-scale corpus for tri-modal QEKS.
Figure 2: Overview of the MISP-QEKS data processing pipeline: (a) keyword clipping, (b) noisy simulation, (c) enrollment–query pair construction.
MISP-QEKS is constructed from the Oxford-BBC LRS2 corpus. The dataset provides approximately 193.606 hours of tri-modal data, covering 9,830+ distinct keywords and 610,000 enrollment–query pairs (122,000 positive and 488,000 negative). The dataset is split under a speaker-independent criterion.
| Split | Duration (h) | Keywords | Pairs | Positive | Negative |
|---|---|---|---|---|---|
| Train | 157.756 | 8,357 | 500,000 | 100,000 | 400,000 |
| Dev | 3.245 | 2,247 | 10,000 | 2,000 | 8,000 |
| Eval-seen (IV) | 15.300 | 2,174 | 50,000 | 10,000 | 40,000 |
| Eval-blind (OOV) | 17.305 | 1,445 | 50,000 | 10,000 | 40,000 |
The development set partially overlaps with training keywords for hyperparameter tuning. For evaluation, Eval-seen is drawn from training keywords (in-vocabulary), while Eval-blind contains unseen keywords (out-of-vocabulary), enabling both IV and OOV assessment.
For the MISP 2026 Challenge, we will additionally release a new held-out evaluation set with both seen and unseen keywords to enable fair leaderboard ranking and to assess both in-vocabulary and out-of-vocabulary generalization.
The MISP-QEKS dataset is hosted on HuggingFace under controlled access.
Before downloading, participants must:
Dataset link:
https://huggingface.co/datasets/Igor97/MISP-QEKS
⚠ Access requests submitted with a different email address will not be approved.