MISP-QEKS Dataset

Large-scale Audio-Visual-Text Benchmark for Query-by-Example Keyword Spotting

The MISP 2026 Challenge uses the MISP-QEKS dataset, which is the first large-scale benchmark explicitly designed for audio–visual–text tri-modal query-by-example keyword spotting (QEKS). MISP-QEKS combines the linguistic richness and speaker diversity of existing open-source corpora with realistic noise conditions, producing a dataset that is both scalable and faithful to real deployment scenarios.


Creation Pipeline

As illustrated in Figure 2, MISP-QEKS is constructed using a hybrid strategy. We start with an automatic pipeline that extracts word-level audio-visual-text clips from sentence-level corpora that provide synchronized audio, video, and transcripts. We then record diverse background noises in everyday environments and mix them with the clean clips at controlled signal-to-noise ratios (SNRs). Finally, all samples are organized into enrollment–query pairs, resulting in the first large-scale corpus for tri-modal QEKS.

MISP-QEKS data processing pipeline: keyword clipping, noisy simulation, and sample-pair construction.

Figure 2: Overview of the MISP-QEKS data processing pipeline: (a) keyword clipping, (b) noisy simulation, (c) enrollment–query pair construction.


Statistical Information

MISP-QEKS is constructed from the Oxford-BBC LRS2 corpus. The dataset provides approximately 193.606 hours of tri-modal data, covering 9,830+ distinct keywords and 610,000 enrollment–query pairs (122,000 positive and 488,000 negative). The dataset is split under a speaker-independent criterion.

Split Duration (h) Keywords Pairs Positive Negative
Train 157.756 8,357 500,000 100,000 400,000
Dev 3.245 2,247 10,000 2,000 8,000
Eval-seen (IV) 15.300 2,174 50,000 10,000 40,000
Eval-blind (OOV) 17.305 1,445 50,000 10,000 40,000

The development set partially overlaps with training keywords for hyperparameter tuning. For evaluation, Eval-seen is drawn from training keywords (in-vocabulary), while Eval-blind contains unseen keywords (out-of-vocabulary), enabling both IV and OOV assessment.

For the MISP 2026 Challenge, we will additionally release a new held-out evaluation set with both seen and unseen keywords to enable fair leaderboard ranking and to assess both in-vocabulary and out-of-vocabulary generalization.

Downloads

The MISP-QEKS dataset is hosted on HuggingFace under controlled access.

Before downloading, participants must:

  1. Complete the Registration Form.
  2. Use the same email address to request dataset access on HuggingFace.
  3. After verification, download permission will be granted.

Dataset link:
https://huggingface.co/datasets/Igor97/MISP-QEKS

⚠ Access requests submitted with a different email address will not be approved.