Dataset

The dataset consists of paired face images/video frames and speech audio samples collected from multiple speakers across different languages. The data is obtained from YouTube videos, including celebrity interviews, talk shows, and television debates. The visual data spans a wide range of variations in pose, motion blur, background clutter, video quality, occlusion, and lighting conditions. Moreover, most videos contain real-world noise such as background chatter, music, overlapping speech, and compression artifacts, resulting in a challenging dataset for evaluating multimedia systems.

Modalities

Data Splits

Missing Modality Setup

Dataset access

The dataset is available at the following links:

To view the metadata for the dataset, see the PDFs attached below:

The file structure is as follows:

Baseline systems

We provide multiple baseline systems to help participants get started with the challenge. These baselines cover different modality configurations and fusion strategies:

Link to the paper:
Fusion and Orthogonal Projection for Improved Face-Voice Association

Link to the Baseline code:
https://github.com/msaadsaeed/polysim
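The baseline paper combines face and voice embeddings through a fusion step and encourages class-wise embeddings to be mutually orthogonal. The sketch below is an illustrative NumPy toy, not the actual baseline implementation: the sigmoid gate, the 128-d embedding size, and the squared-cosine orthogonality penalty are all simplifying assumptions standing in for the learned components described in the paper and repository.

```python
import numpy as np

def gated_fusion(face, voice):
    """Fuse face and voice embeddings with a gate.

    Illustrative only: here the gate is a fixed sigmoid over the
    element-wise sum; in the real baseline it would be a learned layer.
    """
    gate = 1.0 / (1.0 + np.exp(-(face + voice)))  # sigmoid gate in [0, 1]
    return gate * face + (1.0 - gate) * voice

def orthogonality_penalty(centroids):
    """Penalty that is zero when class centroids are mutually orthogonal.

    Computed as the sum of squared cosine similarities between
    distinct class centroids.
    """
    normed = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sims = normed @ normed.T                      # pairwise cosine similarities
    off_diag = sims - np.eye(len(centroids))      # drop self-similarity terms
    return float(np.sum(off_diag ** 2))

rng = np.random.default_rng(0)
face = rng.normal(size=(4, 128))   # 4 samples, 128-d face embeddings (assumed dim)
voice = rng.normal(size=(4, 128))  # matching 128-d voice embeddings
fused = gated_fusion(face, voice)  # shape (4, 128)

# Toy "class centroids": mean fused embedding of two pretend classes.
centroids = np.stack([fused[:2].mean(axis=0), fused[2:].mean(axis=0)])
penalty = orthogonality_penalty(centroids)
```

Minimizing such a penalty alongside a standard identification loss pushes the fused embeddings of different identities toward orthogonal directions, which is the intuition behind the orthogonal-projection component of the baseline.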

Figure 1: Diagram showing our baseline methodology.