Dataset
The dataset consists of paired face images/video frames and speech audio samples collected from multiple speakers across different languages. The data is obtained from YouTube videos, including celebrity interviews, talk shows, and television debates. The visual data spans a wide range of variations, including pose, motion blur, background clutter, video quality, occlusion, and lighting conditions. Moreover, most videos contain real-world noise such as background chatter, music, overlapping speech, and compression artifacts, resulting in a challenging dataset for evaluating multimedia systems.
Modalities
Data Splits
Missing Modality Setup
Dataset access
The dataset is available at the following links:
To view the metadata for the dataset, see the PDFs attached below:
The file structure is as follows:
Baseline systems
We provide multiple baseline systems to help participants get started with the challenge. These baselines cover different modality configurations and fusion strategies:
Link to the paper:
Fusion and Orthogonal Projection for Improved Face-Voice Association
Link to the Baseline code:
https://github.com/msaadsaeed/polysim
Figure 1: Diagram showing our baseline methodology.
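To illustrate the kind of cross-modal matching the baseline systems perform, here is a minimal sketch of scoring face embeddings against voice embeddings in a shared joint space. All names and dimensions (`d_face`, `d_voice`, `d_joint`, the random projection matrices) are hypothetical placeholders, not the actual baseline architecture; the real baseline uses learned fusion and orthogonal projection as described in the linked paper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical embedding sizes; the actual baseline dimensions may differ.
d_face, d_voice, d_joint = 512, 192, 128
# Random projections stand in for learned modality-specific encoders.
w_face = rng.normal(size=(d_face, d_joint)) * 0.02
w_voice = rng.normal(size=(d_voice, d_joint)) * 0.02

def l2_normalize(x, eps=1e-8):
    # Unit-normalize rows so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def project(x, w):
    # Map a modality-specific embedding into the shared joint space.
    return l2_normalize(x @ w)

def cross_modal_scores(face_embs, voice_embs):
    # Cosine similarity of every face against every voice in joint space;
    # row i holds the scores of face i against all voices.
    return project(face_embs, w_face) @ project(voice_embs, w_voice).T

faces = rng.normal(size=(4, d_face))    # placeholder face embeddings
voices = rng.normal(size=(4, d_voice))  # placeholder voice embeddings
scores = cross_modal_scores(faces, voices)
print(scores.shape)  # (4, 4)
```

For face-voice association, the highest-scoring voice in each row would be taken as the predicted match; a trained system replaces the random projections with learned encoders and a fusion module.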