Task

Multimodal speaker classification systems typically assume the availability of all modalities during inference. However, in real-world scenarios such as multimedia retrieval, surveillance, and teleconferencing, one or more modalities may be unavailable due to occlusion, sensor failure, or data corruption. Additionally, most existing systems are trained and tested on the same language, limiting their applicability in multilingual environments.

This challenge aims to push the boundaries of multimodal speaker identification under missing-modality and cross-lingual conditions.

The goal of the POLYSIM 2026 challenge is to identify speakers across multiple languages when either the face or voice modality is missing. Participants must develop models that can handle three scenarios: (1) voice-only identification when facial data is unavailable, (2) face-only identification when vocal data is unavailable, and (3) multimodal identification when both modalities are available. The challenge focuses on maintaining robust speaker identification performance across different languages (polyglot) even when one modality is completely absent, reflecting real-world constraints where complete multimodal data is not always available.

The task is closed-set speaker classification using audio (voice) and visual (face) modalities, where either modality may be absent at test time.

Participants must design a single unified model that can handle different testing conditions without retraining.
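One common way to satisfy this requirement is late fusion over a shared embedding space, where a missing modality is simply dropped before fusion. The sketch below is purely illustrative and is not the required architecture: the toy "encoders" and the speaker centroids are hypothetical placeholders standing in for trained networks and enrolled speakers.

```python
# Hypothetical sketch: one model serves audio-only, face-only, and
# audio+face inference without retraining. Each modality is embedded
# into a shared space; the available embeddings are averaged and the
# speaker is chosen by nearest centroid (closed-set decision).

def embed_audio(x):
    # Placeholder audio encoder: a fixed linear map into a 2-D shared space.
    return [x[0] + x[1], x[0] - x[1]]

def embed_face(x):
    # Placeholder face encoder into the same 2-D shared space.
    return [0.5 * x[0], 0.5 * x[1]]

def fuse(audio=None, face=None):
    """Average whichever modality embeddings are present."""
    embs = []
    if audio is not None:
        embs.append(embed_audio(audio))
    if face is not None:
        embs.append(embed_face(face))
    if not embs:
        raise ValueError("at least one modality is required")
    return [sum(vals) / len(embs) for vals in zip(*embs)]

def classify(embedding, centroids):
    """Closed-set decision: return the speaker with the nearest centroid."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(centroids, key=lambda spk: sq_dist(embedding, centroids[spk]))

centroids = {"spk_a": [1.0, 0.0], "spk_b": [-1.0, 0.0]}
# The same model handles multimodal and audio-only test conditions:
both = classify(fuse(audio=[0.6, 0.5], face=[1.8, 0.1]), centroids)
audio_only = classify(fuse(audio=[0.6, 0.5]), centroids)
```

Because the fusion step only averages what is present, no branch of the model needs to be retrained when a modality disappears; richer strategies (attention over modalities, modality dropout during training) follow the same principle.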

Figure 1: The POLYSIM 2026 Challenge focuses on speaker identification across multiple languages with missing modalities (either face or voice may be absent).

The challenge includes four task settings, covering multimodal, missing-modality, and cross-lingual scenarios.

P3. In-Language Multimodal

This is the standard multimodal setting where both modalities are fully available and no language shift is present.

P4. Missing-Modality (Audio-Only)

The face modality is completely missing at test time. Models must perform speaker classification using audio only, without retraining.

P5. Cross-Lingual Multimodal

This setting evaluates the ability of models to generalize across languages when both modalities are available.

P6. Cross-Lingual Missing-Modality

This is the most challenging scenario, combining a missing modality (audio-only testing) with a cross-lingual shift between training and testing.

Setting   Training Modalities   Testing Modalities   Language
P3        Audio + Face          Audio + Face         Same
P4        Audio + Face          Audio only           Same
P5        Audio + Face          Audio + Face         Cross-lingual
P6        Audio + Face          Audio only           Cross-lingual
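The four settings differ only in which modalities are presented at test time and whether a language shift is applied, so a single evaluation loop can parameterize them. The encoding below is an illustrative sketch derived from the table above, not an official configuration format.

```python
# Illustrative encoding of the four POLYSIM settings from the table above.
# Training always uses Audio + Face; only the test conditions vary.
SETTINGS = {
    "P3": {"test_modalities": ("audio", "face"), "cross_lingual": False},
    "P4": {"test_modalities": ("audio",),        "cross_lingual": False},
    "P5": {"test_modalities": ("audio", "face"), "cross_lingual": True},
    "P6": {"test_modalities": ("audio",),        "cross_lingual": True},
}

def face_available(setting):
    """True if the face modality is present at test time for this setting."""
    return "face" in SETTINGS[setting]["test_modalities"]

flags = {s: face_available(s) for s in SETTINGS}
```

A harness built this way evaluates one unified model under all four conditions simply by masking the face input for P4 and P6, consistent with the "no retraining" requirement.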

Evaluation protocol

Performance will be evaluated using conventional closed-set classification accuracy.

We are also considering Equal Error Rate (EER) as an additional metric for evaluating challenge performance. Participants are expected to submit an output score file with one score per test pair, indicating how confident the system is that the face and voice belong to the same person; the higher the score, the greater that confidence. In real-world applications, a threshold is set on this score to produce a binary same/different decision. Raising the threshold lowers the false acceptance rate (FAR) but raises the false rejection rate (FRR). The EER is the operating point at which FAR and FRR are equal. Unlike conventional accuracy, the EER is therefore independent of any particular threshold, and a lower EER indicates a better system. For more information, please see the evaluation plan.
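The threshold sweep described above can be made concrete with a few lines of code. The scores and labels below are made-up toy data (1 = same person, 0 = different person), not challenge data; a real score file would contain one score per test pair.

```python
# Minimal sketch of computing the Equal Error Rate (EER) from match scores.
def eer(scores, labels):
    """Return (eer, threshold) where FAR and FRR are closest to equal.

    A pair is accepted as "same person" when its score >= threshold.
    """
    positives = sum(labels)               # number of true "same" pairs
    negatives = len(labels) - positives   # number of true "different" pairs
    best_gap, best = float("inf"), None
    for t in sorted(set(scores)):         # sweep every score as a threshold
        accepts = [s >= t for s in scores]
        far = sum(a and l == 0 for a, l in zip(accepts, labels)) / negatives
        frr = sum((not a) and l == 1 for a, l in zip(accepts, labels)) / positives
        if abs(far - frr) < best_gap:     # keep the point where FAR ~= FRR
            best_gap, best = abs(far - frr), ((far + frr) / 2, t)
    return best

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]  # toy match scores
labels = [1,   1,   1,   0,   1,   0,   0,   0]    # toy ground truth
rate, threshold = eer(scores, labels)
```

For these toy inputs the sweep finds FAR = FRR = 0.25 at threshold 0.6, so the EER is 0.25. Production evaluations typically interpolate between thresholds on the DET curve rather than taking the closest discrete point, but the principle is the same.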