Track 1 - EmpathyEval

Evaluation of Contextualized Affective Speech

This track evaluates whether omni-models can identify the most appropriate spoken response by combining contextual understanding with affective and paralinguistic reasoning, advancing AI capabilities for empathetic perception in speech.


Task Description

Given a textual context and an audio utterance, with a set of candidate audio responses, select the response that is most empathetic.

Evaluation compares participant predictions against human annotations, so the benchmark emphasizes human-aligned response selection rather than literal semantic overlap alone.

Subtasks

Task 1: Context-Variant

The surrounding conversational context changes, indicating different situations, and the model must determine which candidate response best fits the specific situation.

Illustration for the Context-Variant task.

Data Example (Audio and textual context)

Context 1

During a bachelorette weekend, their friends surprised them with a spa package that included a manicure, and they’ve never been to a nail salon before.

Context 2

After losing a fantasy football bet, the agreed penalty was to get a bright, glittery manicure and keep it for a week.

Utterance Audio
Candidate Responses
A
B
Candidate Responses
A
B

Task 2: Tone-Variant

The model must rely on vocal and paralinguistic cues of the utterance to infer the inner state or emotion of the user, and identify which response is emotionally appropriate for the given utterance.

Illustration for the Tone-Variant task.

Data Example (Audio and textual context)

Context

While helping my mom clear out the guest room before her knee surgery, I pulled out a dusty box labeled "1996-2002" packed with camp Polaroids and birthday party prints. I brought the stack home to sort through after dinner a few nights later.

Utterance Audio: tone 1
Utterance Audio: tone 2
Candidate Responses
A
B
Candidate Responses
A
B

Evaluation Metrics

  • Accuracy: for each correctly predicted item, the accuracy score increases by 1.
  • Grouped bonus: if all items in a context-variant or tone-variant group are predicted correctly, the bonus score increases by 1.
  • Final Score: (Accuracy + Bonus) / (#data + #group).

Dataset

Please download the training set for each subtask from Hugging Face: gracehuggingface/EmpathyEval

Leaderboard

The leaderboard will report context-variant and tone-variant results separately, together with the weighted average of the final scores.

Model Context-variant Tone-variant Avg.
Qwen-Omni
(baseline model)
accuracy / bonus / Final Score accuracy / bonus / Final Score Weighted Average of final score

Public scores will be announced after the evaluation process.