Track 1 - HumOmni

Task Description

Given a textual context, an audio utterance, and a set of candidate audio responses, select the most empathetic response.

Evaluation compares participant predictions against human annotations, so the benchmark emphasizes human-aligned response selection rather than literal semantic overlap alone.

Subtasks

Task 1: Context-Variant

The same utterance is paired with different conversational contexts, each describing a different situation, and the model must determine which candidate response best fits each specific situation.

Illustration for the Context-Variant task.

Data Example (Audio and textual context)

Context 1 During a bachelorette weekend, their friends surprised them with a spa package that included a manicure, and they’ve never been to a nail salon before.	Context 2 After losing a fantasy football bet, the agreed penalty was to get a bright, glittery manicure and keep it for a week.
Utterance Audio
Candidate Responses A B ✓	Candidate Responses A B ✓

Task 2: Tone-Variant

The model must rely on vocal and paralinguistic cues of the utterance to infer the inner state or emotion of the user, and identify which response is emotionally appropriate for the given utterance.

Data Example (Audio and textual context)

Context While helping my mom clear out the guest room before her knee surgery, I pulled out a dusty box labeled "1996-2002" packed with camp Polaroids and birthday party prints. I brought the stack home to sort through after dinner a few nights later.
Utterance Audio: tone 1	Utterance Audio: tone 2
Candidate Responses A ✓ B	Candidate Responses A ✓ B

Evaluation Metrics

Accuracy: each correctly predicted item adds 1 to the accuracy score.
Grouped bonus: each context-variant or tone-variant group whose items are all predicted correctly adds 1 to the bonus score.
Final Score: (Accuracy + Bonus) / (#data + #group), where #data is the number of items and #group is the number of groups.
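
The three metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the official scoring script: the function name `final_score`, the dictionary-based input layout, and the item and group identifiers are all assumptions made for the example.

```python
from collections import defaultdict


def final_score(predictions, labels, groups):
    """Return (accuracy, bonus, final_score) for one subtask.

    Hypothetical input layout (not specified by the task page):
      predictions: dict mapping item id -> predicted choice, e.g. "A"
      labels:      dict mapping item id -> human-annotated choice
      groups:      dict mapping item id -> group id
    """
    accuracy = 0
    # A group counts as fully correct until one of its items is wrong.
    group_all_correct = defaultdict(lambda: True)

    for item_id, gold in labels.items():
        # A missing prediction is treated as incorrect.
        correct = predictions.get(item_id) == gold
        accuracy += int(correct)
        group_id = groups[item_id]
        group_all_correct[group_id] = group_all_correct[group_id] and correct

    bonus = sum(group_all_correct.values())
    n_data = len(labels)
    n_group = len(group_all_correct)
    return accuracy, bonus, (accuracy + bonus) / (n_data + n_group)


# Toy example: two groups of two items each.
labels = {"a1": "B", "a2": "A", "b1": "A", "b2": "B"}
groups = {"a1": "g1", "a2": "g1", "b1": "g2", "b2": "g2"}
predictions = {"a1": "B", "a2": "A", "b1": "A", "b2": "A"}

accuracy, bonus, score = final_score(predictions, labels, groups)
```

In the toy example, three of four items are correct and only group `g1` is fully correct, so the final score is (3 + 1) / (4 + 2). The bonus rewards consistency within a group: a model that gets one item per group right by chance earns accuracy points but no bonus.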

Dataset

Please download the training set for each subtask from Hugging Face: gracehuggingface/EmpathyEval

Leaderboard

The leaderboard will report context-variant and tone-variant results separately, together with the weighted average of the final scores.

Model	Context-variant	Tone-variant	Avg.
Qwen-Omni (baseline model)	Accuracy / Bonus / Final Score	Accuracy / Bonus / Final Score	Weighted average of final scores
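
The Avg. column combines the two per-subtask final scores. The page does not state which weights are used, so the sketch below takes them as explicit arguments. The function name `weighted_average` and the example weights are purely illustrative assumptions, not the organizers' values.

```python
def weighted_average(scores, weights):
    """Weighted mean of per-subtask final scores.

    The weights are NOT specified by the task page; the caller must
    supply them (for example, one weight per subtask).
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total


# Hypothetical final scores for context-variant and tone-variant,
# with made-up weights of 1 and 3.
avg = weighted_average([0.5, 1.0], [1, 3])
```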

Public scores will be announced after the evaluation process.