AuditoryHuM: Auditory Scene Label Generation and Clustering using Human-MLLM Collaboration

Henry Zhong¹, Jörg M. Buchholz¹, Julian Maclaren², Simon Carlile², Richard F. Lyon²

¹ Macquarie University, ² Google Research Australia




BibTeX

@misc{zhong2026auditoryhumauditoryscenelabel,
      title={AuditoryHuM: Auditory Scene Label Generation and Clustering using Human-MLLM Collaboration}, 
      author={Henry Zhong and Jörg M. Buchholz and Julian Maclaren and Simon Carlile and Richard F. Lyon},
      year={2026},
      eprint={2602.19409},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2602.19409}, 
}
                    

Samples With Alignment Score

This website presents samples of the audio, labels, and alignment scores used in the AuditoryHuM paper. The labels were generated using Qwen 2.5 Omni 3B, and the alignment scores were generated using Human-CLAP. Low alignment scores typically indicate one or more of the following: very noisy audio, very quiet audio, synthetically altered audio, or a complicated, multilayered soundscape.
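As a rough illustration of what an alignment score measures, the sketch below assumes that Human-CLAP, like other CLAP-style models, scores an audio clip against a text label by the cosine similarity of their embeddings. That assumption would explain why the scores on this page fall between -1 and 1, but it is not confirmed by this page. The function name `alignment_score` and the toy vectors are hypothetical and stand in for real model embeddings.

```python
import math


def alignment_score(audio_emb, text_emb):
    """Cosine similarity between two equal-length embedding vectors.

    Assumption: the alignment score is a cosine similarity, as in
    CLAP-style models. Real embeddings would come from the audio and
    text encoders; here they are plain lists of floats.
    """
    if len(audio_emb) != len(text_emb):
        raise ValueError("embeddings must have the same length")
    dot = sum(a * t for a, t in zip(audio_emb, text_emb))
    norm_a = math.sqrt(sum(a * a for a in audio_emb))
    norm_t = math.sqrt(sum(t * t for t in text_emb))
    if norm_a == 0.0 or norm_t == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_t)


# Toy 3-dimensional vectors (hypothetical, not real model outputs).
audio = [1.0, 0.0, 0.0]
same_direction = alignment_score(audio, [2.0, 0.0, 0.0])   # 1.0
orthogonal = alignment_score(audio, [0.0, 1.0, 0.0])       # 0.0
opposite = alignment_score(audio, [-1.0, 0.0, 0.0])        # -1.0
print(same_direction, orthogonal, opposite)
```

Under this reading, a score near 1 means the label describes the audio well, a score near 0 means the label is unrelated, and a negative score means the label points away from the audio content.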


Top 3 Audio Alignment Scores for ADVANCE

| Filename | Label | Alignment Score |
|---|---|---|
| 03231.wav | birds tweeting | 0.911 |
| 08965.wav | toilet flushing | 0.910 |
| 06952.wav | snoring | 0.900 |

Bottom 3 Audio Alignment Scores for ADVANCE

| Filename | Label | Alignment Score |
|---|---|---|
| 07090.wav | clicking | -0.261 |
| 08545.wav | travel | -0.053 |
| 09182.wav | beep | -0.002 |

Top 3 Audio Alignment Scores for AHEAD-DS

| Filename | Label | Alignment Score |
|---|---|---|
| wind_turbulence_00348.wav | wind blowing | 0.946 |
| wind_turbulence_00154.wav | wind blowing | 0.940 |
| wind_turbulence_00362.wav | wind blowing | 0.939 |

Bottom 3 Audio Alignment Scores for AHEAD-DS

| Filename | Label | Alignment Score |
|---|---|---|
| speech_in_wind_turbulence_00280.wav | <blank> | -0.235 |
| interfering_speakers_00948.wav | beep | -0.225 |
| speech_in_wind_turbulence_00165.wav | <blank> | -0.217 |

Top 3 Audio Alignment Scores for TAU 2019

| Filename | Label | Alignment Score |
|---|---|---|
| street_traffic-lisbon-1171-45694-a.wav | car passing | 0.829 |
| street_traffic-lisbon-1008-43763-a.wav | car passing | 0.828 |
| street_traffic-stockholm-174-5344-a.wav | car passing | 0.822 |

Bottom 3 Audio Alignment Scores for TAU 2019

| Filename | Label | Alignment Score |
|---|---|---|
| metro_station-london-76-2113-a.wav | footsteps | -0.197 |
| metro-stockholm-55-1618-a.wav | beep | -0.068 |
| tram-barcelona-179-5537-a.wav | beep | -0.054 |
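
For readers who want to build similar top and bottom listings from their own scored clips, the following is a minimal sketch. It is not the authors' selection code, which this page does not describe. The helper name `top_and_bottom` is hypothetical, and the sample rows are the six ADVANCE entries shown on this page.

```python
def top_and_bottom(rows, k=3):
    """Return the k highest- and k lowest-scoring rows.

    Each row is a (filename, label, score) tuple. The top list is
    ordered from highest score down; the bottom list is ordered from
    lowest score up, matching the layout of the tables on this page.
    """
    top = sorted(rows, key=lambda r: r[2], reverse=True)[:k]
    bottom = sorted(rows, key=lambda r: r[2])[:k]
    return top, bottom


# The six ADVANCE rows listed on this page, in arbitrary order.
advance_rows = [
    ("08545.wav", "travel", -0.053),
    ("03231.wav", "birds tweeting", 0.911),
    ("09182.wav", "beep", -0.002),
    ("06952.wav", "snoring", 0.900),
    ("07090.wav", "clicking", -0.261),
    ("08965.wav", "toilet flushing", 0.910),
]

top, bottom = top_and_bottom(advance_rows, k=3)
for filename, label, score in top + bottom:
    print(f"{filename}\t{label}\t{score:.3f}")
```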