@misc{zhong2026auditoryhumauditoryscenelabel,
title={AuditoryHuM: Auditory Scene Label Generation and Clustering using Human-MLLM Collaboration},
author={Henry Zhong and Jörg M. Buchholz and Julian Maclaren and Simon Carlile and Richard F. Lyon},
year={2026},
eprint={2602.19409},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2602.19409},
}
Samples With Alignment Score
This website presents samples of the audio, labels, and alignment scores used in the AuditoryHuM paper. The labels were generated using Qwen 2.5 Omni 3B, and the alignment scores were generated using Human-CLAP. A low alignment score typically indicates one or more of the following: very noisy audio, very quiet audio, synthetically altered audio, or a complicated multilayered soundscape.
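For readers who want to see how an audio-label alignment score of this kind can be computed and used, here is a minimal sketch. It assumes the score is a CLAP-style cosine similarity between an audio embedding and a text embedding. That is how CLAP-family scores are commonly defined, but this page does not state the exact Human-CLAP computation. The function names, the sample dictionary layout, and the threshold are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed score definition)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as unaligned.
        return 0.0
    return dot / (norm_a * norm_b)


def flag_low_alignment(samples, threshold):
    """Return the ids of samples whose audio/text similarity is below threshold.

    Each sample is a dict with hypothetical keys 'id', 'audio_emb', 'text_emb'.
    """
    flagged = []
    for sample in samples:
        score = cosine_similarity(sample["audio_emb"], sample["text_emb"])
        if score < threshold:
            flagged.append(sample["id"])
    return flagged
```

In practice the embeddings would come from the audio and text encoders of the scoring model, and flagged samples would be candidates for the low-alignment cases listed above (noisy, quiet, synthetically altered, or multilayered audio).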