Boosting Audio-visual Zero-shot Learning with Large Language Models

21 Nov 2023 · Haoxing Chen, Yaohui Li, Yan Hong, Zizheng Huang, Zhuoer Xu, Zhangxuan Gu, Jun Lan, Huijia Zhu, Weiqiang Wang

Audio-visual zero-shot learning aims to recognize unseen categories from paired audio-visual sequences. Recent methods mainly focus on learning aligned and discriminative multi-modal features to improve generalization to unseen categories. However, these approaches ignore the obscure action concepts encoded in category names and often introduce complex network structures with difficult training objectives. In this paper, we propose a simple yet effective framework named Knowledge-aware Distribution Adaptation (KDA) to help the model better grasp novel action content via an external knowledge base. Specifically, we first propose using large language models to generate rich descriptions from category names, which leads to a better understanding of unseen categories. Additionally, we propose a distribution alignment loss as well as a knowledge-aware adaptive margin loss to further improve generalization to unseen categories. Extensive experimental results demonstrate that our proposed KDA outperforms state-of-the-art methods on three popular audio-visual zero-shot learning datasets. Our code will be available at \url{https://github.com/chenhaoxing/KDA}.
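The abstract does not give the exact formulation of the knowledge-aware adaptive margin loss, but one plausible reading is a cosine-similarity classifier whose per-class margin is scaled by how similar the LLM-generated description of each distractor class is to that of the true class, so easily confused classes are pushed further apart. A minimal sketch under that assumption (all function and variable names here are hypothetical, not from the paper):

```python
import numpy as np

def knowledge_aware_margin_loss(feats, class_embs, labels, desc_sims,
                                base_margin=0.2, scale=10.0):
    """Hypothetical sketch of a knowledge-aware adaptive margin loss.

    feats:      (B, D) fused audio-visual features
    class_embs: (C, D) class text embeddings (e.g. from LLM descriptions)
    labels:     (B,)   ground-truth class indices
    desc_sims:  (C, C) pairwise similarity of LLM-generated class descriptions
    """
    # Cosine-similarity logits between samples and class embeddings.
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    w = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    logits = f @ w.T                              # (B, C)

    # Adaptive margin: competitor classes whose descriptions resemble
    # the true class's description receive a larger additive margin.
    margins = base_margin * desc_sims[labels]     # (B, C)
    idx = np.arange(len(labels))
    adj = logits + margins
    adj[idx, labels] = logits[idx, labels]        # no margin on the true class

    # Softmax cross-entropy over scaled, margin-adjusted logits.
    z = scale * adj
    z = z - z.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[idx, labels].mean()
```

Because the margin inflates only competitor logits, increasing `base_margin` strictly increases the training loss, forcing larger separation between semantically close classes. The paper's actual loss may differ; this illustrates the general margin mechanism only.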

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| GZSL Video Classification | ActivityNet-GZSL (cls) | KDA | HM | 17.95 | #1 |
| GZSL Video Classification | ActivityNet-GZSL (cls) | KDA | ZSL | 11.85 | #1 |
| GZSL Video Classification | ActivityNet-GZSL (main) | KDA | HM | 19.67 | #1 |
| GZSL Video Classification | ActivityNet-GZSL (main) | KDA | ZSL | 14.00 | #1 |
| GZSL Video Classification | UCF-GZSL (cls) | KDA | HM | 54.84 | #1 |
| GZSL Video Classification | UCF-GZSL (cls) | KDA | ZSL | 52.66 | #1 |
| GZSL Video Classification | UCF-GZSL (main) | KDA | HM | 41.10 | #1 |
| GZSL Video Classification | UCF-GZSL (main) | KDA | ZSL | 28.05 | #1 |
| GZSL Video Classification | VGGSound-GZSL (cls) | KDA | HM | 9.78 | #1 |
| GZSL Video Classification | VGGSound-GZSL (cls) | KDA | ZSL | 8.32 | #1 |
| GZSL Video Classification | VGGSound-GZSL (main) | KDA | HM | 10.45 | #1 |
| GZSL Video Classification | VGGSound-GZSL (main) | KDA | ZSL | 8.43 | #1 |