Audio Tagging
41 papers with code • 1 benchmark • 8 datasets
Audio tagging is the task of predicting the tags of audio clips. Audio tagging tasks include music tagging, acoustic scene classification, audio event classification, and others.
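As a minimal illustration, audio tagging is usually framed as multi-label classification: the model scores every tag independently and each score is thresholded on its own. The logits and tag names below are made up for the example; in practice the logits come from a trained network.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_tags(logits, tag_names, threshold=0.5):
    """Multi-label decision: each tag is an independent binary choice,
    so a clip can receive zero, one, or several tags at once."""
    return [name for name, z in zip(tag_names, logits)
            if sigmoid(z) >= threshold]

# Illustrative logits for a clip containing music and a barking dog.
tags = predict_tags([2.1, -1.3, 0.4], ["music", "speech", "dog_bark"])
```

The independent-sigmoid formulation (rather than a softmax) is what distinguishes tagging from single-label classification: overlapping sounds are the norm in real audio.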
Libraries
Use these libraries to find Audio Tagging models and implementations.
Latest papers
Perceptual Musical Features for Interpretable Audio Tagging
In the age of music streaming platforms, the task of automatically tagging music audio has garnered significant attention, driving researchers to devise methods aimed at enhancing performance metrics on standard datasets.
Dynamic Convolutional Neural Networks as Efficient Pre-trained Audio Models
Audio Spectrogram Transformers are excellent at exploiting large datasets, creating powerful pre-trained models that surpass CNNs when fine-tuned on downstream tasks.
Audio classification with Dilated Convolution with Learnable Spacings
Dilated convolution with learnable spacings (DCLS) is a recent convolution method in which the positions of the kernel elements are learned throughout training by backpropagation.
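The core idea can be sketched in one dimension: each kernel weight sits at a real-valued offset, and linear interpolation between the two nearest input samples makes that offset differentiable, so it can be trained by backpropagation. This is an illustrative sketch of the mechanism only, not the paper's implementation (which operates on 2-D spectrogram inputs):

```python
import math

def dcls_conv1d(x, weights, positions):
    """1-D convolution whose kernel taps sit at fractional positions.
    Linear interpolation between the two neighbouring input samples
    keeps the output differentiable w.r.t. each position, which is
    what lets the spacings be learned. Zero padding outside the input."""
    out = []
    for i in range(len(x)):
        acc = 0.0
        for w, p in zip(weights, positions):
            j = i + p                    # real-valued sample index
            j0 = math.floor(j)
            frac = j - j0
            v0 = x[int(j0)] if 0 <= j0 < len(x) else 0.0
            v1 = x[int(j0) + 1] if 0 <= j0 + 1 < len(x) else 0.0
            acc += w * ((1 - frac) * v0 + frac * v1)
        out.append(acc)
    return out

# One tap of weight 1.0 placed halfway between samples: the output is
# the midpoint of each neighbouring pair.
y = dcls_conv1d([1.0, 2.0, 3.0, 4.0], [1.0], [0.5])
```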
Audio Tagging on an Embedded Hardware Platform
In this paper, we analyze how the performance of large-scale pretrained audio neural networks designed for audio pattern recognition changes when deployed on hardware such as a Raspberry Pi.
Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks
In order to tackle both clip-level and frame-level tasks, this paper proposes Audio Teacher-Student Transformer (ATST), with a clip-level version (named ATST-Clip) and a frame-level version (named ATST-Frame), responsible for learning clip-level and frame-level representations, respectively.
E-PANNs: Sound Recognition Using Efficient Pre-trained Audio Neural Networks
Sounds carry an abundance of information about activities and events in our everyday environment, such as traffic noise, road works, music, or people talking.
Robust Cross-Modal Knowledge Distillation for Unconstrained Videos
However, such semantic consistency from the synchronization is hard to guarantee in unconstrained videos, due to the irrelevant modality noise and differentiated semantic correlation.
Zorro: the masked multimodal transformer
Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network - thus requiring very little fusion engineering.
Ontology-aware Learning and Evaluation for Audio Tagging
The proposed metric, ontology-aware mean average precision (OmAP) addresses the weaknesses of mAP by utilizing the AudioSet ontology information during the evaluation.
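For context, the per-class average precision that mAP averages can be computed as below; OmAP then additionally reweights the evaluation using the AudioSet ontology, which is specific to the paper and not reproduced here:

```python
def average_precision(scores, labels):
    """Standard average precision for one tag: rank clips by score,
    then average the precision measured at each positive clip.
    `labels` are 0/1 ground-truth indicators."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, ap = 0, 0.0
    total = sum(labels)
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            ap += hits / rank            # precision at this positive
    return ap / total if total else 0.0

# Two positives: one ranked first (precision 1.0), one ranked third
# (precision 2/3), giving AP = (1 + 2/3) / 2 = 5/6.
ap = average_precision([0.9, 0.8, 0.1], [1, 0, 1])
```

mAP is then the unweighted mean of this quantity over all classes, which is exactly the property OmAP modifies: errors between ontologically close classes are treated differently from errors between distant ones.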
Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation
We provide models of different complexity levels, scaling from low-complexity models up to a new state-of-the-art performance of 0.483 mAP on AudioSet.
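A common way to distill a Transformer teacher into a CNN student for multi-label tagging is to train the student against a blend of the teacher's sigmoid outputs and the ground-truth labels. The loss below is a generic sketch of that idea; the `alpha` blend and the exact loss form are assumptions for illustration, not the paper's recipe:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def distillation_loss(student_logits, teacher_probs, true_labels, alpha=0.5):
    """Binary cross-entropy of the student's predictions against a mix of
    teacher soft targets and hard labels (hypothetical blend for the sketch).
    alpha=1.0 trains purely on the teacher; alpha=0.0 purely on labels."""
    eps = 1e-12  # numerical guard for log
    loss = 0.0
    for z, t, y in zip(student_logits, teacher_probs, true_labels):
        target = alpha * t + (1 - alpha) * y   # blended per-tag target
        p = sigmoid(z)
        loss += -(target * math.log(p + eps)
                  + (1 - target) * math.log(1 - p + eps))
    return loss / len(student_logits)

# Pure teacher target of 0.5 with a student logit of 0 (p = 0.5)
# gives the maximum-entropy BCE value, ln 2.
kd = distillation_loss([0.0], [0.5], [1.0], alpha=1.0)
```

Soft teacher targets carry inter-class similarity information that hard labels lack, which is a standard motivation for distilling large audio Transformers into cheaper CNNs.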