Speaker Identification
61 papers with code • 4 benchmarks • 4 datasets
Most implemented papers
Speaker Recognition from Raw Waveform with SincNet
Rather than employing standard hand-crafted features, SincNet learns low-level speech representations directly from raw waveforms, potentially allowing the network to better capture important narrow-band speaker characteristics such as pitch and formants.
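SincNet's key idea is that each first-layer filter is a band-pass filter parameterized only by its two learnable cutoff frequencies, built as the difference of two sinc low-pass kernels. A minimal numpy sketch of one such filter (fixed cutoffs and a typical kernel size here stand in for the learned parameters):

```python
import numpy as np

def sincnet_filter(f1, f2, kernel_size=251, sample_rate=16000):
    """Band-pass FIR kernel parameterized by cutoffs f1 < f2 (Hz).
    In SincNet the cutoffs are learned by backprop; here they are
    fixed values chosen only for illustration."""
    # symmetric time axis in seconds around n = 0
    n = np.arange(-(kernel_size // 2), kernel_size // 2 + 1) / sample_rate
    # difference of two low-pass sinc kernels yields a band-pass filter
    h = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    # window to reduce ripples from truncating the ideal filter
    h *= np.hamming(kernel_size)
    return h / np.max(np.abs(h))
```

With, say, `sincnet_filter(300.0, 3000.0)`, the resulting kernel passes roughly the 300–3000 Hz band; a full layer would stack many such kernels and convolve them with the raw waveform.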
Deep Speaker: an End-to-End Neural Speaker Embedding System
We present Deep Speaker, a neural speaker embedding system that maps utterances to a hypersphere where speaker similarity is measured by cosine similarity.
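Scoring in such a system reduces to L2-normalizing each utterance embedding (projecting it onto the unit hypersphere) and taking a dot product. A small sketch of that scoring step, independent of how the embeddings themselves are produced:

```python
import numpy as np

def to_hypersphere(e):
    # project an embedding onto the unit hypersphere
    return e / np.linalg.norm(e)

def cosine_score(e1, e2):
    """Cosine similarity of two embeddings: +1 for identical
    directions (likely same speaker), near 0 for unrelated ones."""
    return float(np.dot(to_hypersphere(e1), to_hypersphere(e2)))
```

A verification decision is then just a threshold on `cosine_score`; same-speaker pairs should score close to 1.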
ATST: Audio Representation Learning with Teacher-Student Transformer
Self-supervised learning (SSL) learns knowledge from a large amount of unlabeled data, and then transfers that knowledge to a specific problem with a limited amount of labeled data.
Masked Autoencoders that Listen
Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers.
AM-MobileNet1D: A Portable Model for Speaker Recognition
To address this demand, we propose a portable model, Additive Margin MobileNet1D (AM-MobileNet1D), for speaker identification on mobile devices.
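The "additive margin" in the name refers to the AM-softmax loss: the target class's cosine logit is reduced by a fixed margin before scaling, forcing embeddings of the same speaker to cluster more tightly. A minimal numpy sketch of the logit computation (margin and scale are typical values, not necessarily the paper's):

```python
import numpy as np

def am_softmax_logits(emb, weights, labels, margin=0.35, scale=30.0):
    """Additive-margin logits: s * (cos(theta) - m) for the target
    class, s * cos(theta) for the rest. `weights` has one column
    per class; values of m and s here are common defaults."""
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos = e @ w                      # (batch, classes) cosine similarities
    logits = cos.copy()
    logits[np.arange(len(labels)), labels] -= margin  # penalize target class
    return scale * logits
```

These logits then go into a standard cross-entropy loss during training.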
AutoSpeech: Neural Architecture Search for Speaker Recognition
Speaker recognition systems based on Convolutional Neural Networks (CNNs) are often built with off-the-shelf backbones such as VGG-Net or ResNet.
Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation
We evaluate the representations on two downstream tasks: speaker identification and phoneme classification.
UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training
We integrate the proposed methods into the HuBERT framework.
SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing
Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning.
Learning Speaker Representations with Mutual Information
Mutual Information (MI) or similar measures of statistical dependence are promising tools for learning these representations in an unsupervised way.
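As a reminder of the quantity being estimated, mutual information measures how far a joint distribution is from the product of its marginals. A small discrete-case computation (the paper works with neural estimators of MI between learned representations, not with explicit probability tables like this):

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in nats from a joint probability table p_xy
    (rows index X, columns index Y). Zero iff X and Y are
    independent; illustrative of the dependence measure only."""
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of X
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of Y
    mask = p_xy > 0                          # avoid log(0) terms
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x * p_y)[mask])))
```

For a perfectly correlated binary pair (joint `[[0.5, 0], [0, 0.5]]`) this gives `log 2` nats, and `0` when the joint factorizes into its marginals.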