Speaker recognition is the process of identifying or verifying the identity of a person from their speech segments.
Deep neural networks can learn complex and abstract representations that are built progressively by combining simpler ones.
Rather than employing standard hand-crafted features, these CNNs learn low-level speech representations directly from waveforms, potentially allowing the network to better capture important narrow-band speaker characteristics such as pitch and formants.
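A minimal sketch of the idea behind such parameterized waveform filters: each convolutional kernel is a band-pass filter determined only by its two cutoff frequencies, which in the actual network would be learnable parameters. The specific kernel length, window, and cutoffs below are illustrative assumptions, not the published configuration.

```python
import numpy as np

def sinc_bandpass_kernel(f_low, f_high, length=101, fs=16000):
    """Band-pass FIR kernel parameterized only by its cutoff frequencies
    (in Hz). In a sinc-parameterized network layer, f_low and f_high
    would be the trainable weights instead of the full kernel."""
    n = np.arange(length) - (length - 1) / 2
    # Difference of two windowed low-pass sinc filters yields a band-pass.
    lp_high = 2 * f_high / fs * np.sinc(2 * f_high / fs * n)
    lp_low = 2 * f_low / fs * np.sinc(2 * f_low / fs * n)
    return (lp_high - lp_low) * np.hamming(length)

# Apply a 300-3400 Hz band-pass kernel to a one-second 1 kHz tone,
# which lies inside the passband and should survive filtering.
fs = 16000
t = np.arange(fs) / fs
wave = np.sin(2 * np.pi * 1000 * t)
kernel = sinc_bandpass_kernel(300.0, 3400.0, fs=fs)
filtered = np.convolve(wave, kernel, mode="same")
```

Because only the band edges are trained, the layer has far fewer parameters than a free-form convolution and its learned filters remain interpretable as frequency bands.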
In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker.
Also, we validate the use of parameterized filterbanks and show that complex-valued representations and masks are beneficial in all conditions.
We present Deep Speaker, a neural speaker embedding system that maps utterances to a hypersphere where speaker similarity is measured by cosine similarity.
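The scoring step of such an embedding system can be sketched as follows: L2-normalize the embeddings so they lie on the unit hypersphere, then score a pair of utterances by their dot product (cosine similarity). The 4-dimensional vectors below are hypothetical stand-ins for network outputs.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors; after L2
    normalization it is just a dot product on the unit hypersphere."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

# Toy embeddings standing in for network outputs (hypothetical values).
enroll = np.array([0.3, 1.2, -0.5, 0.7])
same_spk = np.array([0.25, 1.1, -0.4, 0.8])   # similar direction
diff_spk = np.array([-1.0, 0.2, 0.9, -0.3])   # dissimilar direction

s_same = cosine_similarity(enroll, same_spk)
s_diff = cosine_similarity(enroll, diff_spk)
```

A verification decision then reduces to thresholding the score, and identification to picking the enrolled speaker with the highest score.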
In our experiments, we show that by altering the representation along different dimensions, the model learns to encode distinct aspects of speech.
We present Mockingjay, a new speech representation learning approach in which bidirectional Transformer encoders are pre-trained on large amounts of unlabeled speech.
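Such bidirectional pre-training typically works by corrupting random frames of the acoustic input and training the encoder to reconstruct them. The sketch below shows only the masking and loss bookkeeping, not the Transformer itself; the 15% mask ratio and L1 loss are illustrative assumptions rather than the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_frames(features, mask_ratio=0.15):
    """Zero out a random subset of time frames. Returns the corrupted
    input and a boolean mask marking the frames to be reconstructed.
    The 15% ratio is an assumption for illustration."""
    n_frames = features.shape[0]
    n_masked = max(1, int(round(mask_ratio * n_frames)))
    idx = rng.choice(n_frames, size=n_masked, replace=False)
    mask = np.zeros(n_frames, dtype=bool)
    mask[idx] = True
    corrupted = features.copy()
    corrupted[mask] = 0.0
    return corrupted, mask

def reconstruction_loss(pred, target, mask):
    """L1 reconstruction loss, computed only on the masked frames."""
    return float(np.abs(pred[mask] - target[mask]).mean())

feats = rng.standard_normal((100, 80))   # 100 frames x 80 mel bins
corrupted, mask = mask_frames(feats)
# A perfect reconstruction of the originals would drive the loss to zero.
loss_perfect = reconstruction_loss(feats, feats, mask)
```

Because the encoder sees context on both sides of each masked frame, no labels are needed: the speech itself supplies the training targets.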
The objective of this paper is speaker recognition "in the wild", where utterances may be of variable length and may also contain irrelevant signals.
Speaker recognition systems based on Convolutional Neural Networks (CNNs) are often built with off-the-shelf backbones such as VGG-Net or ResNet.