Speaker
Description
Recently, self-supervised Transformer-based models have become an integral part of state-of-the-art speech modeling and are being integrated into many speech applications, such as Automatic Speech Recognition (ASR), Speaker Verification (SV), Language Identification (LID), and emotion detection. These models are trained on datasets comprising tens or even hundreds of thousands of hours of speech and can reach several hundred million parameters. In my talk, I will give a brief overview of their architecture and of a self-supervised training paradigm based on masked speech prediction. I will then describe a speaker verification use case in which such pre-trained models are fine-tuned to serve as powerful feature extractors for speaker embeddings. Finally, I will discuss methods for fine-tuning these large models when only a relatively small amount of labeled target-domain data is available.
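As a rough illustration of the feature-extraction step described above, the sketch below loads a pre-trained self-supervised model through the Hugging Face transformers library and mean-pools its frame-level outputs into a single utterance-level speaker embedding. The WavLM checkpoint name, the placeholder waveform, and the simple mean pooling are illustrative assumptions, not necessarily the setup used in the talk.

```python
# Minimal sketch: speaker embedding from a pre-trained self-supervised model.
# Assumptions: Hugging Face transformers with a WavLM checkpoint; mean pooling
# stands in for whatever pooling / back-end the actual system uses.
import torch
from transformers import Wav2Vec2FeatureExtractor, WavLMModel

extractor = Wav2Vec2FeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
model.eval()

waveform = torch.randn(16000 * 2)  # placeholder: 2 s of 16 kHz audio

inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    frames = model(**inputs).last_hidden_state  # (batch, frames, hidden)

speaker_embedding = frames.mean(dim=1)  # (batch, hidden) utterance-level vector
```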
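On the question of fine-tuning such large models with little labeled data, one common strategy is to freeze most of the pre-trained encoder and train only a lightweight head together with the top few Transformer layers, using a smaller learning rate for the unfrozen encoder parameters. The sketch below illustrates that idea; the checkpoint, the number of unfrozen layers, the 100-speaker head, and the learning rates are placeholder assumptions rather than the specific methods covered in the talk.

```python
# Minimal sketch: partial fine-tuning when labeled target data is scarce.
# Assumptions: WavLM encoder, a linear head over 100 speakers, only the top
# two Transformer layers unfrozen, discriminative learning rates.
import torch
from torch import nn
from transformers import WavLMModel

class SpeakerClassifier(nn.Module):
    def __init__(self, num_speakers: int):
        super().__init__()
        self.encoder = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
        self.head = nn.Linear(self.encoder.config.hidden_size, num_speakers)

    def forward(self, input_values):
        frames = self.encoder(input_values).last_hidden_state
        return self.head(frames.mean(dim=1))  # per-speaker logits

model = SpeakerClassifier(num_speakers=100)

# Freeze the whole encoder, then unfreeze only its top two Transformer layers.
for p in model.encoder.parameters():
    p.requires_grad = False
for layer in model.encoder.encoder.layers[-2:]:
    for p in layer.parameters():
        p.requires_grad = True

# The new head gets a larger learning rate than the unfrozen encoder layers.
optimizer = torch.optim.AdamW([
    {"params": model.head.parameters(), "lr": 1e-3},
    {"params": [p for p in model.encoder.parameters() if p.requires_grad],
     "lr": 1e-5},
])
```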