no code implementations • 12 Nov 2023 • Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Ke Li, Junteng Jia, Yuan Shangguan, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer
In this work, we extend the instruction-tuned Llama-2 model with end-to-end general-purpose speech processing and reasoning abilities while maintaining the wide range of original LLM capabilities, without using any carefully curated paired data.
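The abstract describes coupling a speech encoder to the instruction-tuned LLM end to end. A minimal sketch of that general recipe, assuming a HuggingFace-style decoder that accepts inputs_embeds; every module and variable name below is illustrative, not taken from the paper:

    # Hypothetical sketch: project audio-encoder frames into the LLM's
    # token-embedding space and prepend them to the text prompt.
    import torch
    import torch.nn as nn

    class SpeechLLM(nn.Module):
        def __init__(self, audio_encoder, llm, audio_dim, llm_dim):
            super().__init__()
            self.audio_encoder = audio_encoder         # any acoustic encoder
            self.proj = nn.Linear(audio_dim, llm_dim)  # map audio to LLM space
            self.llm = llm                             # decoder-only LLM

        def forward(self, audio, text_ids):
            audio_emb = self.proj(self.audio_encoder(audio))  # (B, T_a, llm_dim)
            text_emb = self.llm.get_input_embeddings()(text_ids)
            inputs = torch.cat([audio_emb, text_emb], dim=1)  # speech first
            return self.llm(inputs_embeds=inputs)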
no code implementations • 19 Sep 2023 • Egor Lakomkin, Chunyang Wu, Yassir Fathullah, Ozlem Kalinli, Michael L. Seltzer, Christian Fuegen
Overall, we demonstrate that by adding only a handful of trainable parameters via adapters, we can unlock contextualized speech recognition capability for the pretrained LLM while keeping the same text-only input functionality.
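As a rough illustration of the adapter idea (a residual bottleneck added to an otherwise frozen backbone), the following sketch shows the standard construction; the bottleneck width and placement are assumptions, not the paper's configuration:

    import torch
    import torch.nn as nn

    class Adapter(nn.Module):
        def __init__(self, dim, bottleneck=64):
            super().__init__()
            self.down = nn.Linear(dim, bottleneck)
            self.up = nn.Linear(bottleneck, dim)
            nn.init.zeros_(self.up.weight)  # adapter starts as the identity
            nn.init.zeros_(self.up.bias)

        def forward(self, x):
            return x + self.up(torch.relu(self.down(x)))  # residual bottleneck

    def freeze_backbone(llm: nn.Module) -> None:
        # Freeze every pretrained weight; only adapter parameters stay trainable.
        for p in llm.parameters():
            p.requires_grad = False

Zero-initializing the up-projection makes each adapter a no-op at the start of training, so the pretrained model's behavior is preserved until the adapters learn something useful.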
no code implementations • 21 Jul 2023 • Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, Wenhan Xiong, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer
Furthermore, we perform ablation studies investigating whether the LLM can be kept completely frozen during training to maintain its original capabilities, the effect of scaling up the audio encoder, and the effect of increasing the audio encoder stride to generate fewer embeddings.
Abstractive Text Summarization • Automatic Speech Recognition • +3
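The striding ablation mentioned above reduces the number of audio embeddings the LLM must attend over. One common way to get a larger effective stride is to stack adjacent encoder frames, as in this illustrative helper (the factor and shapes are assumptions):

    import torch

    def stack_frames(x: torch.Tensor, k: int) -> torch.Tensor:
        """x: (batch, frames, dim) -> (batch, frames // k, dim * k)."""
        b, t, d = x.shape
        t = (t // k) * k                  # drop any remainder frames
        return x[:, :t].reshape(b, t // k, d * k)

The stacked dim * k vectors would then typically be projected back down to the LLM embedding size by a linear layer.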
no code implementations • CVPR 2023 • Xubo Liu, Egor Lakomkin, Konstantinos Vougioukas, Pingchuan Ma, Honglie Chen, Ruiming Xie, Morrie Doulaty, Niko Moritz, Jáchym Kolář, Stavros Petridis, Maja Pantic, Christian Fuegen
Furthermore, when combined with large-scale pseudo-labeled audio-visual data, SynthVSR yields a new state-of-the-art VSR WER of 16.9% using publicly available data only, surpassing recent state-of-the-art approaches trained with 29 times more non-public machine-transcribed video data (90,000 hours).
no code implementations • 7 Nov 2022 • Roshan Sharma, Weipeng He, Ju Lin, Egor Lakomkin, Yang Liu, Kaustubh Kalgaonkar
In this paper, we first demonstrate that egocentric visual information is helpful for noise suppression.
1 code implementation • EMNLP 2018 • Egor Lakomkin, Sven Magg, Cornelius Weber, Stefan Wermter
In this paper, we describe KT-Speech-Crawler: an approach for automatic dataset construction for speech recognition by crawling YouTube videos.
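A pipeline like this has to filter crawled caption segments before adding them to a corpus. Below is a plausible filtering step of that kind; the thresholds are assumptions for illustration, not the values used by KT-Speech-Crawler:

    import re

    def keep_segment(text: str, start: float, end: float) -> bool:
        """Keep only caption segments likely to be cleanly aligned speech."""
        dur = end - start
        if not 1.0 <= dur <= 10.0:        # plausible utterance length in seconds
            return False
        if re.search(r"\[(music|applause)\]", text, re.I):
            return False                   # non-speech caption markers
        rate = len(text.split()) / dur     # words per second
        return 0.5 <= rate <= 5.0          # plausible speaking rate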
no code implementations • 28 Feb 2019 • Egor Lakomkin, Mohammad Ali Zamani, Cornelius Weber, Sven Magg, Stefan Wermter
We argue that using ground-truth transcriptions during training and evaluation phases leads to a significant discrepancy in performance compared to real-world conditions, as the spoken text has to be recognized on the fly and can contain speech recognition mistakes.
Automatic Speech Recognition (ASR) • +4
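The setup argued for above replaces ground-truth transcripts with recognizer output at both training and evaluation time. Schematically, where asr and emotion_model stand in for any recognizer/classifier pair:

    def predict_emotion(audio, asr, emotion_model):
        hypothesis = asr.transcribe(audio)  # on-the-fly recognition, may contain errors
        return emotion_model(hypothesis)    # the classifier sees the noisy text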
no code implementations • 6 Apr 2018 • Egor Lakomkin, Mohammad Ali Zamani, Cornelius Weber, Sven Magg, Stefan Wermter
Speech emotion recognition (SER) is an important aspect of effective human-robot collaboration and has received considerable attention from the research community.
no code implementations • 3 Apr 2018 • Egor Lakomkin, Mohammad Ali Zamani, Cornelius Weber, Sven Magg, Stefan Wermter
Acoustically expressed emotions can make communication with a robot more efficient.
no code implementations • 30 Mar 2018 • Egor Lakomkin, Chandrakant Bothe, Stefan Wermter
Given the text of a tweet and its emotion category (anger, joy, fear, and sadness), the participants were asked to build a system that assigns emotion intensity values.
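A minimal sketch of the task interface, not the submitted system: condition a text representation on the emotion category and regress an intensity in [0, 1]. All dimensions are placeholders:

    import torch
    import torch.nn as nn

    class IntensityRegressor(nn.Module):
        def __init__(self, text_dim, n_emotions=4, hidden=64):
            super().__init__()
            self.emotion_emb = nn.Embedding(n_emotions, 32)  # anger/joy/fear/sadness
            self.head = nn.Sequential(
                nn.Linear(text_dim + 32, hidden), nn.ReLU(), nn.Linear(hidden, 1))

        def forward(self, tweet_feats, emotion_id):
            z = torch.cat([tweet_feats, self.emotion_emb(emotion_id)], dim=-1)
            return torch.sigmoid(self.head(z)).squeeze(-1)   # intensity in [0, 1]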
no code implementations • IJCNLP 2017 • Egor Lakomkin, Cornelius Weber, Sven Magg, Stefan Wermter
Acoustic emotion recognition aims to categorize the affective state of the speaker and is still a difficult task for machine learning models.
no code implementations • EACL 2017 • Egor Lakomkin, Cornelius Weber, Stefan Wermter
In this work, we tackle the problem of speech emotion classification.
no code implementations • 14 Mar 2018 • Pablo Barros, Nikhil Churamani, Egor Lakomkin, Henrique Siqueira, Alexander Sutherland, Stefan Wermter
This paper is the basis paper for the accepted IJCNN challenge One-Minute Gradual-Emotion Recognition (OMG-Emotion), by which we hope to foster long-term emotion classification using neural models for the benefit of the IJCNN community.
Human-Computer Interaction
no code implementations • WS 2017 • Egor Lakomkin, Chandrakant Bothe, Stefan Wermter
Given the text of a tweet and its emotion category (anger, joy, fear, and sadness), the participants were asked to build a system that assigns emotion intensity values.