no code implementations • 2 Apr 2024 • Maksim Dzabraev, Alexander Kunitsyn, Andrei Ivaniuta
In this work, we present an unsupervised method for enhancing an image captioning model (in our case, BLIP2) using reinforcement learning, with vision-language models such as CLIP and BLIP2-ITM serving as reward models.
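The core idea described above — scoring a generated caption against the image with a vision-language model and using that score as an RL reward — can be sketched as follows. This is an illustrative stand-in, not the paper's actual code: the embeddings are toy vectors, and `caption_reward` substitutes a plain cosine similarity for a real CLIP or BLIP2-ITM matching score.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def caption_reward(image_emb, caption_emb):
    """Reward in [0, 1]: rescaled image-text similarity, standing in
    for a CLIP / BLIP2-ITM matching score (hypothetical simplification)."""
    return (cosine_similarity(image_emb, caption_emb) + 1.0) / 2.0

def reinforce_weight(reward, baseline):
    """Advantage that scales the log-likelihood gradient of a sampled
    caption in REINFORCE-style (e.g. self-critical) training."""
    return reward - baseline
```

In a self-critical setup, `baseline` would typically be the reward of the greedily decoded caption, so sampled captions that score better than the greedy one are reinforced.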
no code implementations • 14 Mar 2022 • Alexander Kunitsyn, Maksim Kalashnikov, Maksim Dzabraev, Andrei Ivaniuta
In this work, we present a new state of the art on the text-to-video retrieval task on MSR-VTT, LSMDC, MSVD, YouCook2, and TGIF, obtained with a single model.
Ranked #1 on Video Retrieval on TGIF (using extra training data)
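Text-to-video retrieval of the kind benchmarked above is usually set up by embedding the text query and each candidate video into a shared space and ranking candidates by similarity. A minimal sketch of that ranking step, with toy embeddings and a hypothetical `rank_videos` helper in place of the actual model:

```python
import math

def normalize(v):
    """L2-normalize a vector (guarding against the zero vector)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def rank_videos(query_emb, video_embs):
    """Rank candidate videos by dot product with the text query in a
    shared embedding space -- the standard retrieval setup.

    video_embs: dict mapping video id -> embedding vector.
    Returns video ids sorted from best to worst match.
    """
    q = normalize(query_emb)
    scored = []
    for vid_id, emb in video_embs.items():
        v = normalize(emb)
        scored.append((sum(a * b for a, b in zip(q, v)), vid_id))
    scored.sort(reverse=True)
    return [vid_id for _, vid_id in scored]
```

Retrieval metrics such as Recall@K then just check whether the ground-truth video appears among the first K entries of this ranking.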
3 code implementations • 19 Mar 2021 • Maksim Dzabraev, Maksim Kalashnikov, Stepan Komkov, Aleksandr Petiushko
We present a new state of the art on the text-to-video retrieval task on the MSR-VTT and LSMDC benchmarks, where our model outperforms all previous solutions by a large margin.
Ranked #25 on Video Retrieval on LSMDC (using extra training data)
1 code implementation • 4 Nov 2020 • Stepan Komkov, Maksim Dzabraev, Aleksandr Petiushko
In this paper, we explore the various methods to embed the ensemble power into a single model.
Ranked #47 on Action Recognition on Something-Something V2 (using extra training data)
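One common way to "embed the ensemble power into a single model," as the entry above puts it, is knowledge distillation: a single student is trained to match the averaged softened predictions of several teachers. The sketch below is a generic illustration of that idea, not the paper's specific method; the function names and the temperature value are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits, softened by a temperature > 1."""
    exps = [math.exp(l / temperature) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def ensemble_soft_targets(teacher_logits_list, temperature=2.0):
    """Average the softened class distributions of several teachers.
    A single student model is then trained to match this distribution."""
    probs = [softmax(logits, temperature) for logits in teacher_logits_list]
    n = len(probs)
    return [sum(p[i] for p in probs) / n for i in range(len(probs[0]))]

def distillation_loss(student_probs, soft_targets, eps=1e-12):
    """Cross-entropy between the ensemble soft targets and the
    student's predicted distribution."""
    return -sum(t * math.log(s + eps)
                for t, s in zip(soft_targets, student_probs))
```

In practice this distillation term is usually combined with the ordinary cross-entropy on ground-truth labels, so the student learns both the hard labels and the ensemble's dark knowledge.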