1 code implementation • 23 Oct 2023 • Maryam Hashemi, Ghazaleh Mahmoudi, Sara Kodeiri, Hadi Sheikhi, Sauleh Eetemadi
Large-scale pretrained models such as LXMERT are becoming popular for learning cross-modal representations on text-image pairs for vision-language tasks.
Ranked #2 on Visual Question Answering on VQA v2 test-std