Transformers have the potential to learn longer-term dependencies, but are limited by a fixed-length context in the setting of language modeling.
Ranked #1 on Language Modelling on One Billion Word
With the capability of modeling bidirectional contexts, denoising-autoencoding-based pretraining approaches such as BERT achieve better performance than pretraining approaches based on autoregressive language modeling (a toy contrast of the two objectives is sketched below).
Ranked #1 on Text Classification on DBpedia
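A minimal sketch, not taken from the paper, of how the two pretraining objectives differ when building training targets for a toy token-ID sequence. The `MASK_ID` value and the 15% masking rate are illustrative assumptions.

```python
# Contrast denoising autoencoding (BERT-style) with autoregressive LM (GPT-style)
# on a toy token-id sequence. MASK_ID and the masking rate are assumptions.
import random

MASK_ID = 0                      # hypothetical [MASK] token id
tokens = [12, 47, 9, 88, 23, 5]  # toy token-id sequence

# Denoising autoencoding: corrupt some positions, predict the originals
# using bidirectional context.
corrupted, mlm_targets = [], []
for t in tokens:
    if random.random() < 0.15:   # mask roughly 15% of positions
        corrupted.append(MASK_ID)
        mlm_targets.append(t)    # the model must recover the original token
    else:
        corrupted.append(t)
        mlm_targets.append(None) # no loss at unmasked positions

# Autoregressive language modeling: predict each token from its left context
# only, so inputs and targets are the sequence shifted by one position.
ar_inputs = tokens[:-1]
ar_targets = tokens[1:]

print(corrupted, mlm_targets)
print(ar_inputs, ar_targets)
```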
We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token (a toy sketch of these two noising operations follows below).
Ranked #9 on Question Answering on SQuAD1.1 dev (F1 metric)
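A minimal sketch, under stated assumptions, of the two noising operations described above: sentence-order shuffling and text infilling, where a sampled span of tokens is replaced by a single mask token. The `<mask>` string and the span-length distribution are illustrative choices, not the paper's implementation.

```python
# Illustrative noising: shuffle sentence order, and replace one span of tokens
# with a single mask token. Span lengths here are uniform, purely for demo.
import random

def shuffle_sentences(sentences):
    shuffled = sentences[:]
    random.shuffle(shuffled)
    return shuffled

def infill(tokens, mask="<mask>", max_span=3):
    # Replace one randomly chosen span (possibly zero-length) with a single mask.
    span_len = random.randint(0, max_span)
    start = random.randrange(0, len(tokens) - span_len + 1)
    return tokens[:start] + [mask] + tokens[start + span_len:]

sentences = ["the cat sat .", "it was warm .", "then it slept ."]
print(shuffle_sentences(sentences))
print(infill("the cat sat on the mat".split()))
```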
We show that the use of web-crawled data is preferable to the use of Wikipedia data.
Ranked #1 on Part-Of-Speech Tagging on French GSD
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP).
Ranked #1 on Linguistic Acceptability on CoLA
Tasks: Common Sense Reasoning, Coreference Resolution, Document Summarization, Linguistic Acceptability, Machine Translation, Natural Language Inference, Question Answering, Semantic Textual Similarity, Sentiment Analysis, Text Classification, Transfer Learning, Word Sense Disambiguation
Instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not.
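A minimal sketch, not the paper's code, of building the replaced-token-detection labels described above: some positions are swapped for "generator samples" (here simply random vocabulary IDs), and the discriminator's target at each position is 1 if the token was replaced and 0 otherwise. `VOCAB_SIZE` and the 15% replacement rate are assumptions.

```python
# Construct per-token labels for replaced-token detection on a toy sequence.
import random

VOCAB_SIZE = 100                 # hypothetical vocabulary size
tokens = [12, 47, 9, 88, 23, 5]  # toy token-id sequence

corrupted, labels = [], []
for t in tokens:
    if random.random() < 0.15:                 # replace roughly 15% of positions
        sample = random.randrange(VOCAB_SIZE)  # stand-in for a generator sample
        corrupted.append(sample)
        labels.append(int(sample != t))        # replaced -> 1, unchanged -> 0
    else:
        corrupted.append(t)
        labels.append(0)

print(corrupted, labels)
# A discriminator would be trained with a per-token binary loss against `labels`.
```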
Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks.
Ranked #1 on Natural Language Inference on QNLI
In this paper, we describe a novel approach for detecting humor in short texts using BERT sentence embeddings.
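A minimal sketch of the general recipe implied above: encode each short text into a BERT-based sentence embedding and train a classifier on top. The encoder name, the sentence-transformers/scikit-learn stack, and the toy data are assumptions for illustration, not the paper's actual pipeline.

```python
# Encode short texts with a BERT-style sentence encoder, then fit a simple
# classifier on the resulting fixed-size vectors. Toy labels only.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

texts = ["Why did the chicken cross the road? To get to the other side.",
         "The meeting is scheduled for 3 pm on Thursday."]
labels = [1, 0]  # 1 = humorous, 0 = not humorous (toy labels)

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any BERT-style sentence encoder
embeddings = encoder.encode(texts)                 # one fixed-size vector per text

clf = LogisticRegression(max_iter=1000).fit(embeddings, labels)
print(clf.predict(encoder.encode(["I told my wife she was drawing her eyebrows too high. She looked surprised."])))
```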
Language models have become a key component in achieving state-of-the-art results on many different Natural Language Processing (NLP) tasks.