Search Results

Attention Is All You Need

huggingface/transformers NeurIPS 2017

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration.

Ranked #2 on Multimodal Machine Translation on Multi30K (BLEU (DE-EN) metric)

Abstractive Text Summarization, Coreference Resolution, +8 more
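The Transformer introduced in this entry replaces recurrence and convolution with attention alone. For orientation, here is a minimal sketch of the scaled dot-product attention at its core, written in plain PyTorch; the function name and shapes are illustrative, not the paper's reference implementation.

```python
# Minimal sketch of scaled dot-product attention (illustrative only).
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # attention distribution over keys
    return weights @ v                       # weighted sum of values
```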

ColBERT: Using BERT Sentence Embedding in Parallel Neural Networks for Computational Humor

huggingface/transformers 27 Apr 2020

The proposed method begins by separating the sentences of the given text and using the BERT model to generate an embedding for each one.

Humor Detection, Sentence, +2 more
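A rough sketch of the per-sentence embedding step described above, using the transformers library; the bert-base-uncased checkpoint and [CLS] pooling are assumptions for illustration, not necessarily the paper's exact choices.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "Why did the model cross the road?",
    "To minimise the loss.",
]

with torch.no_grad():
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt")
        outputs = model(**inputs)
        # Use the [CLS] token as a simple sentence embedding (one of several pooling options).
        embedding = outputs.last_hidden_state[:, 0, :]
        print(sentence, embedding.shape)
```

The paper then processes these per-sentence embeddings in parallel network branches and combines their outputs for the final humor prediction.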

Supervised Multimodal Bitransformers for Classifying Images and Text

huggingface/transformers 6 Sep 2019

Self-supervised bidirectional transformer models such as BERT have led to dramatic improvements in a wide variety of textual classification tasks.

 Ranked #1 on Natural Language Inference on V-SNLI (using extra training data)

General Classification, Natural Language Inference
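A conceptual sketch of the multimodal idea, not the paper's exact MMBT code: project image features into BERT's token-embedding space and let the transformer attend jointly over visual and textual tokens. The ResNet backbone, projection, and pooling below are illustrative assumptions.

```python
import torch
from torchvision.models import resnet18
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

cnn = resnet18(weights=None)
cnn.fc = torch.nn.Identity()                   # expose 512-d image features
img_proj = torch.nn.Linear(512, bert.config.hidden_size)

text_inputs = tokenizer("a caption to classify", return_tensors="pt")
text_embeds = bert.embeddings.word_embeddings(text_inputs["input_ids"])

image = torch.randn(1, 3, 224, 224)            # stand-in for a real image tensor
img_token = img_proj(cnn(image)).unsqueeze(1)  # one "visual token" in BERT's space

# Let BERT attend jointly over the visual token and the text tokens.
inputs_embeds = torch.cat([img_token, text_embeds], dim=1)
outputs = bert(inputs_embeds=inputs_embeds)
pooled = outputs.last_hidden_state[:, 0]       # summary vector for a classification head
```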

DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference

huggingface/transformers ACL 2020

Large-scale pre-trained language models such as BERT have brought significant improvements to NLP applications.
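The technique behind this entry is dynamic early exiting: a classifier ("off-ramp") is attached to each encoder layer, and inference stops as soon as the prediction is confident enough. A hedged sketch of the exit criterion, with untrained illustrative heads and an arbitrary entropy threshold:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

num_labels, threshold = 2, 0.5                 # illustrative task size and entropy threshold
# DeeBERT trains these per-layer heads; untrained heads here only show the control flow.
ramps = torch.nn.ModuleList(
    torch.nn.Linear(encoder.config.hidden_size, num_labels)
    for _ in range(encoder.config.num_hidden_layers)
)

inputs = tokenizer("an example sentence to classify", return_tensors="pt")
with torch.no_grad():
    # Note: a real implementation exits inside the layer stack to actually save
    # compute; running the full encoder first, as here, only illustrates the rule.
    hidden_states = encoder(**inputs).hidden_states[1:]   # one state per layer
    for layer_idx, (state, ramp) in enumerate(zip(hidden_states, ramps)):
        probs = torch.softmax(ramp(state[:, 0]), dim=-1)  # off-ramp prediction at this layer
        entropy = -(probs * probs.log()).sum(-1).item()
        if entropy < threshold:                           # confident enough: exit early
            print(f"exit at layer {layer_idx}, predicted class {probs.argmax(-1).item()}")
            break
```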

Transformers: State-of-the-Art Natural Language Processing

huggingface/transformers EMNLP 2020

Transformer architectures have facilitated building higher-capacity models and pretraining has made it possible to effectively utilize this capacity for a wide variety of tasks.

Image Classification, Object Recognition, +1 more
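Since this entry is the library paper itself, a one-liner with its pipeline API shows the intended usage; the default checkpoint chosen for the task is the library's, not something the paper prescribes.

```python
from transformers import pipeline

# The checkpoint for this task is the library's default, not fixed here.
classifier = pipeline("sentiment-analysis")
print(classifier("Transformers makes state-of-the-art NLP easy to use."))
```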

Visual Instruction Tuning

huggingface/transformers NeurIPS 2023

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field.

Video Question Answering, Visual Instruction Following, +2 more
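A hedged sketch of querying a visually instruction-tuned model (LLaVA) through the transformers library; the checkpoint name, prompt template, and image path are assumptions for illustration.

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("photo.jpg")        # placeholder path to a local image
prompt = "USER: <image>\nDescribe this image. ASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True))
```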

A Study on Knowledge Distillation from Weak Teacher for Scaling Up Pre-trained Language Models

huggingface/transformers 26 May 2023

Distillation from Weak Teacher (DWT) is a method of transferring knowledge from a smaller, weaker teacher model to a larger student model to improve the student's performance.

Knowledge Distillation
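For context, a generic soft-target distillation loss (temperature-scaled KL divergence) is sketched below; DWT builds on this basic transfer step with its own additions, so treat it as background rather than the paper's method.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between temperature-softened teacher and student distributions.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2

# Illustrative shapes only: a batch of 4 examples over 10 classes.
loss = distillation_loss(torch.randn(4, 10), torch.randn(4, 10))
```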

Multilingual Denoising Pre-training for Neural Machine Translation

huggingface/transformers 22 Jan 2020

This paper demonstrates that multilingual denoising pre-training produces significant performance gains across a wide variety of machine translation (MT) tasks.

Denoising, Sentence, +2 more
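This pre-training is what the mBART checkpoints on the transformers hub are built on. A short translation sketch follows; the fine-tuned en-ro checkpoint and language codes are assumptions, not specified by this entry.

```python
from transformers import MBartForConditionalGeneration, MBartTokenizer

model_name = "facebook/mbart-large-en-ro"  # assumed fine-tuned checkpoint
tokenizer = MBartTokenizer.from_pretrained(model_name, src_lang="en_XX", tgt_lang="ro_RO")
model = MBartForConditionalGeneration.from_pretrained(model_name)

inputs = tokenizer("Machine translation is useful.", return_tensors="pt")
generated = model.generate(
    **inputs, decoder_start_token_id=tokenizer.lang_code_to_id["ro_RO"]
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```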

Improving the Domain Adaptation of Retrieval Augmented Generation (RAG) Models for Open Domain Question Answering

huggingface/transformers 6 Oct 2022

We propose RAG-end2end, an extension to RAG, that can adapt to a domain-specific knowledge base by updating all components of the external knowledge base during training.

Domain Adaptation, Information Retrieval, +3 more
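For context, this is roughly how the base RAG model is queried through transformers (RAG-end2end extends this setup by updating the retriever and knowledge base during training); the checkpoint and the dummy index are assumptions to keep the sketch self-contained.

```python
from transformers import RagRetriever, RagTokenForGeneration, RagTokenizer

model_name = "facebook/rag-token-nq"  # assumed base checkpoint
tokenizer = RagTokenizer.from_pretrained(model_name)
# use_dummy_dataset avoids downloading the full Wikipedia index for this sketch.
retriever = RagRetriever.from_pretrained(model_name, index_name="exact", use_dummy_dataset=True)
model = RagTokenForGeneration.from_pretrained(model_name, retriever=retriever)

inputs = tokenizer("what is domain adaptation in information retrieval?", return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```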

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

huggingface/transformers NeurIPS 2020

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.

 Ranked #1 on Speech Recognition on TIMIT (using extra training data)

Quantization, Self-Supervised Learning, +1 more
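A short inference sketch with a fine-tuned wav2vec 2.0 checkpoint from the transformers hub; the checkpoint name is an assumption and the random waveform stands in for a real 16 kHz recording.

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_name = "facebook/wav2vec2-base-960h"  # assumed fine-tuned checkpoint
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)

waveform = torch.randn(16000).numpy()  # placeholder: one second of 16 kHz audio
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))  # CTC-decoded transcription
```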