GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference

8 May 2020 · Ali Hadi Zadeh, Andreas Moshovos

Attention-based models have demonstrated remarkable success in various natural language understanding tasks. However, efficient execution remains a challenge for these models, which are memory-bound due to their massive number of parameters...
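As the title indicates, the paper addresses this memory bottleneck through quantization of the model's weights. A minimal sketch of outlier-aware weight quantization in that spirit is shown below: the bulk of the weights, which tend to follow a bell-shaped distribution, are mapped to a handful of centroids, while a few large-magnitude outliers are kept at full precision. All function names, the 3-sigma outlier threshold, and the uniform-centroid choice are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def quantize_outlier_aware(w, bits=3, k=3.0):
    """Illustrative sketch: quantize the Gaussian-like majority of
    weights to 2**bits centroids; keep outliers in full precision.
    The k*sigma threshold and uniform centroids are assumptions."""
    mu, sigma = w.mean(), w.std()
    outlier_mask = np.abs(w - mu) > k * sigma
    inliers = w[~outlier_mask]

    # Uniform centroids over the inlier range (a simple stand-in
    # for a learned/clustered codebook).
    edges = np.linspace(inliers.min(), inliers.max(), 2**bits + 1)
    centroids = 0.5 * (edges[:-1] + edges[1:])
    idx = np.clip(np.digitize(inliers, edges) - 1, 0, 2**bits - 1)

    wq = w.copy()
    wq[~outlier_mask] = centroids[idx]  # quantized majority
    # Outliers retain their original full-precision values.
    return wq, outlier_mask

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096)
w[:8] = rng.uniform(0.5, 1.0, size=8)  # inject a few outliers
wq, mask = quantize_outlier_aware(w)
```

Because only the outlier values need full-precision storage, the remaining weights can be represented by a small codebook plus short indices, which is what makes such schemes attractive for memory-bound inference.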

