GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference

Attention-based models have demonstrated remarkable success in various natural language understanding tasks. However, efficient execution remains a challenge for these models, which are memory-bound due to their massive number of parameters...
