1 code implementation • 27 Feb 2024 • Tamara Czinczoll, Christoph Hönes, Maximilian Schall, Gerard de Melo
While (large) language models have improved significantly in recent years, they still struggle to sensibly process long sequences, such as those found in books, due to the quadratic scaling of the underlying attention mechanism.
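The quadratic scaling mentioned above comes from the fact that self-attention compares every token with every other token, producing an n-by-n score matrix for a sequence of length n. A minimal NumPy sketch (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def attention_scores(q: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention scores for queries q and keys k,
    each of shape (n, d). The result is an (n, n) matrix, so both
    memory and compute grow quadratically with sequence length n."""
    d = q.shape[-1]
    return q @ k.T / np.sqrt(d)

n, d = 1024, 64
q = np.random.randn(n, d)
k = np.random.randn(n, d)
scores = attention_scores(q, k)
# (n, n) score matrix: doubling n quadruples the number of entries.
assert scores.shape == (n, n)
```

Doubling the sequence length from 1024 to 2048 therefore quadruples the size of the score matrix, which is why plain Transformers become impractical for book-length inputs.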