no code implementations • 6 Feb 2024 • Michael Zhang, Kush Bhatia, Hermann Kumbong, Christopher Ré
Experiments show Hedgehog recovers over 99% of standard Transformer quality in train-from-scratch and finetuned-conversion settings, outperforming prior linear attentions by up to 6 perplexity points on WikiText-103 with causal GPTs, and by up to 8.7 GLUE score points on finetuned bidirectional BERTs.
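The core mechanism behind linear attentions like Hedgehog is replacing softmax attention with a feature map φ applied to queries and keys, so attention factorizes and causal attention runs in O(n) via running sums. The sketch below is illustrative only: Hedgehog learns its softmax-mimicking feature map, whereas here a simple exponential of a random projection stands in, and all dimensions are arbitrary.

```python
import numpy as np

def linear_attention(q, k, v, feature_map):
    """Causal linear attention: O(n) in sequence length.

    Instead of softmax(q k^T) v, apply a feature map phi to q and k so
    attention factorizes as phi(q) (phi(k)^T v), computed with running sums.
    """
    n, _ = q.shape
    phi_q, phi_k = feature_map(q), feature_map(k)    # (n, r)
    out = np.zeros_like(v)
    kv = np.zeros((phi_k.shape[1], v.shape[1]))      # running sum of phi(k)^T v
    z = np.zeros(phi_k.shape[1])                     # running normalizer sum of phi(k)
    for t in range(n):
        kv += np.outer(phi_k[t], v[t])
        z += phi_k[t]
        out[t] = phi_q[t] @ kv / (phi_q[t] @ z + 1e-6)
    return out

# Stand-in feature map (NOT Hedgehog's learned map): exp of a random
# projection, shifted for numerical stability.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16)) / 4
softmax_like = lambda x: np.exp(x @ W - (x @ W).max(axis=-1, keepdims=True))

q = rng.normal(size=(8, 16))
k = rng.normal(size=(8, 16))
v = rng.normal(size=(8, 16))
out = linear_attention(q, k, v, softmax_like)
print(out.shape)  # (8, 16)
```

Because `kv` and `z` are carried forward step by step, memory and compute per token are constant in sequence length, which is the property these methods exploit.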
1 code implementation • 10 Nov 2023 • Daniel Y. Fu, Hermann Kumbong, Eric Nguyen, Christopher Ré
FlashFFTConv uses a matrix decomposition that computes the FFT using matrix multiply units and enables kernel fusion for long sequences, reducing I/O.
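The general idea of computing an FFT with matrix multiplies can be sketched with a four-step Cooley-Tukey decomposition: a length-N transform with N = N1·N2 becomes two batches of small dense DFT matmuls plus a pointwise twiddle correction. This is a simplified illustration of the principle, not FlashFFTConv's actual Monarch decomposition or its fused GPU kernels.

```python
import numpy as np

def dft_matrix(n):
    # Dense DFT matrix: F[j, k] = exp(-2*pi*i * j*k / n)
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(-2j * np.pi * j * k / n)

def fft_via_matmul(x, n1, n2):
    """Four-step FFT of length n1*n2 expressed as matrix multiplies,
    the kind of computation matrix-multiply units accelerate."""
    n = n1 * n2
    X = x.reshape(n1, n2)
    # Step 1: length-n1 DFTs down the columns, as one matmul.
    X = dft_matrix(n1) @ X
    # Step 2: pointwise twiddle factors exp(-2*pi*i * c*b / n).
    c, b = np.meshgrid(np.arange(n1), np.arange(n2), indexing="ij")
    X = X * np.exp(-2j * np.pi * c * b / n)
    # Step 3: length-n2 DFTs across the rows, as one matmul.
    X = X @ dft_matrix(n2)
    # Step 4: transpose to recover standard output ordering.
    return X.T.reshape(n)

x = np.random.default_rng(1).normal(size=16)
assert np.allclose(fft_via_matmul(x, 4, 4), np.fft.fft(x))
```

Recursing this decomposition on the small DFTs yields the familiar O(n log n) FFT while keeping every step a matmul; fusing the steps into one kernel is what cuts the I/O for long sequences.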