1 code implementation • 26 Mar 2024 • Kai Yuan, Christoph Bauinger, Xiangyi Zhang, Pascal Baehr, Matthias Kirchhart, Darius Dabert, Adrien Tousnakhoff, Pierre Boudier, Michael Paulitsch
We compare our approach to a similar CUDA implementation for MLPs and show that our implementation on the Intel Data Center GPU outperforms the CUDA implementation on Nvidia's H100 GPU by a factor up to 2. 84 in inference and 1. 75 in training.