Modeling Deep Learning Accelerator Enabled GPUs

19 Nov 2018 · Md Aamir Raihan, Negar Goli, Tor Aamodt

The efficacy of deep learning has made it one of the most important applications running in data centers today. The NVIDIA Tesla V100 GPU introduced a specialized functional unit, the Tensor Core, to meet growing demand for higher performance on this workload. To exploit the full capability of current NVIDIA GPUs, machine learning researchers have started to use Tensor Cores; for example, five of the six 2018 Gordon Bell Award finalists used Tensor Cores in their work. However, no open-source GPU microarchitectural simulator currently models Tensor Cores. In this paper, we comprehensively investigate NVIDIA's Tensor Core implementation in the Volta and Turing architectures and propose an architectural model for it. Our Tensor Core timing model, implemented in GPGPU-Sim, achieves 99.6% IPC correlation versus a physical V100 GPU. Building upon this, we also enable GPGPU-Sim to run NVIDIA's CUTLASS, an open-source CUDA C++ template library that provides customizable GEMM templates, including support for Tensor Cores.
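
For context on the programming interface the modeled hardware sits behind: Tensor Cores are exposed to CUDA C++ through the warp-level WMMA API (and used internally by CUTLASS). The sketch below is a minimal, illustrative kernel showing a single warp computing one 16x16x16 tile of D = A*B + C; it is not code from the paper, and the kernel name and layouts are assumptions chosen for illustration.

```cuda
#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

// Illustrative sketch: one warp computes a single 16x16x16 tile D = A*B + C
// using the WMMA API, which maps to Tensor Core (HMMA) instructions on Volta/Turing.
__global__ void wmma_tile_gemm(const half *a, const half *b,
                               const float *c, float *d) {
    // Fragments are opaque, per-warp register containers for matrix tiles.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> d_frag;

    // Cooperative, warp-wide loads; the last argument is the leading dimension.
    wmma::load_matrix_sync(a_frag, a, 16);
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::load_matrix_sync(c_frag, c, 16, wmma::mem_row_major);

    // The matrix-multiply-accumulate executed on the Tensor Cores.
    wmma::mma_sync(d_frag, a_frag, b_frag, c_frag);

    // Write the result tile back to global memory.
    wmma::store_matrix_sync(d, d_frag, 16, wmma::mem_row_major);
}
```

Larger GEMMs tile the problem across warps and thread blocks; CUTLASS's templates automate that decomposition, which is why enabling it in GPGPU-Sim exercises the Tensor Core model with realistic workloads.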

