GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

30 Jun 2020 · Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen

Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices.
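GShard scales Transformers via conditional computation: a sparsely-gated Mixture-of-Experts layer routes each token to a small number of experts, so compute grows sublinearly with parameter count. As a rough illustration of the routing idea, here is a minimal top-2 gating sketch in NumPy; it is an assumption-laden simplification, omitting the paper's expert-capacity limits, auxiliary load-balancing loss, and random dispatch of the second expert.

```python
import numpy as np

def top2_gating(logits):
    """Illustrative top-2 gating for a Mixture-of-Experts layer.

    logits: (num_tokens, num_experts) router scores.
    Returns, per token, the indices of the two highest-scoring
    experts and their renormalized gate weights.
    """
    # softmax over experts (numerically stabilized)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # indices of the two highest-probability experts, best first
    top2 = np.argsort(probs, axis=-1)[:, -2:][:, ::-1]
    gates = np.take_along_axis(probs, top2, axis=-1)
    # renormalize the two gate values so they sum to 1 per token
    gates /= gates.sum(axis=-1, keepdims=True)
    return top2, gates

# Example: route 4 tokens among 8 experts
rng = np.random.default_rng(0)
experts, gates = top2_gating(rng.normal(size=(4, 8)))
```

Each token's output would then be the gate-weighted sum of its two selected experts' outputs; in GShard the experts themselves are sharded across accelerators via the paper's automatic-sharding annotations.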

