Search Results for author: Bor-Yiing Su

Found 3 papers, 0 papers with code

CPR: Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery

no code implementations • 5 Nov 2020 • Kiwan Maeng, Shivam Bharuka, Isabel Gao, Mark C. Jeffrey, Vikram Saraph, Bor-Yiing Su, Caroline Trippel, Jiyan Yang, Mike Rabbat, Brandon Lucia, Carole-Jean Wu

To the best of our knowledge, this paper is the first to perform a data-driven, in-depth analysis of applying partial recovery to recommendation models, and it identifies a trade-off between accuracy and performance.

Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems

no code implementations • 20 Mar 2020 • Maxim Naumov, John Kim, Dheevatsa Mudigere, Srinivas Sridharan, Xiaodong Wang, Whitney Zhao, Serhat Yilmaz, Changkyu Kim, Hector Yuen, Mustafa Ozdal, Krishnakumar Nair, Isabel Gao, Bor-Yiing Su, Jiyan Yang, Mikhail Smelyanskiy

Large-scale training is important to ensure high performance and accuracy of machine-learning models.

Distributed, Parallel, and Cluster Computing • MSC classes: 68T05, 68M10 • ACM classes: H.3.3; I.2.6; C.2.1
