2 code implementations • 12 Dec 2023 • Jae-Won Chung, Yile Gu, Insu Jang, Luoxi Meng, Nikhil Bansal, Mosharaf Chowdhury
Training large AI models on numerous GPUs consumes a massive amount of energy.
1 code implementation • 15 Sep 2023 • Insu Jang, Zhenning Yang, Zhen Zhang, Xin Jin, Mosharaf Chowdhury
Oobleck enables resilient distributed training of large DNN models with guaranteed fault tolerance.