no code implementations • 11 Apr 2023 • William Won, Midhilesh Elavazhagan, Sudarshan Srinivasan, Ajaya Durg, Samvit Kaul, Swati Gupta, Tushar Krishna
This paper introduces TACOS, an automated synthesizer that generates topology-aware collective algorithms for common distributed machine learning collectives across arbitrary input network topologies.
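As an illustrative sketch of the synthesis problem (not the TACOS algorithm itself, whose details are in the paper), the following greedily floods All-Gather chunks over an arbitrary directed topology, one transfer per link per time step; all names here are hypothetical:

```python
# A minimal sketch of topology-aware All-Gather schedule synthesis:
# flood chunks greedily over an arbitrary directed graph of NPU links,
# allowing one chunk transfer per link per time step.
from collections import defaultdict

def synthesize_all_gather(links):
    """links: list of (src, dst) directed edges between NPU ids."""
    npus = {n for edge in links for n in edge}
    held = {n: {n} for n in npus}  # each NPU starts with only its own chunk
    schedule = []                  # per-step lists of (src, dst, chunk)
    while any(len(held[n]) < len(npus) for n in npus):
        step, received = [], defaultdict(set)
        for src, dst in links:
            # Pick one chunk the destination is still missing this step.
            missing = held[src] - held[dst] - received[dst]
            if missing:
                chunk = min(missing)
                step.append((src, dst, chunk))
                received[dst].add(chunk)
        if not step:
            raise ValueError("topology is not strongly connected")
        for src, dst, chunk in step:
            held[dst].add(chunk)
        schedule.append(step)
    return schedule

# Example: a 4-NPU unidirectional ring finishes in the optimal 3 steps.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
for t, step in enumerate(synthesize_all_gather(ring)):
    print(f"step {t}: {step}")
```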
3 code implementations • 24 Mar 2023 • William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, Tushar Krishna
In this paper, we extend the open-source ASTRA-sim infrastructure and endow it with the capabilities to model state-of-the-art and emerging distributed training models and platforms.
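ASTRA-sim's internals are beyond this summary; as a hedged illustration of the kind of analytical estimate such simulators start from (the standard alpha-beta model, not ASTRA-sim code), consider the classic cost of a ring All-Reduce:

```python
# A minimal sketch, assuming the standard alpha-beta cost model: a ring
# All-Reduce is a reduce-scatter followed by an all-gather, each taking
# p-1 steps, with each NPU sending 2(p-1)/p of the message in total.
def ring_all_reduce_time(p, n_bytes, alpha, bandwidth):
    """p NPUs, message of n_bytes, per-hop latency alpha (s),
    link bandwidth in bytes/s. Returns estimated seconds."""
    steps = 2 * (p - 1)                  # p-1 reduce-scatter + p-1 all-gather
    payload = 2 * (p - 1) / p * n_bytes  # total bytes each NPU sends
    return steps * alpha + payload / bandwidth

# Example: 8 NPUs, 1 GiB of gradients, 2 us hop latency, 100 GB/s links.
print(f"{ring_all_reduce_time(8, 2**30, 2e-6, 100e9) * 1e3:.2f} ms")
```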
no code implementations • 9 Oct 2021 • Saeed Rashidi, William Won, Sudarshan Srinivasan, Srinivas Sridharan, Tushar Krishna
Distributed training reduces DNN training time by splitting the task across multiple NPUs (e.g., GPUs/TPUs).
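A minimal NumPy sketch of that splitting, emulating per-NPU gradient computation and the averaging All-Reduce on a single host (illustrative only; `local_gradient` is a hypothetical stand-in for the model's backward pass):

```python
# A minimal sketch of data parallelism: the global batch is split across
# NPUs, each computes a local gradient, and an All-Reduce averages them
# so every NPU applies the identical weight update (emulated on one host).
import numpy as np

def local_gradient(w, x, y):
    # Gradient of mean squared error for a linear model y_hat = x @ w.
    return 2 * x.T @ (x @ w - y) / len(x)

rng = np.random.default_rng(0)
w = np.zeros(4)
x, y = rng.normal(size=(64, 4)), rng.normal(size=64)

n_npus = 4
shards = zip(np.array_split(x, n_npus), np.array_split(y, n_npus))
grads = [local_gradient(w, xs, ys) for xs, ys in shards]  # per-NPU compute
g = np.mean(grads, axis=0)                                # emulated All-Reduce
w -= 0.1 * g                               # same update lands on every NPU
```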
no code implementations • 24 Sep 2021 • William Won, Saeed Rashidi, Sudarshan Srinivasan, Tushar Krishna
High-performance distributed training platforms should leverage multi-dimensional hierarchical networks, which interconnect accelerators through different levels of the network, to dramatically reduce the number of expensive NICs required for the scale-out network.
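To make the hierarchy concrete, here is a small NumPy emulation (an assumption-laden sketch, not the paper's system): NPUs form a node-by-accelerator grid, reduce first over fast intra-node links, and only the across-node dimension touches the NIC-attached scale-out network:

```python
# A minimal sketch of a two-dimensional hierarchical All-Reduce: summing
# over the intra-node dimension first means scale-out (NIC) traffic scales
# with the number of nodes, not the total number of NPUs.
import numpy as np

nodes, per_node, grad_len = 4, 8, 16
grads = np.arange(nodes * per_node * grad_len, dtype=float)
grads = grads.reshape(nodes, per_node, grad_len)

intra = grads.sum(axis=1, keepdims=True)  # dim 0: reduce inside each node (no NIC)
inter = intra.sum(axis=0, keepdims=True)  # dim 1: reduce across nodes over NICs
result = np.broadcast_to(inter, grads.shape)  # every NPU holds the full sum

assert np.allclose(result[0, 0], grads.reshape(-1, grad_len).sum(axis=0))
```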