Search Results for author: Shaohuai Shi

Found 35 papers, 15 papers with code

FedImpro: Measuring and Improving Client Update in Federated Learning

no code implementations • 10 Feb 2024 • Zhenheng Tang, Yonggang Zhang, Shaohuai Shi, Xinmei Tian, Tongliang Liu, Bo Han, Xiaowen Chu

First, we analyze the generalization contribution of local training and conclude that this generalization contribution is bounded by the conditional Wasserstein distance between the data distributions of different clients.

Federated Learning
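
For readers unfamiliar with the term, the following is an illustrative sketch of the quantity the snippet refers to: the Wasserstein-1 distance between two clients' label-conditioned data distributions (the notation is ours and not necessarily the paper's):

```latex
% Conditional Wasserstein-1 distance between clients i and j for label y,
% written via the usual coupling (optimal transport) formulation.
W_1\big(P_i(x \mid y),\, P_j(x \mid y)\big)
  = \inf_{\gamma \,\in\, \Pi\left(P_i(\cdot \mid y),\, P_j(\cdot \mid y)\right)}
    \mathbb{E}_{(x, x') \sim \gamma}\,\big\lVert x - x' \big\rVert
```

The paper's bound presumably aggregates this distance over labels and client pairs; the exact form is given in the paper itself.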

Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models

no code implementations • 7 Nov 2023 • Longteng Zhang, Xiang Liu, Zeyu Li, Xinglin Pan, Peijie Dong, Ruibo Fan, Rui Guo, Xin Wang, Qiong Luo, Shaohuai Shi, Xiaowen Chu

For end users, our benchmark and findings provide a better understanding of different optimization techniques, training and inference frameworks, and hardware platforms when choosing configurations for deploying LLMs.

Quantization

FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs

no code implementations • 3 Sep 2023 • Zhenheng Tang, Yuxin Wang, Xin He, Longteng Zhang, Xinglin Pan, Qiang Wang, Rongfei Zeng, Kaiyong Zhao, Shaohuai Shi, Bingsheng He, Xiaowen Chu

The rapid growth of memory and computation requirements of large language models (LLMs) has outpaced the development of hardware, hindering people who lack large-scale high-end GPUs from training or deploying LLMs.

Scheduling

LoRA-FA: Memory-efficient Low-rank Adaptation for Large Language Models Fine-tuning

no code implementations • 7 Aug 2023 • Longteng Zhang, Lin Zhang, Shaohuai Shi, Xiaowen Chu, Bo Li

The low-rank adaptation (LoRA) method can largely reduce the number of trainable parameters for fine-tuning large language models (LLMs); however, it still requires expensive activation memory to update the low-rank weights.
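
For context on what LoRA looks like in code, here is a minimal sketch of a LoRA-style linear layer in PyTorch (ours, not the authors' implementation). The line marked as the LoRA-FA-style variant reflects our reading of the title, stated as an assumption: freezing the down-projection A so that only B is trained, which removes the need to keep full input activations for A's gradient.

```python
# Minimal LoRA-style layer (illustrative sketch, not the paper's code).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)            # frozen pre-trained weight
        self.lora_A = nn.Linear(in_features, rank, bias=False)    # down-projection
        self.lora_B = nn.Linear(rank, out_features, bias=False)   # up-projection
        nn.init.normal_(self.lora_A.weight, std=0.02)
        nn.init.zeros_(self.lora_B.weight)                 # update starts at zero
        self.scaling = alpha / rank
        # LoRA-FA-style variant (our assumption): freeze A as well, train only B.
        self.lora_A.weight.requires_grad_(False)

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))
```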

Eva: A General Vectorized Approximation Framework for Second-order Optimization

no code implementations • 4 Aug 2023 • Lin Zhang, Shaohuai Shi, Bo Li

Second-order optimization algorithms exhibit excellent convergence properties for training deep learning models, but often incur significant computation and memory overheads.

Evaluation and Optimization of Gradient Compression for Distributed Deep Learning

1 code implementation • 15 Jun 2023 • Lin Zhang, Longteng Zhang, Shaohuai Shi, Xiaowen Chu, Bo Li

To accelerate distributed training, many gradient compression methods have been proposed to alleviate the communication bottleneck in synchronous stochastic gradient descent (S-SGD), but their efficacy in real-world applications remains unclear.

Quantization
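
For context on what such compressors look like, below is a minimal sketch of one common method, top-k sparsification, which keeps only the k largest-magnitude gradient entries (illustrative only, not the benchmark code released with the paper):

```python
# Top-k gradient sparsification (illustrative sketch).
import torch

def topk_compress(grad: torch.Tensor, ratio: float = 0.01):
    """Keep the top `ratio` fraction of entries by magnitude."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)
    return flat[indices], indices, grad.shape        # values, indices, original shape

def topk_decompress(values, indices, shape):
    """Scatter the transmitted values back into a dense zero tensor."""
    flat = torch.zeros(shape, dtype=values.dtype).flatten()
    flat[indices] = values
    return flat.reshape(shape)

g = torch.randn(10_000)
v, idx, shape = topk_compress(g, ratio=0.01)
g_hat = topk_decompress(v, idx, shape)               # sparse approximation of g
print(v.numel(), "of", g.numel(), "entries transmitted")
```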

FedML Parrot: A Scalable Federated Learning System via Heterogeneity-aware Scheduling on Sequential and Hierarchical Training

1 code implementation • 3 Mar 2023 • Zhenheng Tang, Xiaowen Chu, Ryan Yide Ran, Sunwoo Lee, Shaohuai Shi, Yonggang Zhang, Yuxin Wang, Alex Qiaozhong Liang, Salman Avestimehr, Chaoyang He

It improves training efficiency, remarkably relaxes hardware requirements, and supports efficient large-scale FL experiments with stateful clients by: (1) training clients sequentially on devices; (2) decomposing the original aggregation into local and global aggregation on devices and the server, respectively; (3) scheduling tasks to mitigate straggler problems and enhance computing utility; and (4) a distributed client state manager that supports various FL algorithms.

Federated Learning, Scheduling

DeAR: Accelerating Distributed Deep Learning with Fine-Grained All-Reduce Pipelining

1 code implementation • 24 Feb 2023 • Lin Zhang, Shaohuai Shi, Xiaowen Chu, Wei Wang, Bo Li, Chengjian Liu

Communication scheduling has been shown to be effective in accelerating distributed training, as it enables all-reduce communications to be overlapped with backpropagation computations.

Scheduling

An Efficient Split Fine-tuning Framework for Edge and Cloud Collaborative Learning

no code implementations • 30 Nov 2022 • Shaohuai Shi, Qing Yang, Yang Xiang, Shuhan Qi, Xuan Wang

To enable the pre-trained models to be fine-tuned with local data on edge devices without sharing data with the cloud, we design an efficient split fine-tuning (SFT) framework for edge and cloud collaborative learning.

EASNet: Searching Elastic and Accurate Network Architecture for Stereo Matching

1 code implementation • 20 Jul 2022 • Qiang Wang, Shaohuai Shi, Kaiyong Zhao, Xiaowen Chu

However, architectures from existing NAS studies on dense prediction tasks, especially stereo matching, still cannot be efficiently and effectively deployed on devices with different computing capabilities.

Image Classification, Neural Architecture Search, +3

Scalable K-FAC Training for Deep Neural Networks with Distributed Preconditioning

1 code implementation • 30 Jun 2022 • Lin Zhang, Shaohuai Shi, Wei Wang, Bo Li

The second-order optimization methods, notably the D-KFAC (Distributed Kronecker Factored Approximate Curvature) algorithms, have gained traction on accelerating deep neural network (DNN) training on GPU clusters.

Virtual Homogeneity Learning: Defending against Data Heterogeneity in Federated Learning

1 code implementation • 6 Jun 2022 • Zhenheng Tang, Yonggang Zhang, Shaohuai Shi, Xin He, Bo Han, Xiaowen Chu

In federated learning (FL), model performance typically suffers from client drift induced by data heterogeneity, and mainstream works focus on correcting client drift.

Federated Learning

Nebula-I: A General Framework for Collaboratively Training Deep Learning Models on Low-Bandwidth Cloud Clusters

1 code implementation • 19 May 2022 • Yang Xiang, Zhihua Wu, Weibao Gong, Siyu Ding, Xianjie Mo, Yuang Liu, Shuohuan Wang, Peng Liu, Yongshuai Hou, Long Li, Bin Wang, Shaohuai Shi, Yaqian Han, Yue Yu, Ge Li, Yu Sun, Yanjun Ma, Dianhai Yu

We took natural language processing (NLP) as an example to show how Nebula-I works in different training phases, including: a) pre-training a multilingual language model using two remote clusters; and b) fine-tuning a machine translation model using knowledge distilled from pre-trained models, which together follow the most popular paradigm of recent deep learning.

Cross-Lingual Natural Language Inference, Distributed Computing, +2

FADNet++: Real-Time and Accurate Disparity Estimation with Configurable Networks

no code implementations • 6 Oct 2021 • Qiang Wang, Shaohuai Shi, Shizhen Zheng, Kaiyong Zhao, Xiaowen Chu

The disparity estimation problem is increasingly addressed by DNNs, which achieve much better prediction accuracy than traditional methods based on hand-crafted features.

Disparity Estimation

Accelerating Distributed K-FAC with Smart Parallelism of Computing and Communication Tasks

no code implementations • 14 Jul 2021 • Shaohuai Shi, Lin Zhang, Bo Li

Specifically, 1) we first characterize the performance bottlenecks of D-KFAC; 2) we design and implement a pipelining mechanism for Kronecker-factor computation and communication with dynamic tensor fusion; and 3) we develop a load-balanced placement scheme for inverting multiple matrices on GPU clusters.
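
The load-balancing idea in 3) can be illustrated with a simple greedy placement. This is only a sketch under our own assumptions (inversion cost modeled as cubic in the factor dimension, longest-processing-time greedy assignment); the paper's actual placement strategy may differ:

```python
# Greedy longest-processing-time (LPT) placement of matrix inversions (sketch).
import heapq

def balanced_placement(factor_dims, num_workers):
    """Assign each Kronecker factor (given by its dimension) to a worker."""
    heap = [(0.0, w) for w in range(num_workers)]      # (accumulated cost, worker id)
    heapq.heapify(heap)
    placement = {}
    # Place the most expensive inversions first (cost ~ n^3 for an n x n factor).
    for idx, n in sorted(enumerate(factor_dims), key=lambda t: -t[1] ** 3):
        cost, worker = heapq.heappop(heap)             # least-loaded worker so far
        placement[idx] = worker
        heapq.heappush(heap, (cost + float(n) ** 3, worker))
    return placement

# Example: six factors of various sizes spread over two GPUs.
print(balanced_placement([1024, 512, 2048, 256, 768, 1024], num_workers=2))
```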

Automated Model Design and Benchmarking of 3D Deep Learning Models for COVID-19 Detection with Chest CT Scans

2 code implementations • 14 Jan 2021 • Xin He, Shihao Wang, Xiaowen Chu, Shaohuai Shi, Jiangping Tang, Xin Liu, Chenggang Yan, Jiyong Zhang, Guiguang Ding

The experimental results show that our automatically searched models (CovidNet3D) outperform the baseline human-designed models on the three datasets, with model sizes tens of times smaller and higher accuracy.

Benchmarking, Medical Diagnosis, +1

A Quantitative Survey of Communication Optimizations in Distributed Deep Learning

1 code implementation • 27 May 2020 • Shaohuai Shi, Zhenheng Tang, Xiaowen Chu, Chengjian Liu, Wei Wang, Bo Li

In this article, we present a quantitative survey of communication optimization techniques for data parallel distributed DL.

Scheduling

Communication-Efficient Distributed Deep Learning: A Comprehensive Survey

no code implementations • 10 Mar 2020 • Zhenheng Tang, Shaohuai Shi, Wei Wang, Bo Li, Xiaowen Chu

In this paper, we provide a comprehensive survey of the communication-efficient distributed training algorithms, focusing on both system-level and algorithmic-level optimizations.

Communication Contention Aware Scheduling of Multiple Deep Learning Training Jobs

no code implementations • 24 Feb 2020 • Qiang Wang, Shaohuai Shi, Canhui Wang, Xiaowen Chu

We thus propose a provable algorithm, AdaDUAL, to efficiently schedule those communication tasks.

Scheduling

Communication-Efficient Decentralized Learning with Sparsification and Adaptive Peer Selection

no code implementations • 22 Feb 2020 • Zhenheng Tang, Shaohuai Shi, Xiaowen Chu

Each worker only needs to communicate with a single peer at each communication round, using a highly compressed model, which can significantly reduce the communication traffic on the worker.

Federated Learning
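
To make the single-peer pattern concrete, here is a tiny sketch of a per-round peer choice (assumptions ours; the paper's adaptive peer selection is more sophisticated, and the exchanged model would also be heavily compressed, e.g. as in the sparsification sketch above):

```python
# Single-peer-per-round selection (illustrative sketch, not the paper's strategy).
import random

def select_peer(rank: int, world_size: int, round_id: int) -> int:
    """Pick one peer for this round; never returns the worker itself."""
    rng = random.Random(round_id * 1_000_003 + rank)   # deterministic per (round, rank)
    peer = rng.randrange(world_size - 1)
    return peer if peer < rank else peer + 1           # skip self

# Example: peers chosen by worker 0 of 8 over five rounds.
print([select_peer(0, 8, r) for r in range(5)])
```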

MG-WFBP: Merging Gradients Wisely for Efficient Communication in Distributed Deep Learning

1 code implementation • 18 Dec 2019 • Shaohuai Shi, Xiaowen Chu, Bo Li

Distributed synchronous stochastic gradient descent has been widely used to train deep neural networks (DNNs) on computer clusters.

Understanding Top-k Sparsification in Distributed Deep Learning

1 code implementation • 20 Nov 2019 • Shaohuai Shi, Xiaowen Chu, Ka Chun Cheung, Simon See

Distributed stochastic gradient descent (SGD) algorithms are widely deployed in training large-scale deep learning models, while the communication overhead among workers becomes the new system bottleneck.

Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees

no code implementations • 20 Nov 2019 • Shaohuai Shi, Zhenheng Tang, Qiang Wang, Kaiyong Zhao, Xiaowen Chu

To reduce the long training time of large deep neural network (DNN) models, distributed synchronous stochastic gradient descent (S-SGD) is commonly used on a cluster of workers.

Distributed Optimization

Benchmarking the Performance and Energy Efficiency of AI Accelerators for AI Training

no code implementations • 15 Sep 2019 • Yuxin Wang, Qiang Wang, Shaohuai Shi, Xin He, Zhenheng Tang, Kaiyong Zhao, Xiaowen Chu

Different from existing end-to-end benchmarks, which only present the training time, we try to investigate the impact of hardware, the vendor's software library, and the deep learning framework on the performance and energy consumption of AI training.

Benchmarking

A Distributed Synchronous SGD Algorithm with Global Top-$k$ Sparsification for Low Bandwidth Networks

1 code implementation • 14 Jan 2019 • Shaohuai Shi, Qiang Wang, Kaiyong Zhao, Zhenheng Tang, Yuxin Wang, Xiang Huang, Xiaowen Chu

Current methods that use AllGather to accumulate the sparse gradients have a communication complexity of $O(kP)$, where $P$ is the number of workers, which is inefficient on low bandwidth networks with a large number of workers.
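
A quick back-of-the-envelope calculation (with hypothetical numbers) shows why the $O(kP)$ term matters on low bandwidth networks:

```python
# Illustrative O(kP) receive volume for AllGather-based sparse gradient aggregation.
P = 64                        # number of workers (hypothetical)
model_size = 25_000_000       # number of parameters (hypothetical)
k = int(0.001 * model_size)   # top-0.1% entries kept per worker
bytes_per_entry = 4 + 4       # fp32 value + int32 index

allgather_bytes = k * P * bytes_per_entry   # each worker receives k pairs from all P workers
print(f"AllGather receive volume per worker: {allgather_bytes / 1e6:.1f} MB per iteration")
# The volume grows linearly with P, which is what a global top-k scheme aims to avoid.
```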

MG-WFBP: Efficient Data Communication for Distributed Synchronous SGD Algorithms

2 code implementations • 27 Nov 2018 • Shaohuai Shi, Xiaowen Chu, Bo Li

Distributed synchronous stochastic gradient descent has been widely used to train deep neural networks on computer clusters.

Distributed, Parallel, and Cluster Computing

Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes

no code implementations • 30 Jul 2018 • Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, Tiegang Chen, Guangxiao Hu, Shaohuai Shi, Xiaowen Chu

We propose highly optimized all-reduce algorithms that achieve up to 3x and 11x speedups on AlexNet and ResNet-50, respectively, over NCCL-based training on a cluster with 1024 Tesla P40 GPUs.

Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs

1 code implementation • 16 Nov 2017 • Shaohuai Shi, Xiaowen Chu

Deep learning frameworks have been widely deployed on GPU servers for deep learning applications in both academia and industry.

Distributed, Parallel, and Cluster Computing

Performance Evaluation of Deep Learning Tools in Docker Containers

no code implementations • 9 Nov 2017 • Pengfei Xu, Shaohuai Shi, Xiaowen Chu

We first benchmark the performance of system components (I/O, CPU, and GPU) in a Docker container and on the host system, and compare the results to see whether there is any difference.

Management

Speeding up Convolutional Neural Networks By Exploiting the Sparsity of Rectifier Units

no code implementations • 25 Apr 2017 • Shaohuai Shi, Xiaowen Chu

Rectifier neuron units (ReLUs) have been widely used in deep convolutional networks.

Supervised Learning Based Algorithm Selection for Deep Neural Networks

no code implementations • 10 Feb 2017 • Shaohuai Shi, Pengfei Xu, Xiaowen Chu

In this paper, we focus on optimizing the operation of multiplying a matrix with the transpose of another matrix (referred to as the NT operation hereafter), which contributes about half of the training time of fully connected deep neural networks.
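
For clarity, the NT operation named here is simply C = A * B^T; a tiny example with arbitrarily chosen shapes:

```python
# The "NT" GEMM: multiply a matrix by the transpose of another matrix.
import numpy as np

A = np.random.rand(128, 256)   # first operand
B = np.random.rand(512, 256)   # second operand; its transpose has matching inner dim
C = A @ B.T                    # (128, 256) x (256, 512) -> (128, 512)
print(C.shape)                 # (128, 512)
```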

Benchmarking State-of-the-Art Deep Learning Software Tools

no code implementations • 25 Aug 2016 • Shaohuai Shi, Qiang Wang, Pengfei Xu, Xiaowen Chu

We first benchmark the running performance of these tools with three popular types of neural networks on two CPU platforms and three GPU platforms.

Benchmarking
