Search Results for author: Xiaowen Chu

Found 50 papers, 25 papers with code

VMRNN: Integrating Vision Mamba and LSTM for Efficient and Accurate Spatiotemporal Forecasting

1 code implementation25 Mar 2024 Yujin Tang, Peijie Dong, Zhenheng Tang, Xiaowen Chu, Junwei Liang

Combining CNNs or ViTs with RNNs for spatiotemporal forecasting has yielded unparalleled results in predicting temporal and spatial dynamics.

BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation

1 code implementation16 Feb 2024 Dayou Du, Yijia Zhang, Shijie Cao, Jiaqi Guo, Ting Cao, Xiaowen Chu, Ningyi Xu

The upscaling of Large Language Models (LLMs) has yielded impressive advances in natural language processing, yet it also poses significant deployment challenges.

Knowledge Distillation · Quantization
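
For orientation, the sketch below applies generic asymmetric uniform quantization to a weight tensor at 3 bits; the function name and scheme are illustrative assumptions, not BitDistiller's actual pipeline, which pairs quantization-aware training with self-distillation.

```python
# Generic asymmetric uniform quantize-dequantize of a weight tensor to n bits.
# Illustrative only; BitDistiller's method additionally uses QAT and self-distillation.
import torch

def fake_quantize(w: torch.Tensor, n_bits: int = 3) -> torch.Tensor:
    qmax = 2 ** n_bits - 1
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min).clamp(min=1e-8) / qmax       # step size of the uniform grid
    q = torch.clamp(torch.round((w - w_min) / scale), 0, qmax)
    return q * scale + w_min                              # dequantized ("fake-quantized") weights

w = torch.randn(4096, 4096)
print((fake_quantize(w, 3) - w).abs().mean())             # mean quantization error
```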

FedImpro: Measuring and Improving Client Update in Federated Learning

no code implementations10 Feb 2024 Zhenheng Tang, Yonggang Zhang, Shaohuai Shi, Xinmei Tian, Tongliang Liu, Bo Han, Xiaowen Chu

First, we analyze the generalization contribution of local training and conclude that this generalization contribution is bounded by the conditional Wasserstein distance between the data distributions of different clients.

Federated Learning
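
To make the referenced distance concrete, here is a toy sketch of the plain 1-D Wasserstein distance between two synthetic clients' feature samples using SciPy; the paper's bound uses a conditional Wasserstein distance, and the client data below is purely assumed.

```python
# Toy illustration: 1-D Wasserstein distance between two clients' feature samples.
# The paper's bound conditions on labels; this only shows the basic unconditional quantity.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
client_a = rng.normal(loc=0.0, scale=1.0, size=1000)   # features observed on client A
client_b = rng.normal(loc=0.5, scale=1.2, size=1000)   # shifted (heterogeneous) features on client B
print(wasserstein_distance(client_a, client_b))
```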

ParZC: Parametric Zero-Cost Proxies for Efficient NAS

no code implementations3 Feb 2024 Peijie Dong, Lujun Li, Xinglin Pan, Zimian Wei, Xiang Liu, Qiang Wang, Xiaowen Chu

Recent advancements in Zero-shot Neural Architecture Search (NAS) highlight the efficacy of zero-cost proxies in various NAS benchmarks.

Neural Architecture Search
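
As background for what a zero-cost proxy computes, the sketch below evaluates one well-known proxy, the gradient norm of an untrained network on a single mini-batch; ParZC itself is a learned, parametric proxy, so this is only an illustrative baseline with assumed shapes.

```python
# One classic zero-cost proxy: gradient norm at initialization on a single batch.
# Illustrative background only; not ParZC's parametric proxy.
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_norm_score(model: nn.Module, batch: torch.Tensor, targets: torch.Tensor) -> float:
    model.zero_grad()
    loss = F.cross_entropy(model(batch), targets)
    loss.backward()
    return sum(p.grad.norm().item() for p in model.parameters() if p.grad is not None)

# Example usage with a tiny untrained network and random data:
net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU(), nn.Linear(64, 10))
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
print(grad_norm_score(net, x, y))
```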

Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models

no code implementations7 Nov 2023 Longteng Zhang, Xiang Liu, Zeyu Li, Xinglin Pan, Peijie Dong, Ruibo Fan, Rui Guo, Xin Wang, Qiong Luo, Shaohuai Shi, Xiaowen Chu

For end users, our benchmark and findings help in better understanding different optimization techniques, training and inference frameworks, and hardware platforms when choosing configurations for deploying LLMs.

Quantization

FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs

no code implementations3 Sep 2023 Zhenheng Tang, Yuxin Wang, Xin He, Longteng Zhang, Xinglin Pan, Qiang Wang, Rongfei Zeng, Kaiyong Zhao, Shaohuai Shi, Bingsheng He, Xiaowen Chu

The rapid growth of memory and computation requirements of large language models (LLMs) has outpaced the development of hardware, hindering people who lack large-scale high-end GPUs from training or deploying LLMs.

Scheduling

EnsembleFollower: A Hybrid Car-Following Framework Based On Reinforcement Learning and Hierarchical Planning

no code implementations30 Aug 2023 Xu Han, Xianda Chen, Meixin Zhu, Pinlong Cai, Jianshan Zhou, Xiaowen Chu

The experimental results show that EnsembleFollower yields improved accuracy in reproducing human-like behavior and is effective at combining hybrid models, demonstrating that our proposed framework can handle diverse car-following conditions by leveraging the strengths of various low-level models.

LoRA-FA: Memory-efficient Low-rank Adaptation for Large Language Models Fine-tuning

no code implementations7 Aug 2023 Longteng Zhang, Lin Zhang, Shaohuai Shi, Xiaowen Chu, Bo Li

The low-rank adaptation (LoRA) method can largely reduce the number of trainable parameters for fine-tuning large language models (LLMs); however, it still requires expensive activation memory to update the low-rank weights.
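
A minimal PyTorch-style sketch of the idea, assuming the variant in which the down-projection A is frozen and only B is trained; class and parameter names are illustrative, not the authors' implementation.

```python
# Hypothetical LoRA-style linear layer in which the down-projection A is frozen
# (as in LoRA-FA) and only B receives gradients. Because A is frozen, only the
# rank-r activation (x @ A^T) must be kept for B's gradient, not the full input x.
import torch
import torch.nn as nn

class LoRALinearFA(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16.0):
        super().__init__()
        # Pretrained weight stays frozen.
        self.weight = nn.Parameter(torch.empty(out_features, in_features), requires_grad=False)
        nn.init.kaiming_uniform_(self.weight)
        # A is frozen at (small random) init; only B is trainable.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01, requires_grad=False)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        base = x @ self.weight.t()
        low_rank = (x @ self.lora_A.t()) @ self.lora_B.t()
        return base + self.scaling * low_rank

layer = LoRALinearFA(1024, 1024, rank=8)
out = layer(torch.randn(2, 1024))
```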

Evaluation and Optimization of Gradient Compression for Distributed Deep Learning

1 code implementation15 Jun 2023 Lin Zhang, Longteng Zhang, Shaohuai Shi, Xiaowen Chu, Bo Li

To accelerate distributed training, many gradient compression methods have been proposed to alleviate the communication bottleneck in synchronous stochastic gradient descent (S-SGD), but their efficacy in real-world applications still remains unclear.

Quantization

FedML Parrot: A Scalable Federated Learning System via Heterogeneity-aware Scheduling on Sequential and Hierarchical Training

1 code implementation3 Mar 2023 Zhenheng Tang, Xiaowen Chu, Ryan Yide Ran, Sunwoo Lee, Shaohuai Shi, Yonggang Zhang, Yuxin Wang, Alex Qiaozhong Liang, Salman Avestimehr, Chaoyang He

It improves training efficiency, substantially relaxes the hardware requirements, and supports efficient large-scale FL experiments with stateful clients by: (1) training clients sequentially on each device; (2) decomposing the original aggregation into local and global aggregation performed on devices and the server, respectively; (3) scheduling tasks to mitigate straggler problems and improve compute utilization; and (4) providing a distributed client state manager to support various FL algorithms.

Federated Learning · Scheduling
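
The toy loop below illustrates only item (1), sequential simulation of clients on one device with FedAvg-style weighted aggregation; names and structure are assumptions for illustration, not FedML Parrot's API.

```python
# Toy sequential-client simulation with FedAvg-style weighted aggregation.
# Generic illustration only; FedML Parrot additionally decomposes aggregation,
# schedules tasks, and manages per-client state. Assumes float-valued state_dict entries.
import copy
import torch
import torch.nn.functional as F

def train_round(global_model, client_loaders, local_epochs=1, lr=0.01):
    new_states, weights = [], []
    for loader in client_loaders:                      # clients trained one after another
        model = copy.deepcopy(global_model)
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(local_epochs):
            for x, y in loader:
                opt.zero_grad()
                F.cross_entropy(model(x), y).backward()
                opt.step()
        new_states.append(model.state_dict())
        weights.append(len(loader.dataset))
    total = sum(weights)
    avg = {k: sum(w / total * s[k] for w, s in zip(weights, new_states))
           for k in new_states[0]}                      # data-size-weighted parameter average
    global_model.load_state_dict(avg)
    return global_model
```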

DeAR: Accelerating Distributed Deep Learning with Fine-Grained All-Reduce Pipelining

1 code implementation24 Feb 2023 Lin Zhang, Shaohuai Shi, Xiaowen Chu, Wei Wang, Bo Li, Chengjian Liu

Communication scheduling has been shown to be effective in accelerating distributed training, which enables all-reduce communications to be overlapped with backpropagation computations.

Scheduling
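
The sketch below shows the generic overlap pattern the abstract refers to: launching an asynchronous all-reduce from each parameter's gradient hook so communication proceeds while backpropagation continues. It assumes an already-initialized torch.distributed process group and is not DeAR's fine-grained pipelined all-reduce.

```python
# Conceptual sketch: overlap gradient all-reduce with backpropagation by starting an
# asynchronous all-reduce as soon as each parameter's gradient is produced.
import torch
import torch.distributed as dist

def attach_overlap_hooks(model: torch.nn.Module, pending: list):
    def make_hook():
        def hook(grad):
            work = dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=True)
            pending.append(work)
            return grad
        return hook
    for p in model.parameters():
        if p.requires_grad:
            p.register_hook(make_hook())

# After loss.backward(): wait on every handle in `pending`, divide each gradient by the
# world size, then run the optimizer step.
```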

Rethinking Disparity: A Depth Range Free Multi-View Stereo Based on Disparity

1 code implementation30 Nov 2022 Qingsong Yan, Qiang Wang, Kaiyong Zhao, Bo Li, Xiaowen Chu, Fei Deng

Existing learning-based multi-view stereo (MVS) methods rely on the depth range to build the 3D cost volume and may fail when the range is too large or unreliable.

NAS-LID: Efficient Neural Architecture Search with Local Intrinsic Dimension

1 code implementation23 Nov 2022 Xin He, Jiangchao Yao, Yuxin Wang, Zhenheng Tang, Ka Chu Cheung, Simon See, Bo Han, Xiaowen Chu

One-shot neural architecture search (NAS) substantially improves the search efficiency by training one supernet to estimate the performance of every possible child architecture (i.e., subnet).

Neural Architecture Search

SphereDepth: Panorama Depth Estimation from Spherical Domain

no code implementations29 Aug 2022 Qingsong Yan, Qiang Wang, Kaiyong Zhao, Bo Li, Xiaowen Chu, Fei Deng

A panorama image can present complete information about the surrounding environment and has many advantages in virtual tourism, games, robotics, etc.

Depth Estimation

EASNet: Searching Elastic and Accurate Network Architecture for Stereo Matching

1 code implementation20 Jul 2022 Qiang Wang, Shaohuai Shi, Kaiyong Zhao, Xiaowen Chu

However, the architectures found by existing NAS studies on dense prediction tasks, especially stereo matching, still cannot be efficiently and effectively deployed on devices of different computing capabilities.

Image Classification · Neural Architecture Search · +3

Virtual Homogeneity Learning: Defending against Data Heterogeneity in Federated Learning

1 code implementation6 Jun 2022 Zhenheng Tang, Yonggang Zhang, Shaohuai Shi, Xin He, Bo Han, Xiaowen Chu

In federated learning (FL), model performance typically suffers from client drift induced by data heterogeneity, and mainstream works focus on correcting client drift.

Federated Learning

AdaProp: Learning Adaptive Propagation for Graph Neural Network based Knowledge Graph Reasoning

2 code implementations30 May 2022 Yongqi Zhang, Zhanke Zhou, Quanming Yao, Xiaowen Chu, Bo Han

An important design component of GNN-based KG reasoning methods is called the propagation path, which contains a set of involved entities in each propagation step.

Knowledge Graphs

EAGAN: Efficient Two-stage Evolutionary Architecture Search for GANs

1 code implementation30 Nov 2021 Guohao Ying, Xin He, Bin Gao, Bo Han, Xiaowen Chu

Some recent works try to search both generator (G) and discriminator (D), but they suffer from the instability of GAN training.

Image Generation · Neural Architecture Search · +2

FedCV: A Federated Learning Framework for Diverse Computer Vision Tasks

1 code implementation22 Nov 2021 Chaoyang He, Alay Dilipbhai Shah, Zhenheng Tang, Di Fan, Adarshan Naiynar Sivashunmugam, Keerti Bhogaraju, Mita Shimpi, Li Shen, Xiaowen Chu, Mahdi Soltanolkotabi, Salman Avestimehr

To bridge the gap and facilitate the development of FL for computer vision tasks, in this work, we propose a federated learning library and benchmarking framework, named FedCV, to evaluate FL on the three most representative computer vision tasks: image classification, image segmentation, and object detection.

Benchmarking · Federated Learning · +5

Embracing Structure in Data for Billion-Scale Semantic Product Search

no code implementations12 Oct 2021 Vihan Lakshman, Choon Hui Teo, Xiaowen Chu, Priyanka Nigam, Abhinandan Patni, Pooja Maknikar, SVN Vishwanathan

When training a dyadic model, one seeks to embed two different types of entities (e.g., queries and documents or users and movies) in a common vector space such that pairs with high relevance are positioned nearby.
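
A minimal two-tower sketch of such a dyadic model: two embedding tables map queries and documents into a shared space and relevance is scored by dot product. Entity counts and dimensions are arbitrary assumptions; the paper's contribution concerns billion-scale training of such models.

```python
# Minimal two-tower ("dyadic") embedding model: relevance = dot product in a shared space.
import torch
import torch.nn as nn

class TwoTower(nn.Module):
    def __init__(self, n_queries, n_docs, dim=64):
        super().__init__()
        self.query_emb = nn.Embedding(n_queries, dim)
        self.doc_emb = nn.Embedding(n_docs, dim)

    def forward(self, q_ids, d_ids):
        q = self.query_emb(q_ids)
        d = self.doc_emb(d_ids)
        return (q * d).sum(dim=-1)   # higher score = higher relevance

scores = TwoTower(1000, 5000)(torch.tensor([1, 2]), torch.tensor([10, 20]))
```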

FADNet++: Real-Time and Accurate Disparity Estimation with Configurable Networks

no code implementations6 Oct 2021 Qiang Wang, Shaohuai Shi, Shizhen Zheng, Kaiyong Zhao, Xiaowen Chu

The disparity estimation problem is increasingly addressed by DNNs, which achieve much better prediction accuracy than traditional hand-crafted feature-based methods.

Disparity Estimation

A Comprehensive Survey of Incentive Mechanism for Federated Learning

no code implementations27 Jun 2021 Rongfei Zeng, Chao Zeng, Xingwei Wang, Bo Li, Xiaowen Chu

Federated learning utilizes various resources provided by participants to collaboratively train a global model, which potentially addresses the data privacy issue of machine learning.

Federated Learning

Evolutionary Multi-objective Architecture Search Framework: Application to COVID-19 3D CT Classification

1 code implementation26 Jan 2021 Xin He, Guohao Ying, Jiyong Zhang, Xiaowen Chu

We propose a new objective, namely potential, which helps exploit promising models to indirectly reduce the number of models involved in weight training, thus alleviating search instability.

Computed Tomography (CT) · Medical Diagnosis · +1

Automated Model Design and Benchmarking of 3D Deep Learning Models for COVID-19 Detection with Chest CT Scans

2 code implementations14 Jan 2021 Xin He, Shihao Wang, Xiaowen Chu, Shaohuai Shi, Jiangping Tang, Xin Liu, Chenggang Yan, Jiyong Zhang, Guiguang Ding

The experimental results show that our automatically searched models (CovidNet3D) outperform the baseline human-designed models on the three datasets, with model sizes tens of times smaller and higher accuracy.

Benchmarking · Medical Diagnosis · +1

EDNet: Efficient Disparity Estimation with Cost Volume Combination and Attention-based Spatial Residual

no code implementations CVPR 2021 Songyan Zhang, Zhicheng Wang, Qiang Wang, Jinshuo Zhang, Gang Wei, Xiaowen Chu

Existing state-of-the-art disparity estimation works mostly leverage the 4D concatenation volume and construct a very deep 3D convolution neural network (CNN) for disparity regression, which is inefficient due to the high memory consumption and slow inference speed.

Disparity Estimation · Stereo Matching
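
For reference, this is the standard concatenation-based cost volume construction the abstract alludes to, in simplified form with assumed tensor shapes; its B x 2C x D x H x W size is what drives the memory cost mentioned above.

```python
# Simplified 4D concatenation cost volume for stereo matching: for each candidate
# disparity d, left features are concatenated with right features shifted by d.
import torch

def concat_cost_volume(feat_l, feat_r, max_disp):
    B, C, H, W = feat_l.shape
    volume = feat_l.new_zeros(B, 2 * C, max_disp, H, W)   # B * 2C * D * H * W entries
    for d in range(max_disp):
        if d == 0:
            volume[:, :C, d] = feat_l
            volume[:, C:, d] = feat_r
        else:
            volume[:, :C, d, :, d:] = feat_l[:, :, :, d:]
            volume[:, C:, d, :, d:] = feat_r[:, :, :, :-d]
    return volume

vol = concat_cost_volume(torch.randn(1, 32, 64, 128), torch.randn(1, 32, 64, 128), max_disp=48)
```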

A Quantitative Survey of Communication Optimizations in Distributed Deep Learning

1 code implementation27 May 2020 Shaohuai Shi, Zhenheng Tang, Xiaowen Chu, Chengjian Liu, Wei Wang, Bo Li

In this article, we present a quantitative survey of communication optimization techniques for data parallel distributed DL.

Scheduling

Communication-Efficient Distributed Deep Learning: A Comprehensive Survey

no code implementations10 Mar 2020 Zhenheng Tang, Shaohuai Shi, Wei Wang, Bo Li, Xiaowen Chu

In this paper, we provide a comprehensive survey of the communication-efficient distributed training algorithms, focusing on both system-level and algorithmic-level optimizations.

Communication Contention Aware Scheduling of Multiple Deep Learning Training Jobs

no code implementations24 Feb 2020 Qiang Wang, Shaohuai Shi, Canhui Wang, Xiaowen Chu

We thus propose a provable algorithm, AdaDUAL, to efficiently schedule those communication tasks.

Scheduling

Communication-Efficient Decentralized Learning with Sparsification and Adaptive Peer Selection

no code implementations22 Feb 2020 Zhenheng Tang, Shaohuai Shi, Xiaowen Chu

2) Each worker only needs to communicate with a single peer at each communication round with a highly compressed model, which can significantly reduce the communication traffic on the worker.

Federated Learning

FMore: An Incentive Scheme of Multi-dimensional Auction for Federated Learning in MEC

no code implementations22 Feb 2020 Rongfei Zeng, Shixun Zhang, Jiaqi Wang, Xiaowen Chu

In MEC, edge nodes are reluctant to participate in learning voluntarily, and they differ in the multi-dimensional resources they provide, both of which may degrade the performance of federated learning.

Edge-computing · Federated Learning

A Survey of Deep Learning Techniques for Neural Machine Translation

1 code implementation18 Feb 2020 Shuoheng Yang, Yuxin Wang, Xiaowen Chu

In recent years, natural language processing (NLP) has made great strides thanks to deep learning techniques.

Machine Translation · NMT · +1

MG-WFBP: Merging Gradients Wisely for Efficient Communication in Distributed Deep Learning

1 code implementation18 Dec 2019 Shaohuai Shi, Xiaowen Chu, Bo Li

Distributed synchronous stochastic gradient descent has been widely used to train deep neural networks (DNNs) on computer clusters.

Playing the Game of 2048

Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees

no code implementations20 Nov 2019 Shaohuai Shi, Zhenheng Tang, Qiang Wang, Kaiyong Zhao, Xiaowen Chu

To reduce the long training time of large deep neural network (DNN) models, distributed synchronous stochastic gradient descent (S-SGD) is commonly used on a cluster of workers.

Distributed Optimization

Understanding Top-k Sparsification in Distributed Deep Learning

1 code implementation20 Nov 2019 Shaohuai Shi, Xiaowen Chu, Ka Chun Cheung, Simon See

Distributed stochastic gradient descent (SGD) algorithms are widely deployed in training large-scale deep learning models, while the communication overhead among workers becomes the new system bottleneck.
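
A minimal sketch of top-k gradient sparsification with local error feedback, using assumed tensor shapes; it illustrates the operator the paper analyzes rather than any specific system.

```python
# Top-k gradient sparsification with local error feedback (residual accumulation).
import torch

def topk_sparsify(grad: torch.Tensor, k: int, residual: torch.Tensor):
    """Keep the k largest-magnitude entries of (grad + residual); carry the rest over."""
    acc = grad + residual
    flat = acc.flatten()
    _, idx = torch.topk(flat.abs(), k)
    values = flat[idx]
    new_residual = acc.clone()
    new_residual.view(-1)[idx] = 0.0      # entries that were sent leave the residual
    return idx, values, new_residual       # only (idx, values) need to be communicated

g = torch.randn(10_000)
r = torch.zeros_like(g)
idx, vals, r = topk_sparsify(g, k=100, residual=r)   # send only 100 (index, value) pairs
```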

Benchmarking the Performance and Energy Efficiency of AI Accelerators for AI Training

no code implementations15 Sep 2019 Yuxin Wang, Qiang Wang, Shaohuai Shi, Xin He, Zhenheng Tang, Kaiyong Zhao, Xiaowen Chu

Unlike existing end-to-end benchmarks, which only report the training time, we investigate the impact of hardware, vendor software libraries, and deep learning frameworks on the performance and energy consumption of AI training.

Benchmarking

AutoML: A Survey of the State-of-the-Art

2 code implementations2 Aug 2019 Xin He, Kaiyong Zhao, Xiaowen Chu

Deep learning (DL) techniques have penetrated all aspects of our lives and brought us great convenience.

Feature Engineering · Hyperparameter Optimization · +1

A Distributed Synchronous SGD Algorithm with Global Top-$k$ Sparsification for Low Bandwidth Networks

1 code implementation14 Jan 2019 Shaohuai Shi, Qiang Wang, Kaiyong Zhao, Zhenheng Tang, Yuxin Wang, Xiang Huang, Xiaowen Chu

Current methods that use AllGather to accumulate the sparse gradients have a communication complexity of $O(kP)$, where $P$ is the number of workers, which is inefficient on low bandwidth networks with a large number of workers.
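
A back-of-envelope illustration of the O(kP) AllGather volume noted above, under assumed sizes; the dense figure is a rough per-worker ring all-reduce estimate (about 2N elements), not a measured result.

```python
# Rough per-worker communication volume: AllGather of top-k gradients vs. dense all-reduce.
N = 25_000_000          # assumed number of gradient elements (ResNet-50 scale)
k = N // 1000           # top-0.1% sparsification
for P in (8, 64, 256):  # number of workers
    allgather_elems = k * P           # each worker gathers k elements from every worker: O(kP)
    dense_elems = 2 * N               # ring all-reduce moves roughly 2N elements per worker
    print(f"P={P}: AllGather ~{allgather_elems:,} elements vs dense ~{dense_elems:,}")
```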

MG-WFBP: Efficient Data Communication for Distributed Synchronous SGD Algorithms

2 code implementations27 Nov 2018 Shaohuai Shi, Xiaowen Chu, Bo Li

Distributed synchronous stochastic gradient descent has been widely used to train deep neural networks on computer clusters.

Distributed, Parallel, and Cluster Computing

Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes

no code implementations30 Jul 2018 Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, Tiegang Chen, Guangxiao Hu, Shaohuai Shi, Xiaowen Chu

(3) We propose highly optimized all-reduce algorithms that achieve up to 3x and 11x speedups on AlexNet and ResNet-50, respectively, over NCCL-based training on a cluster with 1024 Tesla P40 GPUs.

Playing the Game of 2048

Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs

1 code implementation16 Nov 2017 Shaohuai Shi, Xiaowen Chu

Deep learning frameworks have been widely deployed on GPU servers for deep learning applications in both academia and industry.

Distributed, Parallel, and Cluster Computing

Performance Evaluation of Deep Learning Tools in Docker Containers

no code implementations9 Nov 2017 Pengfei Xu, Shaohuai Shi, Xiaowen Chu

We first benchmark the performance of system components (IO, CPU, and GPU) in a Docker container and on the host system, and compare the results to see whether there is any difference.

Management

Speeding up Convolutional Neural Networks By Exploiting the Sparsity of Rectifier Units

no code implementations25 Apr 2017 Shaohuai Shi, Xiaowen Chu

Rectifier neuron units (ReLUs) have been widely used in deep convolutional networks.

Supervised Learning Based Algorithm Selection for Deep Neural Networks

no code implementations10 Feb 2017 Shaohuai Shi, Pengfei Xu, Xiaowen Chu

In this paper, we focus on optimizing the operation of multiplying a matrix with the transpose of another matrix (referred to as the NT operation hereafter), which contributes about half of the training time of fully connected deep neural networks.
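
A tiny NumPy illustration of the NT pattern defined above (a matrix multiplied by the transpose of another); the shapes are arbitrary examples, not the paper's benchmark sizes.

```python
# NT operation: C = A @ B^T, one of the GEMM variants that dominates
# fully connected layer training.
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((128, 512))   # e.g., (batch, features)
B = rng.random((256, 512))   # second operand with the same inner dimension
C = A @ B.T                  # (128, 512) x (512, 256) -> (128, 256)
print(C.shape)
```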

Benchmarking State-of-the-Art Deep Learning Software Tools

no code implementations25 Aug 2016 Shaohuai Shi, Qiang Wang, Pengfei Xu, Xiaowen Chu

We first benchmark the running performance of these tools with three popular types of neural networks on two CPU platforms and three GPU platforms.

Benchmarking

Dissecting GPU Memory Hierarchy through Microbenchmarking

1 code implementation8 Sep 2015 Xinxin Mei, Xiaowen Chu

Memory access efficiency is a key factor in fully utilizing the computational power of graphics processing units (GPUs).

Hardware Architecture · Distributed, Parallel, and Cluster Computing
