Search Results for author: Zhihao Jia

Found 25 papers, 11 papers with code

Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding

1 code implementation • 19 Feb 2024 • Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, Beidi Chen

This paper introduces Sequoia, a scalable, robust, and hardware-aware algorithm for speculative decoding.
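For intuition, here is a minimal sketch of the plain speculative-decoding accept/reject loop that methods like Sequoia build on. `draft_model` and `target_model_prob` are hypothetical stand-ins, and Sequoia's actual contribution (scalable, hardware-aware token-tree construction) is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def draft_model(prefix, k):
    # Hypothetical cheap model: proposes k tokens and its probability for each.
    tokens = rng.integers(0, 100, size=k)
    probs = rng.uniform(0.1, 1.0, size=k)
    return tokens.tolist(), probs.tolist()

def target_model_prob(prefix, token):
    # Hypothetical expensive model: probability it assigns to `token` after `prefix`.
    return rng.uniform(0.1, 1.0)

def speculative_step(prefix, k=4):
    """Accept each drafted token with prob min(1, p_target/p_draft); stop at first rejection."""
    tokens, q = draft_model(prefix, k)
    accepted = []
    for t, q_t in zip(tokens, q):
        p_t = target_model_prob(prefix + accepted, t)
        if rng.uniform() < min(1.0, p_t / q_t):
            accepted.append(t)
        else:
            break  # on rejection, a real system resamples from the target's residual
    return accepted

print(speculative_step([1, 2, 3]))
```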

Accelerating Retrieval-Augmented Language Model Serving with Speculation

no code implementations • 25 Jan 2024 • Zhihao Zhang, Alan Zhu, Lijie Yang, Yihua Xu, Lanting Li, Phitchaya Mangpo Phothilimthana, Zhihao Jia

Retrieval-augmented language models (RaLM) have demonstrated the potential to solve knowledge-intensive natural language processing (NLP) tasks by combining a non-parametric knowledge base with a parametric language model.

Language Modelling · Retrieval
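As a rough illustration of the RaLM pattern described above, the sketch below retrieves the closest passage from a toy knowledge base and prepends it to the prompt. `embed` is a hypothetical stand-in for a real text encoder, and this is not the paper's speculation-based serving pipeline.

```python
import numpy as np

# Toy non-parametric knowledge base.
passages = ["Paris is the capital of France.", "The Nile flows through Africa."]

def embed(text):
    # Stand-in for a real text encoder (hash-seeded random vector).
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=8)

embeddings = np.stack([embed(p) for p in passages])

def retrieve(query, top_k=1):
    q = embed(query)
    scores = embeddings @ q / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q))
    return [passages[i] for i in np.argsort(-scores)[:top_k]]

def ralm_prompt(query):
    # Retrieved (non-parametric) knowledge conditions the parametric LM.
    context = " ".join(retrieve(query))
    return f"Context: {context}\nQuestion: {query}\nAnswer:"

print(ralm_prompt("What is the capital of France?"))  # a real system feeds this to an LM
```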

Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models

1 code implementation • 13 Jan 2024 • Zhengxin Zhang, Dan Zhao, Xupeng Miao, Gabriele Oliaro, Qing Li, Yong Jiang, Zhihao Jia

Experiments show that QST can reduce the total memory footprint by up to 2.3$\times$ and speed up the finetuning process by up to 3$\times$ while achieving competent performance compared with the state-of-the-art.
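Much of the memory saving comes from storing the frozen backbone in low precision. Below is a minimal sketch of blockwise absmax 4-bit quantization, assuming a block size of 64 and an int4 code range of [-8, 7]; QST's trainable side network and its exact quantization format are not shown.

```python
import numpy as np

def quantize_4bit(w, block=64):
    """Blockwise absmax 4-bit quantization: int4 codes + one fp16 scale per block."""
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0      # int4 range: [-8, 7]
    codes = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return codes, scale.astype(np.float16)

def dequantize(codes, scale):
    return codes.astype(np.float32) * scale.astype(np.float32)

w = np.random.default_rng(0).normal(size=4096 * 64).astype(np.float32)
codes, scale = quantize_4bit(w)
q_bytes = codes.size // 2 + scale.nbytes                     # int4 packs two codes per byte
print(f"fp32: {w.nbytes} B, 4-bit: {q_bytes} B, ratio: {w.nbytes / q_bytes:.1f}x")
print(f"mean abs error: {np.abs(w - dequantize(codes, scale).ravel()).mean():.4f}")
```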

Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems

no code implementations • 23 Dec 2023 • Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia

In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data.

Language Modelling · Large Language Model

SpotServe: Serving Generative Large Language Models on Preemptible Instances

1 code implementation • 27 Nov 2023 • Xupeng Miao, Chunan Shi, Jiangfei Duan, Xiaoli Xi, Dahua Lin, Bin Cui, Zhihao Jia

This paper aims to reduce the monetary cost of serving LLMs by leveraging preemptible GPU instances on modern clouds, which offer access to spare GPUs at a much lower price than regular instances but may be preempted by the cloud at any time.

Graph Matching

Drone-NeRF: Efficient NeRF Based 3D Scene Reconstruction for Large-Scale Drone Survey

no code implementations • 30 Aug 2023 • Zhihao Jia, Bing Wang, Changhao Chen

In this work, we propose the Drone-NeRF framework to enable efficient reconstruction of unbounded large-scale scenes captured by drone oblique photography using Neural Radiance Fields (NeRF).

3D Scene Reconstruction · Neural Rendering

Quarl: A Learning-Based Quantum Circuit Optimizer

no code implementations • 17 Jul 2023 • Zikun Li, Jinjun Peng, Yixuan Mei, Sina Lin, Yi Wu, Oded Padon, Zhihao Jia

Applying reinforcement learning (RL) to quantum circuit optimization raises two main challenges: the large and varying action space and the non-uniform state representation.

Reinforcement Learning (RL)

SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification

3 code implementations • 16 May 2023 • Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia

Our evaluation shows that SpecInfer outperforms existing LLM serving systems by 1.5-2.8x for distributed LLM inference and by 2.6-3.5x for offloading-based LLM inference, while preserving the same generative performance.

Language Modelling · Large Language Model
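For intuition, the sketch below greedily verifies a speculated token tree against a toy target model, keeping the longest path the target agrees with. `target_next_token` is a made-up stand-in; SpecInfer's real verifier checks the whole tree in a single batched forward pass using a tree attention mask.

```python
def target_next_token(prefix):
    # Deterministic toy "target model" for greedy verification.
    return (sum(prefix) * 31 + 7 * len(prefix)) % 5

def verify_tree(prefix, tree):
    """tree: {token: subtree}. Returns the longest speculated path the target agrees with."""
    accepted = []
    node = tree
    while node:
        want = target_next_token(prefix + accepted)
        if want in node:            # some speculated branch matches the target's choice
            accepted.append(want)
            node = node[want]
        else:
            break
    return accepted

# Speculated token tree: root branches on tokens 2 and 4.
tree = {2: {0: {}, 3: {}}, 4: {1: {}}}
print(verify_tree([1, 2], tree))    # -> [2]
```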

Quark: A Gradient-Free Quantum Learning Framework for Classification Tasks

no code implementations • 2 Oct 2022 • Zhihao Zhang, Zhuoming Chen, Heyang Huang, Zhihao Jia

To address the limitations of existing quantum ML methods, we introduce Quark, a gradient-free quantum learning framework that optimizes quantum ML models using quantum optimization.

Edge Detection

OLLIE: Derivation-based Tensor Program Optimizer

no code implementations • 2 Aug 2022 • Liyan Zheng, Haojie Wang, Jidong Zhai, Muyan Hu, Zixuan Ma, Tuowei Wang, Shizhi Tang, Lei Xie, Kezhao Huang, Zhihao Jia

Boosting the runtime performance of deep neural networks (DNNs) is critical due to their wide adoption in real-world tasks.

BOND: Benchmarking Unsupervised Outlier Node Detection on Static Attributed Graphs

2 code implementations • 21 Jun 2022 • Kay Liu, Yingtong Dou, Yue Zhao, Xueying Ding, Xiyang Hu, Ruitong Zhang, Kaize Ding, Canyu Chen, Hao Peng, Kai Shu, Lichao Sun, Jundong Li, George H. Chen, Zhihao Jia, Philip S. Yu

To bridge this gap, we present, to the best of our knowledge, the first comprehensive benchmark for unsupervised outlier node detection on static attributed graphs, called BOND, with the following highlights.

Anomaly Detection · Benchmarking · +2

Optimizing Mixture of Experts using Dynamic Recompilations

no code implementations • 4 May 2022 • Ferdinand Kossmann, Zhihao Jia, Alex Aiken

The Mixture of Experts architecture allows for outrageously large neural networks by scaling model parameter size independently from computational demand (FLOPs).
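The decoupling of parameter count from FLOPs is visible in a minimal top-k routing sketch, shown below with assumed sizes (8 experts, top-2 routing): total parameters grow with the number of experts, but each token only executes the two experts the router selects.

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_experts, top_k = 16, 8, 2

# Parameters grow linearly with num_experts, but each token only pays
# the FLOPs of the top_k experts the router selects.
experts = [(rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d)))
           for _ in range(num_experts)]
router = rng.normal(size=(d, num_experts))

def moe_forward(x):
    logits = x @ router                               # routing scores, one per expert
    top = np.argsort(-logits)[:top_k]                 # indices of the top_k experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()
    out = np.zeros(d)
    for g, i in zip(gates, top):
        w1, w2 = experts[i]
        out += g * (np.maximum(x @ w1, 0.0) @ w2)     # only top_k experts execute
    return out

print(moe_forward(rng.normal(size=d)).shape)          # -> (16,)
```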

Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs

no code implementations • 26 Apr 2022 • John Thorpe, Pengzhan Zhao, Jonathan Eyolfson, Yifan Qiao, Zhihao Jia, Minjia Zhang, Ravi Netravali, Guoqing Harry Xu

DNN models across many domains continue to grow in size, resulting in high resource requirements for effective training, and unpalatable (and often unaffordable) costs for organizations and research labs across scales.

Collage: Seamless Integration of Deep Learning Backends with Automatic Placement

1 code implementation • 1 Nov 2021 • Byungsoo Jeon, Sunghyun Park, Peiyuan Liao, Sheng Xu, Tianqi Chen, Zhihao Jia

Given the fast-evolving nature of the DL ecosystem, this manual approach often slows down innovation across layers: hardware vendors cannot deploy their cutting-edge libraries quickly, DL framework developers must repeatedly adjust hand-coded rules to accommodate new library versions, and machine learning practitioners must wait for new technologies to be integrated, often encountering unsatisfactory performance in the meantime.

TOD: GPU-accelerated Outlier Detection via Tensor Operations

2 code implementations • 26 Oct 2021 • Yue Zhao, George H. Chen, Zhihao Jia

Outlier detection (OD) is a key learning task for finding rare and deviant data samples, with many time-critical applications such as fraud detection and intrusion detection.

Fraud Detection · Intrusion Detection · +2
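The core idea, expressing neighbor-distance computations as dense tensor operations that map well to GPUs, can be sketched in a few lines. The kNN-distance score and single-GEMM pairwise-distance trick below are generic illustrations, not TOD's actual batched multi-GPU kernels.

```python
import numpy as np

def knn_outlier_scores(X, k=5):
    """Distance to the k-th nearest neighbor, computed with dense tensor ops."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # ||a-b||^2 via one GEMM
    d2 = np.maximum(d2, 0.0)                           # clip numerical negatives
    np.fill_diagonal(d2, np.inf)                       # exclude self-distance
    kth = np.partition(d2, k - 1, axis=1)[:, k - 1]
    return np.sqrt(kth)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 8)),
               rng.normal(loc=6.0, size=(5, 8))])      # 5 planted outliers
scores = knn_outlier_scores(X)
print(np.argsort(-scores)[:5])                         # indices 200..204 rank highest
```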

GradSign: Model Performance Inference with Theoretical Insights

1 code implementation • ICLR 2022 • Zhihao Zhang, Zhihao Jia

In addition, we design GradSign, an accurate and simple approximation of $\Psi$ using the gradients of a network evaluated at a random initialization state.

Neural Architecture Search
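A loose sketch of the underlying intuition, how consistently per-sample gradients at a random initialization point in the same direction, is below on a tiny linear model. The statistic computed here is an illustrative simplification, not the paper's exact definition of $\Psi$ or GradSign.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny linear model at a random initialization; per-sample gradients of the
# squared loss w.r.t. the weights are just error * input.
n, d = 32, 10
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
w = rng.normal(size=d) * 0.1                      # random init, never trained

errors = X @ w - y
per_sample_grads = errors[:, None] * X            # shape (n, d)

# Sign-agreement statistic: for each weight, how consistently do the
# per-sample gradients point the same way? Averaged into a single score.
signs = np.sign(per_sample_grads)
agreement = np.abs(signs.mean(axis=0))            # 1.0 = all samples agree
print(f"GradSign-style score: {agreement.mean():.3f}")
```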

Dorylus: Affordable, Scalable, and Accurate GNN Training with Distributed CPU Servers and Serverless Threads

1 code implementation • 24 May 2021 • John Thorpe, Yifan Qiao, Jonathan Eyolfson, Shen Teng, Guanzhou Hu, Zhihao Jia, Jinliang Wei, Keval Vora, Ravi Netravali, Miryung Kim, Guoqing Harry Xu

Computation separation makes it possible to construct a deep, bounded-asynchronous pipeline where graph and tensor parallel tasks can fully overlap, effectively hiding the network latency incurred by Lambdas.

IOS: Inter-Operator Scheduler for CNN Acceleration

1 code implementation • 2 Nov 2020 • Yaoyao Ding, Ligeng Zhu, Zhihao Jia, Gennady Pekhimenko, Song Han

To accelerate CNN inference, existing deep learning frameworks focus on optimizing intra-operator parallelization.
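The inter-operator idea can be illustrated with two independent branches of a network: run them concurrently instead of back to back. The sketch below uses Python threads and `time.sleep` as stand-ins for kernel launches; IOS itself schedules CNN operators on GPU with a dynamic-programming search, which this does not reproduce.

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Toy operator DAG: two independent convolution branches that merge at the end.
# A purely intra-operator framework runs b1 and b2 sequentially; an
# inter-operator scheduler can run them concurrently, since neither depends
# on the other.

def op(name, seconds):
    time.sleep(seconds)                    # stand-in for a kernel execution
    return name

start = time.time()
with ThreadPoolExecutor(max_workers=2) as pool:
    b1 = pool.submit(op, "conv3x3", 0.2)   # branch 1
    b2 = pool.submit(op, "conv5x5", 0.2)   # branch 2, independent of branch 1
    merged = (b1.result(), b2.result())    # concat happens after both finish
print(f"{merged} in {time.time() - start:.2f}s (vs ~0.4s sequentially)")
```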

Redundancy-Free Computation Graphs for Graph Neural Networks

no code implementations • 9 Jun 2019 • Zhihao Jia, Sina Lin, Rex Ying, Jiaxuan You, Jure Leskovec, Alex Aiken

Graph Neural Networks (GNNs) are based on repeated aggregations of information across nodes' neighbors in a graph.
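One round of the neighbor aggregation in question, written as an adjacency-matrix multiply over a toy 4-node graph, is sketched below. The paper's observation is that naive message passing like this recomputes identical partial aggregations for nodes that share neighbors.

```python
import numpy as np

# Undirected 4-node graph; nodes 0, 1, and 2 form a triangle, node 3 hangs off node 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.eye(4)                               # one-hot initial node features

deg = A.sum(axis=1, keepdims=True)
H_next = (A @ H) / deg                      # mean of each node's neighbors' features
print(H_next)
```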

Beyond Data and Model Parallelism for Deep Neural Networks

no code implementations • 14 Jul 2018 • Zhihao Jia, Matei Zaharia, Alex Aiken

We also propose FlexFlow, a deep learning framework that uses guided randomized search of the SOAP space to find a fast parallelization strategy for a specific parallel machine.

Distributed, Parallel, and Cluster Computing
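A toy version of randomized strategy search is sketched below: sample a parallel degree per layer and keep the configuration with the lowest simulated cost. The cost model and layer names are made up; FlexFlow's actual search is guided by an execution simulator over the full SOAP (Sample/Operator/Attribute/Parameter) space.

```python
import random

random.seed(0)

layers = ["conv1", "conv2", "fc"]
choices = [1, 2, 4, 8]                      # candidate parallel degrees per layer

def simulated_cost(strategy):
    # Made-up cost model: compute shrinks with parallelism, communication grows.
    return sum(10.0 / d + 0.5 * d for d in strategy.values())

best, best_cost = None, float("inf")
for _ in range(200):                        # randomized search over strategies
    s = {layer: random.choice(choices) for layer in layers}
    c = simulated_cost(s)
    if c < best_cost:
        best, best_cost = s, c
print(best, f"cost={best_cost:.1f}")        # converges to degree 4 per layer here
```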

Exploring Hidden Dimensions in Accelerating Convolutional Neural Networks

no code implementations • ICML 2018 • Zhihao Jia, Sina Lin, Charles R. Qi, Alex Aiken

The past few years have witnessed growth in the computational requirements for training deep convolutional neural networks.

Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks

no code implementations • 14 Feb 2018 • Zhihao Jia, Sina Lin, Charles R. Qi, Alex Aiken

The past few years have witnessed growth in the computational requirements for training deep convolutional neural networks.

Exploring the Hidden Dimension in Accelerating Convolutional Neural Networks

no code implementations • ICLR 2018 • Zhihao Jia, Sina Lin, Charles R. Qi, Alex Aiken

DeePa is a deep learning framework that explores parallelism in all parallelizable dimensions to accelerate the training process of convolutional neural networks.
