Search Results for author: Mao Yang

Found 25 papers, 12 papers with code

LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

no code implementations 21 Feb 2024 Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, Mao Yang

This is achieved by three key innovations: (i) we identify and exploit two forms of non-uniformities in positional interpolation through an efficient search, providing a better initialization for fine-tuning and enabling an 8x extension in non-fine-tuning scenarios; (ii) we introduce a progressive extension strategy that first fine-tunes a 256k length LLM and then conducts a second positional interpolation on the fine-tuned extended LLM to achieve a 2048k context window; (iii) we readjust LongRoPE on 8k length to recover the short context window performance.

8k
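
To make the interpolation idea concrete, here is a rough sketch (not the authors' code): in RoPE, each pair of hidden dimensions rotates at its own frequency, and positional interpolation rescales those frequencies so positions beyond the trained window map back into the trained range. The per-dimension factors below are hypothetical; LongRoPE finds the actual non-uniform factors with its efficient search.

```python
# A minimal sketch of non-uniform positional interpolation for RoPE.
# The rescale factors here are made up for illustration; LongRoPE
# searches for them rather than fixing a schedule.
import numpy as np

def rope_angles(positions, dim, base=10000.0, rescale=None):
    """RoPE rotation angles; `rescale` optionally slows each frequency."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # one frequency per dim pair
    if rescale is not None:
        inv_freq = inv_freq / rescale                  # positional interpolation
    return np.outer(positions, inv_freq)               # (num_positions, dim // 2)

dim, orig_ctx, target_ctx = 64, 4096, 32768
ratio = target_ctx / orig_ctx                          # an 8x extension
positions = np.arange(target_ctx)

# Uniform interpolation: every dimension is slowed by the same factor.
uniform = rope_angles(positions, dim, rescale=np.full(dim // 2, ratio))

# Hypothetical non-uniform factors: high-frequency dimensions are left
# nearly intact while low-frequency ones take the full ratio.
nonuniform = rope_angles(positions, dim, rescale=np.linspace(1.0, ratio, dim // 2))
print(uniform.shape, nonuniform.shape)                 # (32768, 32) (32768, 32)
```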

Fewer is More: Boosting LLM Reasoning with Reinforced Context Pruning

no code implementations 14 Dec 2023 Xijie Huang, Li Lyna Zhang, Kwang-Ting Cheng, Fan Yang, Mao Yang

In this work, we propose CoT-Influx, a novel approach that pushes the boundary of few-shot Chain-of-Thoughts (CoT) learning to improve LLM mathematical reasoning.

Arithmetic Reasoning · Few-Shot Learning +3

Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models

1 code implementation 8 Oct 2023 Song Guo, Jiahang Xu, Li Lyna Zhang, Mao Yang

To this end, Compresso prunes LLaMA-7B to 5.4B, maintaining original performance and even surpassing LLaMA-7B in reading comprehension by 2.62%.

Natural Language Understanding · Reading Comprehension

Model-enhanced Vector Index

1 code implementation NeurIPS 2023 Hailin Zhang, Yujing Wang, Qi Chen, Ruiheng Chang, Ting Zhang, Ziming Miao, Yingyan Hou, Yang Ding, Xupeng Miao, Haonan Wang, Bochen Pang, Yuefeng Zhan, Hao Sun, Weiwei Deng, Qi Zhang, Fan Yang, Xing Xie, Mao Yang, Bin Cui

We empirically show that our model achieves better performance on the commonly used academic benchmarks MSMARCO Passage and Natural Questions, with comparable serving latency to dense retrieval solutions.

Natural Questions · Quantization +1

Accelerating In-Browser Deep Learning Inference on Diverse Edge Clients through Just-in-Time Kernel Optimizations

no code implementations 16 Sep 2023 Fucheng Jia, Shiqi Jiang, Ting Cao, Wei Cui, Tianrui Xia, Xu Cao, Yuanchun Li, Deyu Zhang, Ju Ren, Yunxin Liu, Lili Qiu, Mao Yang

Web applications are increasingly becoming the primary platform for AI service delivery, making in-browser deep learning (DL) inference more prominent.

Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference

no code implementations 23 Aug 2023 Ranggi Hwang, Jianyu Wei, Shijie Cao, Changho Hwang, Xiaohu Tang, Ting Cao, Mao Yang

To tackle the high compute requirements of LLMs, the Mixture-of-Experts (MoE) architecture was introduced, which scales up model size without proportionally scaling up its computational requirements.
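
For readers new to MoE, the reason it decouples parameter count from compute is that a router activates only k experts per token: parameters grow with the number of experts, while per-token compute grows only with k. Below is a minimal, generic top-k routing sketch, an illustration of standard MoE rather than the paper's pre-gating design.

```python
# A tiny top-k MoE layer (illustrative only): 8 experts' worth of
# parameters, but each token runs through just k=2 of them.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                          # x: (tokens, d_model)
        weights, idx = self.gate(x).topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # renormalize over the k picks
        out = torch.zeros_like(x)
        for slot in range(self.k):                 # only k experts run per token
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out

moe = TinyMoE()
print(moe(torch.randn(10, 64)).shape)              # torch.Size([10, 64])
```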

Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference

1 code implementation 26 Jun 2023 Junyan Li, Li Lyna Zhang, Jiahang Xu, Yujing Wang, Shaoguang Yan, Yunqing Xia, Yuqing Yang, Ting Cao, Hao Sun, Weiwei Deng, Qi Zhang, Mao Yang

Deploying pre-trained transformer models like BERT on downstream tasks in resource-constrained scenarios is challenging due to their high inference cost, which grows rapidly with input sequence length.

Model Compression
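
To make the token-pruning idea concrete, here is a generic sketch (not the paper's constraint-aware, ranking-distilled method): score each token by the attention it receives, keep the top fraction, and let deeper layers process the shorter sequence.

```python
# Attention-based token pruning between transformer layers (illustrative).
import torch

def prune_tokens(hidden, attn, keep_ratio=0.5):
    """hidden: (batch, seq, dim); attn: (batch, heads, seq, seq)."""
    # Importance = total attention each token receives, averaged over heads.
    importance = attn.mean(dim=1).sum(dim=1)                 # (batch, seq)
    keep = max(1, int(hidden.size(1) * keep_ratio))
    idx = importance.topk(keep, dim=-1).indices.sort(dim=-1).values
    return hidden.gather(1, idx.unsqueeze(-1).expand(-1, -1, hidden.size(-1)))

h = torch.randn(2, 128, 64)
a = torch.softmax(torch.randn(2, 8, 128, 128), dim=-1)
print(prune_tokens(h, a).shape)                              # torch.Size([2, 64, 64])
```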

Accurate and Structured Pruning for Efficient Automatic Speech Recognition

no code implementations 31 May 2023 Huiqiang Jiang, Li Lyna Zhang, Yuang Li, Yu Wu, Shijie Cao, Ting Cao, Yuqing Yang, Jinyu Li, Mao Yang, Lili Qiu

In this paper, we propose a novel compression strategy that leverages structured pruning and knowledge distillation to reduce the model size and inference cost of the Conformer model while preserving high recognition performance.

Automatic Speech Recognition · Automatic Speech Recognition (ASR) +2
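
As a sketch of the distillation half of such a pipeline (generic, not the paper's exact objective), the loss below mixes a task loss with a temperature-scaled KL term pulling the pruned student's logits toward the dense teacher's; a real ASR setup would use CTC or transducer losses in place of plain cross-entropy, and T and alpha are illustrative.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    task = F.cross_entropy(student_logits, targets)          # stand-in task loss
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * (T * T)                                              # standard T^2 rescaling
    return alpha * task + (1 - alpha) * kd

s = torch.randn(8, 100)                 # pruned student's logits
t = torch.randn(8, 100)                 # frozen dense teacher's logits
y = torch.randint(0, 100, (8,))
print(distill_loss(s, t, y).item())
```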

Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models

no code implementations 21 May 2023 Yijia Zhang, Lingran Zhao, Shijie Cao, WenQiang Wang, Ting Cao, Fan Yang, Mao Yang, Shanghang Zhang, Ningyi Xu

In this study, we conduct a comparative analysis of INT and FP quantization with the same bit-width, revealing that the optimal quantization format varies across different layers due to the complexity and diversity of tensor distributions.

Quantization
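
The intuition behind the layer-by-layer difference can be reproduced in a few lines: FP grids are non-uniform (dense near zero, wide dynamic range), so they tend to fit outlier-heavy tensors better, while INT's uniform grid can win on flatter distributions. The sketch below uses made-up tensors and simplified 4-bit grids, not the paper's formats or models.

```python
# Compare mean-squared quantization error of a uniform INT grid vs. a
# simulated low-bit FP grid on two synthetic tensor distributions.
import torch

def quant_error(x, grid):
    # Snap every element to its nearest representable value in `grid`.
    q = grid[(x.unsqueeze(-1) - grid).abs().argmin(dim=-1)]
    return (x - q).pow(2).mean().item()

def int_grid(bits, max_abs):
    levels = 2 ** (bits - 1) - 1
    return torch.arange(-levels, levels + 1) / levels * max_abs

def fp_grid(exp_bits, man_bits, max_abs):
    # All magnitudes (1 + m / 2^man) * 2^e, rescaled so the largest hits max_abs.
    mags = torch.tensor([(1 + m / 2 ** man_bits) * 2.0 ** e
                         for e in range(2 ** exp_bits)
                         for m in range(2 ** man_bits)])
    mags = mags / mags.max() * max_abs
    return torch.cat([-mags, torch.zeros(1), mags])

heavy = torch.randn(4096) ** 3            # outlier-heavy, like some LLM tensors
flat = torch.rand(4096) * 2 - 1           # near-uniform
for name, x in [("heavy-tailed", heavy), ("flat", flat)]:
    m = x.abs().max()
    print(name, "INT4:", quant_error(x, int_grid(4, m)),
          "FP4(e2m1):", quant_error(x, fp_grid(2, 1, m)))
```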

IRGen: Generative Modeling for Image Retrieval

1 code implementation 17 Mar 2023 Yidan Zhang, Ting Zhang, Dong Chen, Yujing Wang, Qi Chen, Xing Xie, Hao Sun, Weiwei Deng, Qi Zhang, Fan Yang, Mao Yang, Qingmin Liao, Baining Guo

While generative modeling has been ubiquitous in natural language processing and computer vision, its application to image retrieval remains unexplored.

Image Retrieval · Retrieval

ElasticViT: Conflict-aware Supernet Training for Deploying Fast Vision Transformer on Diverse Mobile Devices

1 code implementation ICCV 2023 Chen Tang, Li Lyna Zhang, Huiqiang Jiang, Jiahang Xu, Ting Cao, Quanlu Zhang, Yuqing Yang, Zhi Wang, Mao Yang

However, prior supernet training methods that rely on uniform sampling suffer from the gradient conflict issue: the sampled subnets can have vastly different model sizes (e.g., 50M vs. 2G FLOPs), leading to different optimization directions and inferior performance.

Neural Architecture Search

SpaceEvo: Hardware-Friendly Search Space Design for Efficient INT8 Inference

1 code implementation ICCV 2023 Li Lyna Zhang, Xudong Wang, Jiahang Xu, Quanlu Zhang, Yujing Wang, Yuqing Yang, Ningxin Zheng, Ting Cao, Mao Yang

The combination of Neural Architecture Search (NAS) and quantization has proven successful in automatically designing low-FLOPs INT8 quantized neural networks (QNNs).

Neural Architecture Search · Quantization

SwiftPruner: Reinforced Evolutionary Pruning for Efficient Ad Relevance

no code implementations 30 Aug 2022 Li Lyna Zhang, Youkow Homma, Yujing Wang, Min Wu, Mao Yang, Ruofei Zhang, Ting Cao, Wei Shen

Remarkably, under our latency requirement of 1900us on CPU, SwiftPruner achieves a 0.86% higher AUC than the state-of-the-art uniform sparse baseline for BERT-Mini on a large-scale real-world dataset.

LordNet: Learning to Solve Parametric Partial Differential Equations without Simulated Data

no code implementations 19 Jun 2022 Wenlei Shi, Xinquan Huang, Xiaotian Gao, Xinran Wei, Jia Zhang, Jiang Bian, Mao Yang, Tie-Yan Liu

Neural operators, as a powerful approximation to the non-linear operators between infinite-dimensional function spaces, have proved to be promising in accelerating the solution of partial differential equations (PDEs).

Tutel: Adaptive Mixture-of-Experts at Scale

2 code implementations 7 Jun 2022 Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, Joe Chau, Peng Cheng, Fan Yang, Mao Yang, Yongqiang Xiong

On efficiency, Tutel accelerates SwinV2-MoE, achieving up to 1.55x and 2.11x speedups in training and inference over Fairseq, respectively.

Object Detection

A Neural Corpus Indexer for Document Retrieval

1 code implementation 6 Jun 2022 Yujing Wang, Yingyan Hou, Haonan Wang, Ziming Miao, Shibin Wu, Hao Sun, Qi Chen, Yuqing Xia, Chengmin Chi, Guoshuai Zhao, Zheng Liu, Xing Xie, Hao Allen Sun, Weiwei Deng, Qi Zhang, Mao Yang

To this end, we propose Neural Corpus Indexer (NCI), a sequence-to-sequence network that generates relevant document identifiers directly for a designated query.

Retrieval · TriviaQA
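
One common ingredient in this line of work, sketched generically below (an illustration, not NCI's implementation), is giving each document a short hierarchical "semantic identifier" via recursive k-means over its embedding, so that similar documents share identifier prefixes that a sequence-to-sequence decoder can exploit.

```python
# Assign hierarchical semantic ids by recursive k-means (illustrative).
import numpy as np
from sklearn.cluster import KMeans

def semantic_ids(embs, k=4, depth=3, prefix=(), out=None, idx=None):
    out = {} if out is None else out
    idx = np.arange(len(embs)) if idx is None else idx
    if depth == 0 or len(idx) <= k:            # leaf: disambiguate by rank
        for rank, doc in enumerate(idx):
            out[int(doc)] = prefix + (rank,)
        return out
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(embs[idx])
    for c in range(k):                         # recurse into each cluster
        semantic_ids(embs, k, depth - 1, prefix + (c,), out, idx[labels == c])
    return out

docs = np.random.randn(200, 32)                # stand-in document embeddings
ids = semantic_ids(docs)
print(ids[0], ids[1])                          # tuples like (2, 0, 1, 3); shared
                                               # prefixes mean nearby documents
```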

WRENCH: A Comprehensive Benchmark for Weak Supervision

1 code implementation 23 Sep 2021 Jieyu Zhang, Yue Yu, Yinghao Li, Yujing Wang, Yaming Yang, Mao Yang, Alexander Ratner

To address these problems, we introduce a benchmark platform, WRENCH, for thorough and standardized evaluation of WS approaches.

OpEvo: An Evolutionary Method for Tensor Operator Optimization

no code implementations 10 Jun 2020 Xiaotian Gao, Cui Wei, Lintao Zhang, Mao Yang

The training and inference efficiency of deep neural networks relies heavily on the performance of tensor operators on hardware platforms.

Time-Series Anomaly Detection Service at Microsoft

3 code implementations 10 Jun 2019 Hansheng Ren, Bixiong Xu, Yujing Wang, Chao Yi, Congrui Huang, Xiaoyu Kou, Tony Xing, Mao Yang, Jie Tong, Qi Zhang

At Microsoft, we develop a time-series anomaly detection service that helps customers monitor time series continuously and alerts them to potential incidents in time.

Anomaly Detection · Saliency Detection +2
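
The "Saliency Detection" tag reflects the paper's core idea: treat anomalies as salient points in the series and score them with the spectral residual (SR) method borrowed from visual saliency detection. Below is a minimal SR sketch; the production service additionally trains a CNN on top of the saliency map.

```python
# Spectral-residual saliency for a 1-D series (simplified illustration).
import numpy as np

def spectral_residual(series, win=3):
    fft = np.fft.fft(series)
    log_amp = np.log(np.abs(fft) + 1e-8)
    phase = np.angle(fft)
    # Residual = log amplitude minus its local average over `win` bins.
    residual = log_amp - np.convolve(log_amp, np.ones(win) / win, mode="same")
    return np.abs(np.fft.ifft(np.exp(residual + 1j * phase)))

t = np.sin(np.linspace(0, 20 * np.pi, 1000))
t[700] += 5.0                                # inject a point anomaly
print(int(np.argmax(spectral_residual(t))))  # index of the most salient point
```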
