1 code implementation • 27 Feb 2024 • Zihao Liu, XiaoYu Zhang, Guangwei Liu, Ji Zhao, Ningyi Xu
Although map construction is essentially a point-set prediction task, MapQR adopts instance queries rather than point queries.
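The distinction is easy to see in tensor shapes. Below is a hypothetical sketch of the instance-query idea, not MapQR's actual code: one learnable query per map element is combined with a set of shared point embeddings, rather than learning an independent query for every point of every instance. All names and shapes are illustrative assumptions.

```python
import torch

# Hypothetical sketch (illustrative shapes, not MapQR's implementation).
# One learnable query per map instance is broadcast against shared point
# embeddings to obtain point-level decoder queries, instead of learning
# num_instances * num_points independent point queries.
num_instances, num_points, dim = 50, 20, 256

instance_queries = torch.nn.Parameter(torch.randn(num_instances, dim))
point_embeddings = torch.nn.Parameter(torch.randn(num_points, dim))

# Each instance query is shared by all of its points; the point embeddings
# add per-point positional information.
point_level_queries = instance_queries[:, None, :] + point_embeddings[None, :, :]
print(point_level_queries.shape)  # torch.Size([50, 20, 256])
```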
1 code implementation • 16 Feb 2024 • Dayou Du, Yijia Zhang, Shijie Cao, Jiaqi Guo, Ting Cao, Xiaowen Chu, Ningyi Xu
The upscaling of Large Language Models (LLMs) has yielded impressive advances in natural language processing, yet it also poses significant deployment challenges.
1 code implementation • 3 Nov 2023 • Yijia Zhang, Sicheng Zhang, Shijie Cao, Dayou Du, Jianyu Wei, Ting Cao, Ningyi Xu
Large language models (LLMs) perform well across a wide range of tasks, but face deployment challenges stemming from limited memory capacity and bandwidth.
1 code implementation • 30 Oct 2023 • Qiao Sun, Shiduo Zhang, Danjiao Ma, Jingzhe Shi, Derun Li, Simian Luo, Yu Wang, Ningyi Xu, Guangzhi Cao, Hang Zhao
STR reformulates the motion prediction and motion planning problems by arranging observations, states, and actions into one unified sequence modeling task.
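A minimal sketch of that unified-sequence formulation, assuming hypothetical token shapes and a generic causal transformer; this illustrates the idea only and is not STR's actual architecture:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: encode observations, past states, and actions as
# tokens in a single stream and model them with one causal transformer.
# Shapes and names are illustrative assumptions.
dim = 128
obs = torch.randn(1, 10, dim)     # encoded scene observations
states = torch.randn(1, 8, dim)   # past agent states
actions = torch.randn(1, 8, dim)  # past actions / future action slots

sequence = torch.cat([obs, states, actions], dim=1)  # one unified sequence

layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)

# A causal mask turns both prediction and planning into next-token problems
# over the same stream.
causal_mask = nn.Transformer.generate_square_subsequent_mask(sequence.size(1))
out = backbone(sequence, mask=causal_mask)
print(out.shape)  # torch.Size([1, 26, 128])
```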
no code implementations • 31 May 2023 • Yijia Zhang, Yibo Han, Shijie Cao, Guohao Dai, Youshan Miao, Ting Cao, Fan Yang, Ningyi Xu
We find that conventional gradient accumulation reduces activation memory but is incompatible with gradient memory reduction, because accumulation requires preserving gradients across micro-batches while memory reduction requires releasing them.
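A minimal PyTorch sketch of standard gradient accumulation makes the contradiction concrete: micro-batching shrinks activation memory, but every parameter's `.grad` buffer must persist across all micro-batches until the optimizer step, so gradient memory cannot be released early. The model and sizes below are illustrative.

```python
import torch

# Standard gradient accumulation (illustrative model and sizes).
model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
accum_steps = 4

for step in range(accum_steps):
    micro_batch = torch.randn(8, 1024)           # small micro-batch: low activation memory
    loss = model(micro_batch).pow(2).mean() / accum_steps
    loss.backward()                              # accumulates into .grad; buffers stay alive

optimizer.step()       # only after all micro-batches could gradient memory be freed
optimizer.zero_grad()
```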
no code implementations • 21 May 2023 • Yijia Zhang, Lingran Zhao, Shijie Cao, WenQiang Wang, Ting Cao, Fan Yang, Mao Yang, Shanghang Zhang, Ningyi Xu
In this study, we conduct a comparative analysis of INT and FP quantization at the same bit-width, revealing that the optimal quantization format varies across layers due to the complexity and diversity of tensor distributions.
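The effect can be reproduced with a toy comparison: quantize tensors drawn from different distributions onto a uniform INT4 grid and an FP4-style (E2M1-like) grid, then compare reconstruction error. The grids and distributions below are illustrative assumptions, not the paper's exact setup.

```python
import torch

def quantize_to_grid(x, grid):
    # Scale x so its max magnitude maps onto the grid, snap each value to
    # the nearest grid point, then rescale back.
    scale = x.abs().max() / grid.abs().max()
    idx = (x / scale).unsqueeze(-1).sub(grid).abs().argmin(dim=-1)
    return grid[idx] * scale

int4_grid = torch.arange(-7, 8).float()                      # uniform INT4 levels
fp4_grid = torch.tensor([0., .5, 1., 1.5, 2., 3., 4., 6.])   # E2M1-style magnitudes
fp4_grid = torch.cat([-fp4_grid, fp4_grid])                  # signed FP4 grid

for name, tensor in [("gaussian", torch.randn(4096)),
                     ("heavy-tailed", torch.randn(4096) * torch.randn(4096))]:
    for fmt, grid in [("INT4", int4_grid), ("FP4", fp4_grid)]:
        err = (tensor - quantize_to_grid(tensor, grid)).pow(2).mean()
        print(name, fmt, f"MSE={err:.5f}")
```

The non-uniform FP grid spends resolution near zero, so it tends to win on heavy-tailed tensors, while the uniform INT grid can win on more Gaussian-like ones, which is why the best format differs layer by layer.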
no code implementations • NeurIPS 2009 • Feng Yan, Ningyi Xu, Yuan Qi
Extensive experiments showed that our parallel inference methods consistently produced LDA models with the same predictive power as sequential training, while achieving a 26x speedup for CGS and a 196x speedup for CVB on a GPU with 30 multiprocessors; the speedup scales almost linearly with the number of multiprocessors available.
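For context, a minimal serial sketch of the per-token collapsed Gibbs sampling (CGS) update that such GPU methods parallelize; this is the standard LDA CGS kernel, with illustrative variable names.

```python
import numpy as np

# Standard per-token CGS update for LDA (serial baseline; illustrative names).
# n_dk = doc-topic counts, n_wk = word-topic counts, n_k = topic totals.
def cgs_update(d, w, z, n_dk, n_wk, n_k, alpha, beta, V):
    n_dk[d, z] -= 1; n_wk[w, z] -= 1; n_k[z] -= 1   # remove current assignment
    p = (n_dk[d] + alpha) * (n_wk[w] + beta) / (n_k + V * beta)
    z_new = np.random.choice(len(n_k), p=p / p.sum())
    n_dk[d, z_new] += 1; n_wk[w, z_new] += 1; n_k[z_new] += 1
    return z_new

K, V, D = 4, 100, 10
n_dk = np.zeros((D, K), int); n_wk = np.zeros((V, K), int); n_k = np.zeros(K, int)
n_dk[0, 2] += 1; n_wk[5, 2] += 1; n_k[2] += 1        # one token: doc 0, word 5, topic 2
z = cgs_update(0, 5, 2, n_dk, n_wk, n_k, alpha=0.1, beta=0.01, V=V)
```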