Search Results for author: Xianzhi Du

Found 26 papers, 14 papers with code

Ferret: Refer and Ground Anything Anywhere at Any Granularity

1 code implementation • 11 Oct 2023 • Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, BoWen Zhang, ZiRui Wang, Liangliang Cao, Shih-Fu Chang, Yinfei Yang

We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions.

Hallucination • Language Modelling • +1

Compressing LLMs: The Truth is Rarely Pure and Never Simple

1 code implementation • 2 Oct 2023 • Ajay Jaiswal, Zhe Gan, Xianzhi Du, BoWen Zhang, Zhangyang Wang, Yinfei Yang

Recently, several works have shown significant success in training-free and data-free compression (pruning and quantization) of LLMs, achieving 50-60% sparsity and reducing the bit width to 3 or 4 bits per weight with negligible perplexity degradation over the uncompressed baseline.

Quantization • Retrieval
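For a concrete sense of what "training-free, data-free pruning" means here, below is a minimal one-shot magnitude-pruning sketch in plain PyTorch; it is the simplest baseline of this family, not the specific compression methods the paper audits.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the smallest-magnitude entries so ~`sparsity` of the weights are zero."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

layer = torch.nn.Linear(4096, 4096)  # stand-in for one LLM projection matrix
with torch.no_grad():
    layer.weight.copy_(magnitude_prune(layer.weight, sparsity=0.5))
print(f"achieved sparsity: {(layer.weight == 0).float().mean():.1%}")
```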

Guiding Instruction-based Image Editing via Multimodal Large Language Models

2 code implementations • 29 Sep 2023 • Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, Zhe Gan

Extensive experimental results demonstrate that expressive instructions are crucial to instruction-based image editing, and our MGIE can lead to a notable improvement in automatic metrics and human evaluation while maintaining competitive inference efficiency.

Image Manipulation • Response Generation

Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts

no code implementations • 8 Sep 2023 • Erik Daxberger, Floris Weers, BoWen Zhang, Tom Gunter, Ruoming Pang, Marcin Eichner, Michael Emmersberger, Yinfei Yang, Alexander Toshev, Xianzhi Du

We empirically show that our sparse Mobile Vision MoEs (V-MoEs) can achieve a better trade-off between performance and efficiency than the corresponding dense ViTs.

MOFI: Learning Image Representations from Noisy Entity Annotated Images

1 code implementation • 13 Jun 2023 • Wentao Wu, Aleksei Timofeev, Chen Chen, BoWen Zhang, Kun Duan, Shuangning Liu, Yantao Zheng, Jonathon Shlens, Xianzhi Du, Zhe Gan, Yinfei Yang

Our approach involves employing a named entity recognition model to extract entities from the alt-text, and then using a CLIP model to select the correct entities as labels of the paired image.

Image Classification • Image Retrieval • +3
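The described pipeline (NER over the alt-text, then CLIP-based entity selection) can be sketched as follows; the spaCy and CLIP checkpoints are illustrative stand-ins, not the models used for MOFI.

```python
import spacy
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

ner = spacy.load("en_core_web_sm")  # stand-in NER model
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def entity_labels(image: Image.Image, alt_text: str, threshold: float = 0.2):
    """Extract candidate entities from alt-text, keep those CLIP matches to the image."""
    entities = [ent.text for ent in ner(alt_text).ents]
    if not entities:
        return []
    inputs = proc(text=entities, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = clip(**inputs).logits_per_image.softmax(dim=-1)[0]
    # The cutoff is a tunable assumption, not a value from the paper.
    return [e for e, p in zip(entities, probs.tolist()) if p >= threshold]
```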

Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness

1 code implementation • 8 May 2023 • Liangliang Cao, BoWen Zhang, Chen Chen, Yinfei Yang, Xianzhi Du, Wencong Zhang, Zhiyun Lu, Yantao Zheng

In this paper, we discuss two effective approaches to improve the efficiency and robustness of CLIP training: (1) augmenting the training dataset while maintaining the same number of optimization steps, and (2) filtering out samples that contain text regions in the image.

Adversarial Text • Retrieval
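Approach (2), filtering out samples whose images contain text regions, can be approximated with any off-the-shelf text detector; the Tesseract-based filter below is an assumption for illustration, not the detector used in the paper.

```python
from PIL import Image
import pytesseract

def contains_text_region(image: Image.Image, min_conf: float = 60.0) -> bool:
    """Return True if OCR finds at least one confident word in the image."""
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    # Tesseract reports confidence -1 for non-text blocks; keep confident words only.
    return any(w.strip() and float(c) >= min_conf
               for w, c in zip(data["text"], data["conf"]))

# usage over (image, caption) pairs:
# clean = [(img, cap) for img, cap in pairs if not contains_text_region(img)]
```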

AdaMV-MoE: Adaptive Multi-Task Vision Mixture-of-Experts

1 code implementation • ICCV 2023 • Tianlong Chen, Xuxi Chen, Xianzhi Du, Abdullah Rashwan, Fan Yang, Huizhong Chen, Zhangyang Wang, Yeqing Li

Instead of compressing multiple tasks' knowledge into a single model, MoE separates the parameter space and only utilizes the relevant model pieces given the task type and its input, which provides stabilized MTL training and ultra-efficient inference.

Instance Segmentation • Multi-Task Learning • +3
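A simplified sketch of task-aware sparse routing in that spirit: each token runs through only its top-k experts. AdaMV-MoE adapts the number of active experts per task automatically; here k comes from a fixed, hand-set map with hypothetical task names.

```python
import torch
import torch.nn as nn

class TaskMoE(nn.Module):
    """Toy task-aware MoE layer: top-k expert routing with a per-task k."""
    def __init__(self, dim: int, num_experts: int = 8):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts)
        self.k_per_task = {"detection": 2, "segmentation": 2, "classification": 1}

    def forward(self, x: torch.Tensor, task: str) -> torch.Tensor:
        k = self.k_per_task[task]                 # fixed here; adaptive in the paper
        gates = self.router(x).softmax(dim=-1)    # (tokens, num_experts)
        topv, topi = gates.topk(k, dim=-1)        # only k experts run per token
        out = torch.zeros_like(x)
        for slot in range(k):
            for e, expert in enumerate(self.experts):
                sel = topi[:, slot] == e
                if sel.any():
                    out[sel] += topv[sel, slot, None] * expert(x[sel])
        return out
```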

Optimizing Anchor-based Detectors for Autonomous Driving Scenes

no code implementations • 11 Aug 2022 • Xianzhi Du, Wei-Chih Hung, Tsung-Yi Lin

This paper summarizes model improvements and inference-time optimizations for popular anchor-based detectors in autonomous driving scenes.

Autonomous Driving

Provable Stochastic Optimization for Global Contrastive Learning: Small Batch Does Not Harm Performance

1 code implementation • 24 Feb 2022 • Zhuoning Yuan, Yuexin Wu, Zi-Hao Qiu, Xianzhi Du, Lijun Zhang, Denny Zhou, Tianbao Yang

In this paper, we study contrastive learning from an optimization perspective, aiming to analyze and address a fundamental issue of existing contrastive learning methods that either rely on a large batch size or a large dictionary of feature vectors.

Contrastive Learning • Self-Supervised Learning • +1
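The batch-size dependence at issue is easiest to see in the standard in-batch InfoNCE objective, sketched below: all other samples in the batch serve as negatives, so a small batch means few negatives. This is the baseline the paper analyzes, not its proposed algorithm.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """In-batch InfoNCE over two views z1, z2 of shape (batch, dim)."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / tau                               # (batch, batch) similarities
    labels = torch.arange(z1.size(0), device=z1.device)    # positives on the diagonal
    return F.cross_entropy(logits, labels)                 # negatives = rest of the batch
```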

Auto-scaling Vision Transformers without Training

1 code implementation • ICLR 2022 • Wuyang Chen, Wei Huang, Xianzhi Du, Xiaodan Song, Zhangyang Wang, Denny Zhou

The motivation comes from two pain points: 1) the lack of efficient and principled methods for designing and scaling ViTs; 2) the tremendous computational cost of training ViTs, which is much heavier than that of their convolutional counterparts.

A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation

3 code implementations • 17 Dec 2021 • Wuyang Chen, Xianzhi Du, Fan Yang, Lucas Beyer, Xiaohua Zhai, Tsung-Yi Lin, Huizhong Chen, Jing Li, Xiaodan Song, Zhangyang Wang, Denny Zhou

In this paper, we comprehensively study three architecture design choices on ViT -- spatial reduction, doubled channels, and multiscale features -- and demonstrate that a vanilla ViT architecture can fulfill this goal without handcrafting multiscale features, maintaining the original ViT design philosophy.

Image Classification • Instance Segmentation • +6

Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text

no code implementations • 14 Dec 2021 • Qing Li, Boqing Gong, Yin Cui, Dan Kondratyuk, Xianzhi Du, Ming-Hsuan Yang, Matthew Brown

The experiments show that the resultant unified foundation transformer works surprisingly well on both the vision-only and text-only tasks, and the proposed knowledge distillation and gradient masking strategy can effectively lift the performance to approach the level of separately-trained models.

Image Classification • Knowledge Distillation • +1
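Gradient masking in this joint-training setting can be illustrated generically as zeroing gradients outside the branch relevant to the current batch's modality; the paper's exact masking rule may differ, and the module prefix below is hypothetical.

```python
import torch

def mask_gradients(model: torch.nn.Module, active_prefix: str) -> None:
    """Zero gradients of all parameters not under `active_prefix` (a generic sketch)."""
    for name, param in model.named_parameters():
        if param.grad is not None and not name.startswith(active_prefix):
            param.grad.zero_()

# loss.backward()
# mask_gradients(model, active_prefix="vision_branch")  # hypothetical module name
# optimizer.step()
```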

Revisiting 3D ResNets for Video Recognition

5 code implementations • 3 Sep 2021 • Xianzhi Du, Yeqing Li, Yin Cui, Rui Qian, Jing Li, Irwan Bello

Recent work from Bello shows that training and scaling strategies may be more significant than model architectures for visual recognition.

Action Classification • Contrastive Learning • +1

Simple Training Strategies and Model Scaling for Object Detection

1 code implementation • 30 Jun 2021 • Xianzhi Du, Barret Zoph, Wei-Chih Hung, Tsung-Yi Lin

We benchmark these improvements on the vanilla ResNet-FPN backbone with RetinaNet and RCNN detectors.

Instance Segmentation • Object • +3

Dilated SpineNet for Semantic Segmentation

no code implementations • 23 Mar 2021 • Abdullah Rashwan, Xianzhi Du, Xiaoqi Yin, Jing Li

Scale-permuted networks have shown promising results on object bounding box detection and instance segmentation.

Instance Segmentation • Segmentation • +1

Revisiting ResNets: Improved Training and Scaling Strategies

3 code implementations • NeurIPS 2021 • Irwan Bello, William Fedus, Xianzhi Du, Ekin D. Cubuk, Aravind Srinivas, Tsung-Yi Lin, Jonathon Shlens, Barret Zoph

Using improved training and scaling strategies, we design a family of ResNet architectures, ResNet-RS, which are 1.7x-2.7x faster than EfficientNets on TPUs, while achieving similar accuracies on ImageNet.

Action Classification • Document Image Classification • +2

SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization

13 code implementations • CVPR 2020 • Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V. Le, Xiaodan Song

We propose SpineNet, a backbone with scale-permuted intermediate features and cross-scale connections that is learned on an object detection task by Neural Architecture Search.

General Classification • Image Classification • +5

TW-SMNet: Deep Multitask Learning of Tele-Wide Stereo Matching

no code implementations • 11 Jun 2019 • Mostafa El-Khamy, Haoyu Ren, Xianzhi Du, Jungwon Lee

In this paper, we introduce the problem of estimating the real-world depth of elements in a scene captured by two cameras with different fields of view, where the first field of view (FOV) is a wide FOV (WFOV) captured by a wide-angle lens, and the second FOV is contained within the first and is captured by a tele zoom lens.

Depth Estimation • Disparity Estimation • +2
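Under an idealized pinhole model with aligned optical axes, the tele FOV corresponds to a central crop of the wide image whose relative size follows directly from the two FOV angles; real tele-wide pairs additionally need calibration, so this is only a back-of-the-envelope sketch.

```python
import math

def tele_crop_fraction(fov_wide_deg: float, fov_tele_deg: float) -> float:
    """Fraction of the wide image's width covered by the tele FOV (aligned pinhole model)."""
    return (math.tan(math.radians(fov_tele_deg) / 2)
            / math.tan(math.radians(fov_wide_deg) / 2))

# e.g. a 77-degree wide lens vs. a 45-degree tele lens:
# tele_crop_fraction(77, 45) ~= 0.52, so stereo matching (and hence disparity-based
# depth) is only available inside roughly the central half of the wide frame.
```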

AMNet: Deep Atrous Multiscale Stereo Disparity Estimation Networks

no code implementations • 19 Apr 2019 • Xianzhi Du, Mostafa El-Khamy, Jungwon Lee

A stacked atrous multiscale network is proposed to aggregate rich multiscale contextual information from the cost volume which allows for estimating the disparity with high accuracy at multiple scales.

Disparity Estimation • Stereo Disparity Estimation • +2
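A generic atrous multiscale module in the spirit of the abstract: parallel dilated convolutions at several rates, concatenated and fused. It is shown in 2D for brevity (stereo cost volumes are often processed with 3D convolutions) and is not the paper's exact block.

```python
import torch
import torch.nn as nn

class AtrousMultiscaleBlock(nn.Module):
    """Parallel 3x3 convs with increasing dilation rates, fused by a 1x1 conv."""
    def __init__(self, channels: int, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in rates)
        self.fuse = nn.Conv2d(channels * len(rates), channels, 1)

    def forward(self, cost: torch.Tensor) -> torch.Tensor:
        # Each branch sees a different receptive field over the cost volume slice.
        return self.fuse(torch.cat([b(cost) for b in self.branches], dim=1))
```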

Fused Deep Neural Networks for Efficient Pedestrian Detection

no code implementations • 2 May 2018 • Xianzhi Du, Mostafa El-Khamy, Vlad I. Morariu, Jungwon Lee, Larry Davis

The classification system further classifies the generated candidates based on opinions of multiple deep verification networks and a fusion network which utilizes a novel soft-rejection fusion method to adjust the confidence in the detection results.

Ensemble Learning • General Classification • +2
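One way to read the soft-rejection fusion idea: each verification network scales a candidate's confidence up or down rather than hard-vetoing it, with a floor so no single network can eliminate a detection. The constants below are illustrative assumptions, not the paper's exact values.

```python
def soft_rejection_fuse(det_conf: float, verifier_scores: list[float],
                        accept: float = 0.7, floor: float = 0.1) -> float:
    """Fuse a detector confidence with several verifier scores by soft scaling."""
    conf = det_conf
    for s in verifier_scores:
        # A confident verifier (s >= accept) boosts the candidate; a doubtful one
        # attenuates it, but never by more than the floor factor.
        conf *= max(s / accept, floor)
    return conf

# print(soft_rejection_fuse(0.9, [0.8, 0.3]))  # boosted by one verifier, damped by another
```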

Boundary-sensitive Network for Portrait Segmentation

no code implementations • 22 Dec 2017 • Xianzhi Du, Xiaolong Wang, Dawei Li, Jingwen Zhu, Serafettin Tasci, Cameron Upright, Stephen Walsh, Larry Davis

Compared to the general semantic segmentation problem, portrait segmentation places a higher precision requirement on the boundary area.

Attribute • Image Segmentation • +3

Fused DNN: A deep neural network fusion approach to fast and robust pedestrian detection

no code implementations • 11 Oct 2016 • Xianzhi Du, Mostafa El-Khamy, Jungwon Lee, Larry S. Davis

A single-shot deep convolutional network is trained as an object detector to generate all possible pedestrian candidates of different sizes and occlusions.

Pedestrian Detection • Semantic Segmentation
