Search Results for author: Yuanhan Zhang

Found 21 papers, 16 papers with code

WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning

no code implementations • 6 May 2024 • Yuanhan Zhang, Kaichen Zhang, Bo Li, Fanyi Pu, Christopher Arif Setiadharma, Jingkang Yang, Ziwei Liu

Multimodal information, together with our knowledge, help us to understand the complex and dynamic world.

Multiple-choice Video Understanding +1

Paper
Add Code

Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward

1 code implementation • 1 Apr 2024 • Ruohong Zhang, Liangke Gui, Zhiqing Sun, Yihao Feng, Keyang Xu, Yuanhan Zhang, Di Fu, Chunyuan Li, Alexander Hauptmann, Yonatan Bisk, Yiming Yang

Preference modeling techniques, such as direct preference optimization (DPO), has shown effective in enhancing the generalization abilities of large language model (LLM).

Instruction Following Language Modelling +3

Paper
Code

VBench: Comprehensive Benchmark Suite for Video Generative Models

1 code implementation • 29 Nov 2023 • Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, LiMin Wang, Dahua Lin, Yu Qiao, Ziwei Liu

We will open-source VBench, including all prompts, evaluation methods, generated videos, and human preference annotations, and also include more video generation models in VBench to drive forward the field of video generation.

Image Generation Video Generation

294

Paper
Code

OtterHD: A High-Resolution Multi-modality Model

1 code implementation • 7 Nov 2023 • Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, Ziwei Liu

In this paper, we present OtterHD-8B, an innovative multimodal model evolved from Fuyu-8B, specifically engineered to interpret high-resolution visual inputs with granular precision.

Ranked #86 on Visual Question Answering on MM-Vet

Visual Question Answering

3,461

Paper
Code

Multimodal Foundation Models for Zero-shot Animal Species Recognition in Camera Trap Images

no code implementations • 2 Nov 2023 • Zalan Fabian, Zhongqi Miao, Chunyuan Li, Yuanhan Zhang, Ziwei Liu, Andrés Hernández, Andrés Montes-Rojas, Rafael Escucha, Laura Siabatto, Andrés Link, Pablo Arbeláez, Rahul Dodhia, Juan Lavista Ferres

In particular, we instruction tune vision-language models to generate detailed visual descriptions of camera trap images using similar terminology to experts.

Paper
Add Code

Octopus: Embodied Vision-Language Programmer from Environmental Feedback

1 code implementation • 12 Oct 2023 • Jingkang Yang, Yuhao Dong, Shuai Liu, Bo Li, Ziyue Wang, Chencheng Jiang, Haoran Tan, Jiamu Kang, Yuanhan Zhang, Kaiyang Zhou, Ziwei Liu

Large vision-language models (VLMs) have achieved substantial progress in multimodal perception and reasoning.

Decision Making

234

Paper
Code

MMBench: Is Your Multi-modal Model an All-around Player?

3 code implementations • 12 Jul 2023 • YuAn Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, Dahua Lin

In response to these challenges, we propose MMBench, a novel multi-modality benchmark.

Ranked #1 on Visual Question Answering on MMBench

Visual Question Answering

2,642

Paper
Code

FunQA: Towards Surprising Video Comprehension

1 code implementation • 26 Jun 2023 • Binzhu Xie, Sicheng Zhang, Zitang Zhou, Bo Li, Yuanhan Zhang, Jack Hessel, Jingkang Yang, Ziwei Liu

Surprising videos, such as funny clips, creative performances, or visual illusions, attract significant attention.

Question Answering Text Generation +3

Paper
Code

MIMIC-IT: Multi-Modal In-Context Instruction Tuning

2 code implementations • 8 Jun 2023 • Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, Ziwei Liu

We release the MIMIC-IT dataset, instruction-response collection pipeline, benchmarks, and the Otter model.

Ranked #88 on Visual Question Answering on MM-Vet

In-Context Learning Visual Question Answering

3,461

Paper
Code

Learning without Forgetting for Vision-Language Models

no code implementations • 30 May 2023 • Da-Wei Zhou, Yuanhan Zhang, Jingyi Ning, Han-Jia Ye, De-Chuan Zhan, Ziwei Liu

While traditional CIL methods focus on visual information to grasp core features, recent advances in Vision-Language Models (VLM) have shown promising capabilities in learning generalizable representations with the aid of textual information.

Class Incremental Learning Incremental Learning

Paper
Add Code

Latent Distribution Adjusting for Face Anti-Spoofing

2 code implementations • 16 May 2023 • Qinghong Sun, Zhenfei Yin, Yichao Wu, Yuanhan Zhang, Jing Shao

In this work, we propose a unified framework called Latent Distribution Adjusting (LDA) with properties of latent, discriminative, adaptive, generic to improve the robustness of the FAS model by adjusting complex data distribution with multiple prototypes.

Face Anti-Spoofing Prototype Selection

200

Paper
Code

Otter: A Multi-Modal Model with In-Context Instruction Tuning

1 code implementation • 5 May 2023 • Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, Ziwei Liu

Large language models (LLMs) have demonstrated significant universal capabilities as few/zero-shot learners in various tasks due to their pre-training on vast amounts of text data, as exemplified by GPT-3, which boosted to InstrctGPT and ChatGPT, effectively following natural language instructions to accomplish real-world tasks.

Ranked #8 on Visual Question Answering on BenchLMM

In-Context Learning Instruction Following +2

3,461

Paper
Code

What Makes Good Examples for Visual In-Context Learning?

1 code implementation • NeurIPS 2023 • Yuanhan Zhang, Kaiyang Zhou, Ziwei Liu

To overcome the problem, we propose a prompt retrieval framework to automate the selection of in-context examples.

In-Context Learning Retrieval

155

Paper
Code

3D Point Cloud Pre-training with Knowledge Distillation from 2D Images

no code implementations • 17 Dec 2022 • Yuan YAO, Yuanhan Zhang, Zhenfei Yin, Jiebo Luo, Wanli Ouyang, Xiaoshui Huang

The recent success of pre-trained 2D vision models is mostly attributable to learning from large-scale datasets.

Concept Alignment Knowledge Distillation +6

Paper
Add Code

On-Device Domain Generalization

2 code implementations • 15 Sep 2022 • Kaiyang Zhou, Yuanhan Zhang, Yuhang Zang, Jingkang Yang, Chen Change Loy, Ziwei Liu

Another interesting observation is that the teacher-student gap on out-of-distribution data is bigger than that on in-distribution data, which highlights the capacity mismatch issue as well as the shortcoming of KD.

Data Augmentation Domain Generalization +2

255

Paper
Code

Benchmarking Omni-Vision Representation through the Lens of Visual Realms

1 code implementation • 14 Jul 2022 • Yuanhan Zhang, Zhenfei Yin, Jing Shao, Ziwei Liu

We benchmark ReCo and other advances in omni-vision representation studies that are different in architectures (from CNNs to transformers) and in learning paradigms (from supervised learning to self-supervised learning) on OmniBenchmark.

Benchmarking Contrastive Learning +2

105

Paper
Code

Neural Prompt Search

1 code implementation • 9 Jun 2022 • Yuanhan Zhang, Kaiyang Zhou, Ziwei Liu

The size of vision models has grown exponentially over the last few years, especially after the emergence of Vision Transformer.

Ranked #1 on Image Classification on OmniBenchmark (using extra training data)

Few-Shot Learning Image Classification +3

203

Paper
Code

Robust Face Anti-Spoofing with Dual Probabilistic Modeling

no code implementations • 27 Apr 2022 • Yuanhan Zhang, Yichao Wu, Zhenfei Yin, Jing Shao, Ziwei Liu

In this work, we attempt to fill this gap by automatically addressing the noise problem from both label and data perspectives in a probabilistic manner.

Face Anti-Spoofing