Search Results for author: Yongfei Liu

Found 17 papers, 9 papers with code

ViTAR: Vision Transformer with Any Resolution

no code implementations • 27 Mar 2024 • Qihang Fan, Quanzeng You, Xiaotian Han, Yongfei Liu, Yunzhe Tao, Huaibo Huang, Ran He, Hongxia Yang

Firstly, we propose a novel module for dynamic resolution adjustment, designed with a single Transformer block, specifically to achieve highly efficient incremental token integration.

Self-Supervised Learning Semantic Segmentation

Paper
Add Code

InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding

no code implementations • 3 Mar 2024 • Haogeng Liu, Quanzeng You, Xiaotian Han, Yiqi Wang, Bohan Zhai, Yongfei Liu, Yunzhe Tao, Huaibo Huang, Ran He, Hongxia Yang

Multimodal Large Language Models (MLLMs) have experienced significant advancements recently.

Ranked #41 on Visual Question Answering on MM-Vet

Visual Question Answering

Paper
Add Code

Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning

no code implementations • 10 Jan 2024 • Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao, Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, Hongxia Yang

In this survey, we comprehensively review the existing evaluation protocols of multimodal reasoning, categorize and illustrate the frontiers of MLLMs, introduce recent trends in applications of MLLMs on reasoning-intensive tasks, and finally discuss current practices and future directions.

Multimodal Reasoning

Paper
Add Code

Improving In-Context Learning in Diffusion Models with Visual Context-Modulated Prompts

no code implementations • 3 Dec 2023 • Tianqi Chen, Yongfei Liu, Zhendong Wang, Jianbo Yuan, Quanzeng You, Hongxia Yang, Mingyuan Zhou

In light of the remarkable success of in-context learning in large language models, its potential extension to the vision domain, particularly with visual foundation models like Stable Diffusion, has sparked considerable interest.

In-Context Learning

Paper
Add Code

Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis

no code implementations • 28 Nov 2023 • Xiaohui Chen, Yongfei Liu, Yingxiang Yang, Jianbo Yuan, Quanzeng You, Li-Ping Liu, Hongxia Yang

Recent advancements in text-to-image (T2I) generative models have shown remarkable capabilities in producing diverse and imaginative visuals based on text prompts.

Image Generation

Paper
Add Code

InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models

no code implementations • 20 Nov 2023 • Xiaotian Han, Quanzeng You, Yongfei Liu, Wentao Chen, Huangjie Zheng, Khalil Mrini, Xudong Lin, Yiqi Wang, Bohan Zhai, Jianbo Yuan, Heng Wang, Hongxia Yang

To mitigate this issue, we manually curate a benchmark dataset specifically designed for MLLMs, with a focus on complex reasoning tasks.

Paper
Add Code

Grounded Image Text Matching with Mismatched Relation Reasoning

no code implementations • ICCV 2023 • Yu Wu, Yana Wei, Haozhe Wang, Yongfei Liu, Sibei Yang, Xuming He

This paper introduces Grounded Image Text Matching with Mismatched Relation (GITM-MR), a novel visual-linguistic joint task that evaluates the relation understanding capabilities of transformer-based pre-trained models.

Image-text matching Relation +2

Paper
Add Code

HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models

1 code implementation • CVPR 2023 • Shan Ning, Longtian Qiu, Yongfei Liu, Xuming He

In detail, we first introduce a novel interaction decoder to extract informative regions in the visual feature map of CLIP via a cross-attention mechanism, which is then fused with the detection backbone by a knowledge integration block for more accurate human-object pair detection.

Ranked #8 on Human-Object Interaction Detection on V-COCO

Decoder Human-Object Interaction Detection +3

Paper
Code

Weakly-supervised HOI Detection via Prior-guided Bi-level Representation Learning

no code implementations • 2 Mar 2023 • Bo Wan, Yongfei Liu, Desen Zhou, Tinne Tuytelaars, Xuming He

Human object interaction (HOI) detection plays a crucial role in human-centric scene understanding and serves as a fundamental building-block for many vision tasks.

Human-Object Interaction Detection Knowledge Distillation +3

Paper
Add Code

VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers

1 code implementation • CVPR 2022 • Estelle Aflalo, Meng Du, Shao-Yen Tseng, Yongfei Liu, Chenfei Wu, Nan Duan, Vasudev Lal

Breakthroughs in transformer-based models have revolutionized not only the NLP field, but also vision and multimodal systems.

Question Answering Visual Commonsense Reasoning +1

Paper
Code

Cascaded Sparse Feature Propagation Network for Interactive Segmentation

1 code implementation • 10 Mar 2022 • Chuyu Zhang, Chuanyang Hu, Hui Ren, Yongfei Liu, Xuming He

We aim to tackle the problem of point-based interactive segmentation, in which the key challenge is to propagate the user-provided annotations to unlabeled regions efficiently.

Ranked #4 on Interactive Segmentation on SBD

Foreground Segmentation Interactive Segmentation +2

Paper
Code

KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation

1 code implementation • Findings (NAACL) 2022 • Yongfei Liu, Chenfei Wu, Shao-Yen Tseng, Vasudev Lal, Xuming He, Nan Duan

Self-supervised vision-and-language pretraining (VLP) aims to learn transferable multi-modal representations from large-scale image-text data and to achieve strong performances on a broad scope of vision-language tasks after finetuning.

Knowledge Distillation Object +1

Paper
Code

GEM: A General Evaluation Benchmark for Multimodal Tasks

1 code implementation • Findings (ACL) 2021 • Lin Su, Nan Duan, Edward Cui, Lei Ji, Chenfei Wu, Huaishao Luo, Yongfei Liu, Ming Zhong, Taroon Bharti, Arun Sacheti

Comparing with existing multimodal datasets such as MSCOCO and Flicker30K for image-language tasks, YouCook2 and MSR-VTT for video-language tasks, GEM is not only the largest vision-language dataset covering image-language tasks and video-language tasks at the same time, but also labeled in multiple languages.

Paper
Code

Relation-aware Instance Refinement for Weakly Supervised Visual Grounding

1 code implementation • CVPR 2021 • Yongfei Liu, Bo Wan, Lin Ma, Xuming He

Visual grounding, which aims to build a correspondence between visual objects and their language entities, plays a key role in cross-modal scene understanding.

Object Relation +3

Paper
Code

Part-aware Prototype Network for Few-shot Semantic Segmentation

2 code implementations • ECCV 2020 • Yongfei Liu, Xiangyi Zhang, Songyang Zhang, Xuming He

In this paper, we propose a novel few-shot semantic segmentation framework based on the prototype representation.

Ranked #3 on Few-Shot Semantic Segmentation on Pascal5i

Few-Shot Semantic Segmentation Object +2

122

Paper
Code

Learning Cross-modal Context Graph for Visual Grounding

2 code implementations • 20 Nov 2019 • Yongfei Liu, Bo Wan, Xiaodan Zhu, Xuming He

To address their limitations, this paper proposes a language-guided graph representation to capture the global context of grounding entities and their relations, and develop a cross-modal graph matching strategy for the multiple-phrase visual grounding task.

Graph Matching Visual Grounding

Paper
Code

Pose-aware Multi-level Feature Network for Human Object Interaction Detection

1 code implementation • ICCV 2019 • Bo Wan, Desen Zhou, Yongfei Liu, Rongjie Li, Xuming He

Reasoning human object interactions is a core problem in human-centric scene understanding and detecting such relations poses a unique challenge to vision systems due to large variations in human-object configurations, multiple co-occurring relation instances and subtle visual difference between relation categories.

Human-Object Interaction Detection Object +2

Paper
Code

Cannot find the paper you are looking for? You can Submit a new open access paper.