no code implementations • 16 Feb 2024 • Jihyung Kil, Farideh Tavazoee, Dongyeop Kang, Joo-Kyung Kim
II-MMR then analyzes this path to identify different reasoning cases in current VQA benchmarks by estimating how many hops and what types (i.e., visual or beyond-visual) of reasoning are required to answer the question.
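The idea of categorizing a reasoning path by hop count and reasoning type can be illustrated with a toy sketch. This is purely hypothetical and not II-MMR's actual implementation: it splits a generated rationale into steps and labels each step with an assumed keyword heuristic.

```python
# Illustrative sketch only -- not the paper's method. The cue list and the
# sentence-splitting heuristic are assumptions made for this example.
VISUAL_CUES = ("see", "color", "shape", "left", "right", "wearing", "holding")

def analyze_reasoning_path(path: str) -> dict:
    """Estimate the number of reasoning hops (one per step) and whether
    each step needs visual or beyond-visual (e.g., commonsense) reasoning."""
    steps = [s.strip() for s in path.split(".") if s.strip()]
    types = ["visual" if any(cue in s.lower() for cue in VISUAL_CUES)
             else "beyond-visual"
             for s in steps]
    return {"hops": len(steps), "types": types}

result = analyze_reasoning_path(
    "The man is holding an umbrella. Umbrellas are used when it rains."
)
# A two-hop path: one visual step, one beyond-visual step.
```

A real system would use a model rather than keywords, but the output structure (hop count plus per-step reasoning type) mirrors the analysis described in the abstract.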
no code implementations • 6 Feb 2024 • Jihyung Kil, Chan Hee Song, Boyuan Zheng, Xiang Deng, Yu Su, Wei-Lun Chao
Automatic web navigation aims to build a web agent that can follow language instructions to execute complex and diverse tasks on real-world websites.
1 code implementation • 3 Jan 2024 • Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su
The recent development of large multimodal models (LMMs), especially GPT-4V(ision) and Gemini, has been quickly expanding the capability boundaries of multimodal models beyond traditional tasks like image captioning and visual question answering.
no code implementations • ICCV 2023 • Jihyung Kil, Soravit Changpinyo, Xi Chen, Hexiang Hu, Sebastian Goodman, Wei-Lun Chao, Radu Soricut
The ability to recognize and reason about text embedded in visual inputs is often lacking in vision-and-language (V&L) models, perhaps because V&L pre-training methods have often failed to include such an ability in their training objective.
1 code implementation • CVPR 2022 • Chan Hee Song, Jihyung Kil, Tai-Yu Pan, Brian M. Sadler, Wei-Lun Chao, Yu Su
We study the problem of developing autonomous agents that can follow human instructions to infer and perform a sequence of actions to complete the underlying task.
1 code implementation • EMNLP 2021 • Jihyung Kil, Cheng Zhang, Dong Xuan, Wei-Lun Chao
We found that many of the "unknowns" to the learned VQA model are indeed "known" in the dataset implicitly.
1 code implementation • NAACL 2021 • Jihyung Kil, Wei-Lun Chao
Zero-shot learning aims to recognize unseen objects using their semantic representations.
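The core mechanism behind recognizing unseen objects from semantic representations can be sketched in a few lines. This is a generic illustration of the zero-shot setup, not the paper's specific model; the attribute vectors below are made up for the example.

```python
import math

# Hypothetical "semantic representations" (e.g., attribute vectors) for
# three classes; classes 1 and 2 are assumed to have no training images.
CLASS_SEMANTICS = [
    [1.0, 0.0, 1.0],  # class 0 (seen)
    [0.0, 1.0, 1.0],  # class 1 (unseen)
    [1.0, 1.0, 0.0],  # class 2 (unseen)
]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_predict(image_embedding):
    """Assign the class whose semantic vector is most similar to the
    image embedding. Unseen classes remain predictable because their
    semantics are known even without any training images."""
    sims = [cosine(image_embedding, sem) for sem in CLASS_SEMANTICS]
    return max(range(len(sims)), key=sims.__getitem__)
```

An embedding that aligns with class 1's attributes is labeled class 1 even though that class was never observed during training; this similarity-in-semantic-space step is what makes the recognition "zero-shot".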