Search Results for author: Jialin Wu

Found 23 papers, 7 papers with code

GeomVerse: A Systematic Evaluation of Large Models for Geometric Reasoning

no code implementations 19 Dec 2023 Mehran Kazemi, Hamidreza Alvari, Ankit Anand, Jialin Wu, Xi Chen, Radu Soricut

In this paper, we evaluate the reasoning capabilities of VLMs along various axes through the lens of geometry problems.

Mathematical Reasoning

CausalLM is not optimal for in-context learning

1 code implementation 14 Aug 2023 Nan Ding, Tomer Levinboim, Jialin Wu, Sebastian Goodman, Radu Soricut

Recent empirical evidence indicates that transformer-based in-context learning performs better with a prefix language model (prefixLM), in which all in-context samples can attend to each other, than with a causal language model (causalLM), whose auto-regressive attention prevents in-context samples from attending to future samples.

In-Context Learning Language Modelling
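The prefixLM vs. causalLM distinction contrasted in the abstract comes down to the shape of the attention mask. As a rough illustration (function names and the NumPy formulation are mine, not the paper's), the two masks can be built like this:

```python
import numpy as np

def causal_mask(n):
    # causalLM: token i may attend only to tokens j <= i
    return np.tril(np.ones((n, n), dtype=bool))

def prefix_mask(n, prefix_len):
    # prefixLM: tokens inside the prefix (the in-context samples)
    # attend to each other bidirectionally; tokens after the prefix
    # keep the usual causal restriction
    m = causal_mask(n)
    m[:prefix_len, :prefix_len] = True
    return m

# 5 tokens, the first 3 forming the in-context prefix
print(prefix_mask(5, 3).astype(int))
```

Under the causal mask, an early in-context sample can never see a later one; the prefix mask removes exactly that restriction within the prefix block, which is the difference the paper analyzes.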

Entity-Focused Dense Passage Retrieval for Outside-Knowledge Visual Question Answering

no code implementations 18 Oct 2022 Jialin Wu, Raymond J. Mooney

To address these issues, we propose an Entity-Focused Retrieval (EnFoRe) model that provides stronger supervision during training and recognizes question-relevant entities to help retrieve more specific knowledge.

Passage Retrieval Question Answering +2

Possibilities and Implications of the Multi-AI Competition

no code implementations 1 Sep 2022 Jialin Wu

The possibility of super-AIs taking over the world has been intensively studied by numerous scholars.

Breaking Down Questions for Outside-Knowledge VQA

no code implementations 29 Sep 2021 Jialin Wu, Ray Mooney

While general Visual Question Answering (VQA) focuses on querying visual content within an image, there is a recent trend towards Knowledge-Based VQA (KB-VQA) where a system needs to link some aspects of the question to different types of knowledge beyond the image, such as commonsense concepts and factual information.

Question Answering Visual Question Answering

Multi-Modal Answer Validation for Knowledge-Based VQA

1 code implementation 23 Mar 2021 Jialin Wu, Jiasen Lu, Ashish Sabharwal, Roozbeh Mottaghi

Instead of searching for the answer in a vast collection of often irrelevant facts as most existing approaches do, MAVEx aims to learn how to extract relevant knowledge from noisy sources, which knowledge source to trust for each answer candidate, and how to validate the candidate using that source.

Question Answering Retrieval +1

Visual Question Answering based on Local-Scene-Aware Referring Expression Generation

no code implementations 22 Jan 2021 Jung-Jun Kim, Dong-Gyu Lee, Jialin Wu, Hong-Gyu Jung, Seong-Whan Lee

We quantitatively and qualitatively evaluated the proposed method on the VQA v2 dataset and compared it with state-of-the-art methods in terms of answer prediction.

Question Answering Referring Expression +2

CoNAN: A Complementary Neighboring-based Attention Network for Referring Expression Generation

no code implementations COLING 2020 Jungjun Kim, Hanbin Ko, Jialin Wu

These highly related neighbors are determined by an attentional ranking module and serve as complementary features, highlighting the discriminating aspects of the target object.

Object Referring Expression +1

Improving VQA and its Explanations by Comparing Competing Explanations

no code implementations 28 Jun 2020 Jialin Wu, Liyan Chen, Raymond J. Mooney

Most recent state-of-the-art Visual Question Answering (VQA) systems are opaque black boxes that are only trained to fit the answer distribution given the question and visual content.

Question Answering Visual Question Answering

Hidden State Guidance: Improving Image Captioning using An Image Conditioned Autoencoder

no code implementations 31 Oct 2019 Jialin Wu, Raymond J. Mooney

Most RNN-based image captioning models receive supervision on the output words to mimic human captions.

Image Captioning Sentence

Self-Critical Reasoning for Robust Visual Question Answering

1 code implementation NeurIPS 2019 Jialin Wu, Raymond J. Mooney

Visual Question Answering (VQA) deep-learning systems tend to capture superficial statistical correlations in the training data because of strong language priors and fail to generalize to test data with a significantly different question-answer (QA) distribution.

Question Answering Visual Question Answering

Image Score: How to Select Useful Samples

no code implementations ICLR 2019 Simiao Zuo, Jialin Wu

There have long been debates on how we can interpret neural networks and understand the decisions our models make.

Decision Making

Joint Image Captioning and Question Answering

no code implementations 22 May 2018 Jialin Wu, Zeyuan Hu, Raymond J. Mooney

Answering visual questions requires acquiring everyday commonsense knowledge and modeling the semantic connections among different parts of an image, which is difficult for VQA systems to learn with only answers as supervision.

Image Captioning Question Answering +1

Dynamic Filtering with Large Sampling Field for ConvNets

no code implementations ECCV 2018 Jialin Wu, Dai Li, Yu Yang, Chandrajit Bajaj, Xiangyang Ji

We propose a dynamic filtering strategy with large sampling field for ConvNets (LS-DFN), where the position-specific kernels learn from not only the identical position but also multiple sampled neighbor regions.

Object Detection +3
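The core LS-DFN idea, position-specific kernels that aggregate not only the identical position but also sampled neighbor regions, can be caricatured in a few lines. This is a toy 1-D sketch under my own assumptions (the offsets, weight generator, and shapes are illustrative, not the paper's formulation):

```python
import numpy as np

def dynamic_filter_1d(x, weight_gen, neighbors=(-2, -1, 0, 1, 2)):
    """Toy 1-D dynamic filtering: each position gets its own kernel,
    predicted from the feature at that position, and aggregates values
    from a set of sampled neighbor offsets (the 'large sampling field')."""
    n = len(x)
    out = np.zeros(n)
    for i in range(n):
        w = weight_gen(x[i])                 # position-specific kernel weights
        for w_k, off in zip(w, neighbors):
            j = min(max(i + off, 0), n - 1)  # clamp offsets at the borders
            out[i] += w_k * x[j]
    return out

def weight_gen(v, k=5):
    # hypothetical weight generator: softmax over a linear map of the feature
    logits = v * np.arange(k)
    e = np.exp(logits - logits.max())
    return e / e.sum()

y = dynamic_filter_1d(np.array([0.0, 1.0, 2.0, 3.0]), weight_gen)
```

The contrast with an ordinary convolution is that here the kernel weights are a function of the input at each position rather than fixed learned parameters shared across all positions.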

Action Recognition with Joint Attention on Multi-Level Deep Features

no code implementations 9 Jul 2016 Jialin Wu, Gu Wang, Wukui Yang, Xiangyang Ji

We propose a novel deep supervised neural network for the task of action recognition in videos, which implicitly takes advantage of visual tracking and shares the robustness of both deep Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN).

Action Recognition In Videos Temporal Action Localization +1
