no code implementations • Findings (ACL) 2022 • Minji Seo, YeonJoon Jung, Seungtaek Choi, Seung-won Hwang, Bei Liu
We study event understanding as a critical step towards visual commonsense tasks. We argue that current object-based event understanding is purely likelihood-based, leading to incorrect event predictions due to biased correlations between events and objects. We propose to mitigate such biases with do-calculus, proposed in causality research, and to overcome its limited robustness through an optimized aggregation with association-based prediction. We show the effectiveness of our approach, intrinsically by comparing our generated events with ground-truth event annotations, and extrinsically on downstream commonsense tasks.
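As a toy illustration of the aggregation idea above, the sketch below blends an association-based event distribution with a do-calculus-based (interventional) one. The function name, the fixed mixing weight `alpha`, and the numbers are illustrative assumptions, not the authors' implementation; in the paper the aggregation is optimized rather than fixed.

```python
import numpy as np

# Hypothetical sketch: blend an association-based event distribution
# P(event | objects) with a causal, do-calculus-based distribution
# P(event | do(objects)). `alpha` stands in for the optimized
# aggregation weight; everything here is illustrative.

def aggregate_scores(p_assoc: np.ndarray,
                     p_causal: np.ndarray,
                     alpha: float = 0.5) -> np.ndarray:
    """Blend associational and interventional event distributions."""
    p = alpha * p_causal + (1.0 - alpha) * p_assoc
    return p / p.sum()  # renormalize over the event vocabulary

# Toy example with three candidate events.
p_assoc = np.array([0.7, 0.2, 0.1])   # biased by object co-occurrence
p_causal = np.array([0.3, 0.5, 0.2])  # after causal adjustment
print(aggregate_scores(p_assoc, p_causal, alpha=0.6))
```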
1 code implementation • 14 Oct 2023 • Hang Shao, Bei Liu, Bo Xiao, Ke Zeng, Guanglu Wan, Yanmin Qian
Various Large Language Models (LLMs) from the Generative Pretrained Transformer (GPT) family have achieved outstanding performance in a wide range of text generation tasks.
no code implementations • 22 Aug 2023 • Yuchong Sun, Bei Liu, Xu Chen, Ruihua Song, Jianlong Fu
Experiments on ViCo-20k show that the comments generated by our ViCo model exhibit the best performance in terms of both quantitative and qualitative results, particularly when engagement is considered.
no code implementations • ICCV 2023 • Seogkyu Jeon, Bei Liu, Pilhyeon Lee, Kibeom Hong, Jianlong Fu, Hyeran Byun
Due to the absence of data, the textual description of the target domain and vision-language models, e.g., CLIP, are utilized to effectively guide the generator.
no code implementations • 18 Jul 2023 • Kai Katsumata, Duc Minh Vo, Bei Liu, Hideki Nakayama
The exploration of the latent space in StyleGANs and GAN inversion exemplify impressive real-world image editing, yet the trade-off between reconstruction quality and editing quality remains an open problem.
no code implementations • ICCV 2023 • Yi-Syuan Chen, Yun-Zhu Song, Cheng Yu Yeo, Bei Liu, Jianlong Fu, Hong-Han Shuai
To this end, we raise a question: "How can we enable in-context learning without relying on the intrinsic in-context ability of large language models?"
no code implementations • 9 Jun 2023 • Jiange Yang, Wenhui Tan, Chuhao Jin, Keling Yao, Bei Liu, Jianlong Fu, Ruihua Song, Gangshan Wu, LiMin Wang
In this paper, we propose a novel paradigm that effectively leverages language-reasoned segmentation masks generated by internet-scale foundation models to condition robot manipulation tasks.
no code implementations • 30 May 2023 • Chuhao Jin, Wenhui Tan, Jiange Yang, Bei Liu, Ruihua Song, LiMin Wang, Jianlong Fu
We propose a novel framework for learning high-level cognitive capabilities in robot manipulation tasks, such as making a smiley face using building blocks.
no code implementations • 18 May 2023 • Hang Shao, Wei Wang, Bei Liu, Xun Gong, Haoyu Wang, Yanmin Qian
Due to the rapid development of computing hardware and the dramatic growth of data, pre-trained speech models such as Whisper have significantly improved the performance of speech recognition tasks.
1 code implementation • CVPR 2023 • Ludan Ruan, Yiyang Ma, Huan Yang, Huiguo He, Bei Liu, Jianlong Fu, Nicholas Jing Yuan, Qin Jin, Baining Guo
To generate joint audio-video pairs, we propose a novel Multi-Modal Diffusion model (i.e., MM-Diffusion) with two coupled denoising autoencoders.
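The coupled-denoiser idea can be sketched roughly as follows: each modality predicts its own noise while attending to the other modality's features. This is a minimal toy, not the released MM-Diffusion code; all dimensions, module choices, and the attention wiring are assumptions.

```python
import torch
import torch.nn as nn

# Toy sketch of two coupled denoisers exchanging cross-modal context.
class CoupledDenoiser(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.audio_enc = nn.Linear(dim, dim)
        self.video_enc = nn.Linear(dim, dim)
        self.cross_av = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.cross_va = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.audio_out = nn.Linear(dim, dim)  # predicts audio noise
        self.video_out = nn.Linear(dim, dim)  # predicts video noise

    def forward(self, audio, video):
        a, v = self.audio_enc(audio), self.video_enc(video)
        a2, _ = self.cross_av(a, v, v)  # audio queries video context
        v2, _ = self.cross_va(v, a, a)  # video queries audio context
        return self.audio_out(a + a2), self.video_out(v + v2)

model = CoupledDenoiser()
audio = torch.randn(2, 16, 64)  # (batch, audio tokens, dim)
video = torch.randn(2, 8, 64)   # (batch, video tokens, dim)
eps_audio, eps_video = model(audio, video)
print(eps_audio.shape, eps_video.shape)
```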
1 code implementation • 12 Oct 2022 • Yuchong Sun, Hongwei Xue, Ruihua Song, Bei Liu, Huan Yang, Jianlong Fu
Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks.
Ranked #2 on Video Retrieval on QuerYD (using extra training data)
1 code implementation • 14 Sep 2022 • Hongwei Xue, Yuchong Sun, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang Li, Jiebo Luo
and 2) how to mitigate the impact of these factors?
Ranked #2 on Video Retrieval on MSR-VTT-1kA (using extra training data)
1 code implementation • 7 Sep 2022 • Yiyang Ma, Huan Yang, Bei Liu, Jianlong Fu, Jiaying Liu
To address this issue, we propose a Prompt-based Cross-Modal Generation Framework (PCM-Frame) that leverages two powerful pre-trained models, CLIP and StyleGAN.
1 code implementation • 11 Aug 2022 • Tiankai Hang, Huan Yang, Bei Liu, Jianlong Fu, Xin Geng, Baining Guo
Specifically, we propose a recurrent motion generator to extract a series of semantic and motion information from the language and feed it along with visual information to a pre-trained StyleGAN to generate high-quality frames.
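A hedged sketch of such a recurrent motion generator appears below: a GRU unrolls a sentence embedding into per-frame latent offsets that would be decoded by a pre-trained generator (omitted here). The dimensions, the GRU choice, and the additive-offset formulation are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Illustrative recurrent motion generator: text embedding -> per-frame
# latent codes. A pre-trained StyleGAN (not shown) would map each
# latent to a frame; this stand-in covers only the recurrent part.
class RecurrentMotionGenerator(nn.Module):
    def __init__(self, text_dim=512, latent_dim=512, n_frames=16):
        super().__init__()
        self.n_frames = n_frames
        self.gru = nn.GRU(text_dim, latent_dim, batch_first=True)
        self.to_latent = nn.Linear(latent_dim, latent_dim)

    def forward(self, text_emb, init_latent):
        # Repeat the sentence embedding as input for every time step.
        steps = text_emb.unsqueeze(1).repeat(1, self.n_frames, 1)
        hidden, _ = self.gru(steps, init_latent.unsqueeze(0))
        # Per-frame offsets applied to the initial latent code.
        return init_latent.unsqueeze(1) + self.to_latent(hidden)

gen = RecurrentMotionGenerator()
latents = gen(torch.randn(2, 512), torch.randn(2, 512))
print(latents.shape)  # (2, 16, 512): one latent per frame
```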
no code implementations • 10 Aug 2022 • Sipeng Zheng, Qi Zhang, Bei Liu, Qin Jin, Jianlong Fu
In this paper, we present our technical report for the Ego4D Natural Language Query challenge at CVPR 2022.
2 code implementations • NeurIPS 2021 • Minghao Chen, Kan Wu, Bolin Ni, Houwen Peng, Bei Liu, Jianlong Fu, Hongyang Chao, Haibin Ling
Vision Transformers have shown great visual representation power in a wide range of vision tasks such as recognition and detection, and have thus attracted fast-growing efforts toward manually designing more effective architectures.
1 code implementation • CVPR 2022 • Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, Baining Guo
To enable VL pre-training, we jointly optimize the HD-VILA model by a hybrid Transformer that learns rich spatiotemporal features, and a multimodal Transformer that enforces interactions of the learned video features with diversified texts.
Ranked #16 on Video Retrieval on MSR-VTT
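The HD-VILA entry above describes a two-stage design: a hybrid Transformer producing spatiotemporal video features, then a multimodal Transformer fusing them with text. The sketch below mirrors that structure only loosely; layer sizes, depths, and plain token concatenation are assumptions, not the HD-VILA implementation.

```python
import torch
import torch.nn as nn

# Loose two-stage sketch: video encoder, then multimodal fusion.
class TwoStageVLP(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        enc = nn.TransformerEncoderLayer(dim, 4, batch_first=True)
        self.video_encoder = nn.TransformerEncoder(enc, num_layers=2)
        fuse = nn.TransformerEncoderLayer(dim, 4, batch_first=True)
        self.multimodal = nn.TransformerEncoder(fuse, num_layers=2)

    def forward(self, video_tokens, text_tokens):
        v = self.video_encoder(video_tokens)        # spatiotemporal features
        joint = torch.cat([v, text_tokens], dim=1)  # concatenate modalities
        return self.multimodal(joint)               # cross-modal interaction

model = TwoStageVLP()
out = model(torch.randn(2, 32, 256), torch.randn(2, 12, 256))
print(out.shape)  # (2, 44, 256)
```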
1 code implementation • 19 Oct 2021 • Yupan Huang, Hongwei Xue, Bei Liu, Yutong Lu
We adopt Transformer as our unified architecture for its strong performance and task-agnostic design.
1 code implementation • 19 Oct 2021 • Yupan Huang, Bei Liu, Jianlong Fu, Yutong Lu
In this work, we demonstrate such an AI creation system to produce both diverse captions and rich images.
no code implementations • 6 Sep 2021 • Hongwei Xue, Bei Liu, Huan Yang, Jianlong Fu, Houqiang Li, Jiebo Luo
To tackle this problem, we propose a model named FGLA to generate high-quality and realistic videos by learning Fine-Grained motion embedding for Landscape Animation.
no code implementations • 10 Aug 2021 • Zhaoyang Zeng, Bei Liu, Jianlong Fu, Hongyang Chao
To solve the partial visual confusion issue, we propose to leverage the context information carried by a context reference, i.e., the concentric bigger box of each region proposal, to perform more accurate region classification and regression.
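The context reference itself is simple to make concrete: a box sharing the proposal's center but scaled up, clamped to the image bounds. In the sketch below, the (x1, y1, x2, y2) box format and the 1.8 scale factor are assumptions for illustration.

```python
# Hedged sketch of a concentric context box for a region proposal.
def concentric_context_box(box, img_w, img_h, ratio=1.8):
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0  # shared center
    half_w = (x2 - x1) * ratio / 2.0
    half_h = (y2 - y1) * ratio / 2.0
    # Clamp the enlarged box to the image bounds.
    return (max(0.0, cx - half_w), max(0.0, cy - half_h),
            min(float(img_w), cx + half_w), min(float(img_h), cy + half_h))

# Example: a 100x100 proposal centered at (150, 150) in a 640x480 image.
print(concentric_context_box((100, 100, 200, 200), 640, 480))
```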
no code implementations • NeurIPS 2021 • Hongwei Xue, Yupan Huang, Bei Liu, Houwen Peng, Jianlong Fu, Houqiang Li, Jiebo Luo
To tackle this, we propose a fully Transformer visual embedding for VLP to better learn visual relation and further promote inter-modal alignment.
3 code implementations • CVPR 2021 • Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, Jianlong Fu
As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics from paired natural languages.
Ranked #5 on Visual Entailment on SNLI-VE val
1 code implementation • 30 Sep 2020 • Han Wu, Wenjie Ruan, Jiangtao Wang, Dingchang Zheng, Bei Liu, Yayuan Gen, Xiangfei Chai, Jian Chen, Kunwei Li, Shaolin Li, Sumi Helal
The black-box nature of machine learning models hinders the deployment of some high-accuracy models in medical diagnosis.
1 code implementation • 2 Apr 2020 • Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, Jianlong Fu
We aim to build a more accurate and thorough connection between image pixels and language semantics directly from image and sentence pairs, instead of using region-based image features as in most recent vision-and-language approaches.
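As a rough illustration of pixel-level (grid-feature) alignment as opposed to region features, the sketch below runs a small CNN over raw pixels, flattens the grid into tokens, and fuses them with text embeddings in a Transformer. None of these module choices come from the paper; they are assumptions.

```python
import torch
import torch.nn as nn

# Toy pixel-text fusion: CNN grid features instead of detector regions.
class PixelTextFusion(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
        )
        layer = nn.TransformerEncoderLayer(dim, 4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, image, text_tokens):
        grid = self.cnn(image)                 # (B, C, H', W')
        pix = grid.flatten(2).transpose(1, 2)  # pixel tokens (B, H'W', C)
        return self.fusion(torch.cat([pix, text_tokens], dim=1))

model = PixelTextFusion()
out = model(torch.randn(2, 3, 64, 64), torch.randn(2, 10, 128))
print(out.shape)  # (2, 74, 128): 64 pixel tokens + 10 text tokens
```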
no code implementations • 24 Nov 2019 • Shizhe Chen, Bei Liu, Jianlong Fu, Ruihua Song, Qin Jin, Pingping Lin, Xiaoyu Qi, Chunting Wang, Jin Zhou
A storyboard is a sequence of images illustrating a story that contains multiple sentences, and it has been a key step in creating different story products.
no code implementations • 29 Oct 2019 • Bei Liu, Zhicheng Huang, Zhaoyang Zeng, Zheyu Chen, Jianlong Fu
We propose to boost VQA by leveraging more powerful feature extractors, improving the representation ability of both visual and text features, and ensembling models.
no code implementations • 19 Oct 2019 • Shi Chenfei, Yan Xue, Chuan Jiang, Hui Tian, Bei Liu
The main contributions of this paper are: firstly, a gastroscopic panorama reconstruction method is developed.
no code implementations • 4 Oct 2019 • Bo Wu, Wen-Huang Cheng, Peiye Liu, Bei Liu, Zhaoyang Zeng, Jiebo Luo
In the SMP Challenge at ACM Multimedia 2019, we introduce a novel prediction task, Temporal Popularity Prediction, which focuses on predicting future interaction or attractiveness (in terms of clicks, views, likes, etc.)
no code implementations • ICCV 2019 • Zhaoyang Zeng, Bei Liu, Jianlong Fu, Hongyang Chao, Lei Zhang
We study weakly-supervised object detection (WSOD), which plays a vital role in relieving human involvement from object-level annotations.
1 code implementation • 11 Sep 2019 • Zhaoyang Zeng, Bei Liu, Jianlong Fu, Hongyang Chao, Lei Zhang
We study weakly-supervised object detection (WSOD), which plays a vital role in relieving human involvement from object-level annotations.
no code implementations • 11 Jul 2019 • Shizhe Chen, Yuqing Song, Yida Zhao, Qin Jin, Zhaoyang Zeng, Bei Liu, Jianlong Fu, Alexander Hauptmann
The overall system achieves state-of-the-art performance on the dense-captioning events in video task, with a 9.91 METEOR score on the challenge testing set.
3 code implementations • 23 Apr 2018 • Bei Liu, Jianlong Fu, Makoto P. Kato, Masatoshi Yoshikawa
Extensive experiments are conducted with 8K images, among which 1.5K images are randomly picked for evaluation.