no code implementations • 5 Mar 2024 • Yuan Gao, Kunyu Shi, Pengkai Zhu, Edouard Belval, Oren Nuriel, Srikar Appalaraju, Shabnam Ghadar, Vijay Mahadevan, Zhuowen Tu, Stefano Soatto
We propose Strongly Supervised pre-training with ScreenShots (S4) - a novel pre-training paradigm for Vision-Language Models using data from large-scale web screenshot rendering.
no code implementations • 4 Mar 2024 • Kunyu Shi, Qi Dong, Luis Goncalves, Zhuowen Tu, Stefano Soatto
Sequence-to-sequence vision-language models are showing promise, but their applicability is limited by their inference latency due to their autoregressive way of generating predictions.
1 code implementation • 11 May 2023 • Zhaoyang Zhang, Yantao Shen, Kunyu Shi, Zhaowei Cai, Jun Fang, Siqi Deng, Hao Yang, Davide Modolo, Zhuowen Tu, Stefano Soatto
We present a vision-language model whose parameters are jointly trained on all tasks and fully shared among multiple heterogeneous tasks which may interfere with each other, resulting in a single model which we named Musketeer.
1 code implementation • CVPR 2020 • Justin Lazarow, Kwonjoon Lee, Kunyu Shi, Zhuowen Tu
Panoptic segmentation requires segments of both "things" (countable object instances) and "stuff" (uncountable and amorphous regions) within a single output.
Ranked #22 on Panoptic Segmentation on COCO test-dev