no code implementations • 6 Apr 2024 • Pei Wang, Zhaowei Cai, Hao Yang, Ashwin Swaminathan, R. Manmatha, Stefano Soatto
Existing unified image segmentation models either employ a unified architecture across multiple tasks but use separate weights tailored to each dataset, or apply a single set of weights to multiple datasets but are limited to a single task.
no code implementations • 3 Apr 2024 • Hao Li, Yang Zou, Ying Wang, Orchid Majumder, Yusheng Xie, R. Manmatha, Ashwin Swaminathan, Zhuowen Tu, Stefano Ermon, Stefano Soatto
On the data scaling side, we show that the quality and diversity of the training set matter more than raw dataset size.
no code implementations • 15 Nov 2023 • Peng Tang, Pengkai Zhu, Tian Li, Srikar Appalaraju, Vijay Mahadevan, R. Manmatha
Based on the multi-exit model, we perform step-level dynamic early exit during inference: at each individual decoding step, the model may decide to use fewer decoder layers based on the confidence of the current layer.
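As a sketch of the step-level early-exit idea (not the paper's implementation; layers, classifier, and threshold are illustrative), each decoding step runs decoder layers one at a time and stops as soon as the classifier's max softmax probability clears a confidence threshold:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step_with_early_exit(hidden, layers, classifier, threshold=0.9):
    """Run decoder layers sequentially; exit once the classifier's
    confidence (max softmax probability) passes the threshold."""
    for depth, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        probs = softmax(classifier(hidden))
        if probs.max() >= threshold:
            return probs.argmax(), depth  # early exit at this depth
    return probs.argmax(), depth          # used all layers

# toy model: near-identity layers and a linear classifier
rng = np.random.default_rng(0)
layers = [lambda h, W=rng.normal(size=(8, 8)) * 0.1 + np.eye(8): h @ W
          for _ in range(6)]
W_cls = rng.normal(size=(8, 4))
token, depth_used = decode_step_with_early_exit(
    rng.normal(size=8), layers, lambda h: h @ W_cls, threshold=0.5)
```

Steps where the model is already confident thus pay for fewer layers, which is where the inference speedup comes from.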
no code implementations • 15 Nov 2023 • Peng Tang, Srikar Appalaraju, R. Manmatha, Yusheng Xie, Vijay Mahadevan
We present Multiple-Question Multiple-Answer (MQMA), a novel approach to text-VQA in encoder-decoder transformer models.
no code implementations • ICCV 2023 • Haofu Liao, Aruni RoyChowdhury, Weijian Li, Ankan Bansal, Yuting Zhang, Zhuowen Tu, Ravi Kumar Satzoda, R. Manmatha, Vijay Mahadevan
We present a new formulation for structured information extraction (SIE) from visually rich documents.
Ranked #2 on Entity Linking on FUNSD
1 code implementation • 2 Jun 2023 • Srikar Appalaraju, Peng Tang, Qi Dong, Nishant Sankaran, Yichu Zhou, R. Manmatha
We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding (VDU).
Ranked #9 on Visual Question Answering (VQA) on DocVQA test (using extra training data)
1 code implementation • CVPR 2023 • Jiang Liu, Hui Ding, Zhaowei Cai, Yuting Zhang, Ravi Kumar Satzoda, Vijay Mahadevan, R. Manmatha
In this work, instead of directly predicting the pixel-level segmentation masks, the problem of referring image segmentation is formulated as sequential polygon generation, and the predicted polygons can be later converted into segmentation masks.
Ranked #1 on Referring Expression Segmentation on ReferIt (using extra training data)
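The conversion step mentioned above, turning predicted polygon vertices into a segmentation mask, can be sketched with a plain even-odd rasterization (a minimal numpy illustration, not the paper's code):

```python
import numpy as np

def polygon_to_mask(vertices, height, width):
    """Rasterize a polygon (list of (x, y) vertices) into a binary mask
    using an even-odd crossing test evaluated at pixel centers."""
    ys, xs = np.mgrid[0:height, 0:width]
    px, py = xs + 0.5, ys + 0.5          # sample at pixel centers
    inside = np.zeros((height, width), dtype=bool)
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        crosses = (y0 > py) != (y1 > py)            # edge spans this row?
        x_int = x0 + (py - y0) * (x1 - x0) / (y1 - y0 + 1e-12)
        inside ^= crosses & (px < x_int)            # toggle parity
    return inside

square = [(2, 2), (8, 2), (8, 8), (2, 8)]
mask = polygon_to_mask(square, 10, 10)
```

Because the model only has to emit a short vertex sequence rather than a dense per-pixel map, the mask itself is recovered by this cheap post-processing step.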
no code implementations • 7 Feb 2023 • Yash Patel, Yusheng Xie, Yi Zhu, Srikar Appalaraju, R. Manmatha
Instead of purely relying on the alignment from the noisy data, this paper proposes a novel loss function termed SimCon, which accounts for intra-modal similarities to determine the appropriate set of positive samples to align.
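To illustrate the idea of using intra-modal similarity to pick the positive set (a hedged sketch of the concept, not the exact SimCon loss; the threshold and temperature are assumptions), texts whose paired images are highly similar to the anchor image can be treated as additional positives in a cross-modal contrastive objective:

```python
import numpy as np

def simcon_style_loss(img_emb, txt_emb, tau=0.07, pos_thresh=0.9):
    """Cross-modal contrastive loss where, besides the matched caption,
    captions of images highly similar to the anchor image (intra-modal
    similarity >= pos_thresh) also count as positives."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / tau            # image-to-text similarities
    intra = img @ img.T                   # intra-modal (image-image)
    positives = intra >= pos_thresh       # always includes the diagonal
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # average log-likelihood over each anchor's positive set
    per_anchor = (positives * log_prob).sum(1) / positives.sum(1)
    return -per_anchor.mean()
```

The point of the intra-modal term is robustness to noisy web pairs: a near-duplicate image with a different caption is no longer forced to be a negative.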
1 code implementation • 15 Nov 2022 • Chih-Hui Ho, Srikar Appalaraju, Bhavan Jasani, R. Manmatha, Nuno Vasconcelos
We present YORO - a multi-modal transformer encoder-only architecture for the Visual Grounding (VG) task.
2 code implementations • 5 Aug 2022 • Roi Ronen, Shahar Tsiper, Oron Anschel, Inbal Lavi, Amir Markovitz, R. Manmatha
In recent years, the dominant paradigm for text spotting is to combine the tasks of text detection and recognition into a single end-to-end framework.
Ranked #6 on Text Spotting on Total-Text
no code implementations • CVPR 2022 • Yair Kittenplon, Inbal Lavi, Sharon Fogel, Yarin Bar, R. Manmatha, Pietro Perona
Text spotting end-to-end methods have recently gained attention in the literature due to the benefits of jointly optimizing the text detection and recognition components.
1 code implementation • CVPR 2022 • Ali Furkan Biten, Ron Litman, Yusheng Xie, Srikar Appalaraju, R. Manmatha
Accounting for this, we propose a single objective pre-training scheme that requires only text and spatial cues.
1 code implementation • ICCV 2021 • Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, R. Manmatha
DocFormer uses text, vision and spatial features and combines them using a novel multi-modal self-attention layer.
Ranked #3 on Document Image Classification on RVL-CDIP
no code implementations • 23 Dec 2020 • Ron Slossberg, Oron Anschel, Amir Markovitz, Ron Litman, Aviad Aberdam, Shahar Tsiper, Shai Mazor, Jon Wu, R. Manmatha
Although the topic of confidence calibration has been an active research area for the last several decades, the case of structured and sequence prediction calibration has been scarcely explored.
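One common calibration tool in this setting is temperature scaling applied per decoding step; the sketch below (an assumption for illustration, not this paper's method) scores a predicted sequence as the product of temperature-scaled per-step max probabilities:

```python
import numpy as np

def sequence_confidence(step_logits, T=1.0):
    """Word-level confidence sketch: temperature-scale each decoding
    step's logits and multiply the per-step max probabilities."""
    conf = 1.0
    for logits in step_logits:
        p = np.exp(logits / T)
        p /= p.sum()
        conf *= p.max()
    return conf
```

Raising T flattens each step's distribution, so an overconfident sequence score shrinks toward chance, which is exactly the knob calibration tunes on held-out data.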
2 code implementations • CVPR 2021 • Aviad Aberdam, Ron Litman, Shahar Tsiper, Oron Anschel, Ron Slossberg, Shai Mazor, R. Manmatha, Pietro Perona
We propose a framework for sequence-to-sequence contrastive learning (SeqCLR) of visual representations, which we apply to text recognition.
1 code implementation • 11 Dec 2020 • Yi Zhu, Xinyu Li, Chunhui Liu, Mohammadreza Zolfaghari, Yuanjun Xiong, Chongruo Wu, Zhi Zhang, Joseph Tighe, R. Manmatha, Mu Li
Video action recognition is one of the representative tasks for video understanding.
no code implementations • 20 Aug 2020 • Minesh Mathew, Ruben Tito, Dimosthenis Karatzas, R. Manmatha, C. V. Jawahar
For Task 1, a new dataset is introduced comprising 50,000 question-answer pairs defined over 12,767 document images.
no code implementations • 30 Apr 2020 • Yi Zhu, Zhongyue Zhang, Chongruo Wu, Zhi Zhang, Tong He, Hang Zhang, R. Manmatha, Mu Li, Alexander Smola
In the case of semantic segmentation, this means that large amounts of pixelwise annotations are required to learn accurate models.
35 code implementations • 19 Apr 2020 • Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Haibin Lin, Zhi Zhang, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, Mu Li, Alexander Smola
It is well known that feature-map attention and multi-path representation are important for visual recognition.
Ranked #8 on Instance Segmentation on COCO test-dev (APM metric)
2 code implementations • CVPR 2020 • Ron Litman, Oron Anschel, Shahar Tsiper, Roee Litman, Shai Mazor, R. Manmatha
The first attention step re-weights visual features from a CNN backbone together with contextual features computed by a BiLSTM layer.
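A minimal numpy sketch of that first re-weighting step (shapes and projection matrices are illustrative assumptions, not the paper's architecture): score each spatial position from its visual feature together with its BiLSTM contextual feature, then scale the visual features by the resulting attention map:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentional_reweight(visual, context, W_v, W_c, w):
    """Score each of T positions from its visual feature (T, Dv) and its
    contextual feature (T, Dc), then re-weight the visual features by
    the softmax-normalized attention map."""
    scores = np.tanh(visual @ W_v + context @ W_c) @ w   # (T,)
    alpha = softmax(scores)
    return alpha[:, None] * visual, alpha
```

Mixing in the BiLSTM context lets the attention map depend on what the surrounding sequence has already disambiguated, not on local appearance alone.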
no code implementations • 12 Feb 2020 • Yash Patel, Srikar Appalaraju, R. Manmatha
The proposed compression model incorporates the salient regions and optimizes on the proposed perceptual similarity metric.
no code implementations • 9 Aug 2019 • Yash Patel, Srikar Appalaraju, R. Manmatha
Recently, there has been much interest in using deep learning techniques for image compression, and there have been claims that several of these methods produce better results than engineered compression schemes (such as JPEG, JPEG2000, or BPG).
no code implementations • 18 Jul 2019 • Yash Patel, Srikar Appalaraju, R. Manmatha
In several cases, the MS-SSIM for deep-learned techniques is higher than, say, a conventional, non-deep-learned codec such as JPEG2000 or BPG.
no code implementations • 4 Jul 2019 • Son Tran, Ming Du, Sampath Chanda, R. Manmatha, CJ Taylor
In particular, Instagram and Twitter influencers often provide images of themselves wearing different outfits, and their followers are often inspired to buy similar clothes. We propose a system to automatically find the closest visually similar clothes in an online catalog (street-to-shop search).
1 code implementation • CVPR 2018 • Chao-yuan Wu, Manzil Zaheer, Hexiang Hu, R. Manmatha, Alexander J. Smola, Philipp Krähenbühl
We propose to train a deep network directly on the compressed video.
Ranked #46 on Action Classification on Charades (using extra training data)
6 code implementations • ICCV 2017 • Chao-yuan Wu, R. Manmatha, Alexander J. Smola, Philipp Krähenbühl
In addition, we show that a simple margin based loss is sufficient to outperform all other loss functions.
Ranked #5 on Image Retrieval on CARS196
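The simple margin-based pair loss described above has the form max(0, α + y·(D − β)), pulling positives inside β − α and pushing negatives beyond β + α; a minimal numpy sketch (the β and α values here are illustrative, and the paper learns β rather than fixing it):

```python
import numpy as np

def margin_loss(d, is_positive, beta=1.2, alpha=0.2):
    """Margin-based pair loss: y = +1 for positive pairs, -1 for
    negatives; loss is zero once a pair is inside its margin."""
    y = np.where(is_positive, 1.0, -1.0)
    return np.maximum(0.0, alpha + y * (d - beta))

# a positive pair at distance 1.3 and a negative pair at distance 1.2
losses = margin_loss(np.array([1.3, 1.2]), np.array([True, False]))
```

Unlike the triplet loss, the boundary β decouples "how close positives must be" from "how far negatives must be," which is part of why so simple a loss is competitive.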
no code implementations • CVPR 2016 • Venkatesh N. Murthy, Vivek Singh, Terrence Chen, R. Manmatha, Dorin Comaniciu
During the learning phase, starting from the root network node, DDN automatically builds a network that splits the data into disjoint clusters of classes which would be handled by the subsequent expert networks.