no code implementations • 6 Apr 2024 • Pei Wang, Zhaowei Cai, Hao Yang, Ashwin Swaminathan, R. Manmatha, Stefano Soatto
Existing unified image segmentation models either employ a unified architecture across multiple tasks but use separate weights tailored to each dataset, or apply a single set of weights to multiple datasets but are limited to a single task.
no code implementations • 3 Apr 2024 • Hao Li, Yang Zou, Ying Wang, Orchid Majumder, Yusheng Xie, R. Manmatha, Ashwin Swaminathan, Zhuowen Tu, Stefano Ermon, Stefano Soatto
On the data scaling side, we show that the quality and diversity of the training set matter more than raw dataset size.
no code implementations • 15 Nov 2023 • Peng Tang, Pengkai Zhu, Tian Li, Srikar Appalaraju, Vijay Mahadevan, R. Manmatha
Based on the multi-exit model, we perform step-level dynamic early exit during inference: at each individual decoding step, the model may decide to use fewer decoder layers based on the confidence of the current layer.
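As a sketch of the step-level early-exit idea (not the paper's implementation; layers, classifier, and threshold are illustrative), each decoding step runs decoder layers one at a time and stops as soon as the classifier's max softmax probability clears a confidence threshold:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step_with_early_exit(hidden, layers, classifier, threshold=0.9):
    """Run decoder layers sequentially; exit once the classifier's
    confidence (max softmax probability) passes the threshold."""
    for depth, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        probs = softmax(classifier(hidden))
        if probs.max() >= threshold:
            return probs.argmax(), depth  # early exit at this depth
    return probs.argmax(), depth          # used all layers

# toy model: near-identity layers and a linear classifier
rng = np.random.default_rng(0)
layers = [lambda h, W=rng.normal(size=(8, 8)) * 0.1 + np.eye(8): h @ W
          for _ in range(6)]
W_cls = rng.normal(size=(8, 4))
token, depth_used = decode_step_with_early_exit(
    rng.normal(size=8), layers, lambda h: h @ W_cls, threshold=0.5)
```

Steps where the model is already confident thus pay for fewer layers, which is where the inference speedup comes from.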
no code implementations • 15 Nov 2023 • Peng Tang, Srikar Appalaraju, R. Manmatha, Yusheng Xie, Vijay Mahadevan
We present Multiple-Question Multiple-Answer (MQMA), a novel approach to text-VQA in encoder-decoder transformer models.
no code implementations • ICCV 2023 • Haofu Liao, Aruni RoyChowdhury, Weijian Li, Ankan Bansal, Yuting Zhang, Zhuowen Tu, Ravi Kumar Satzoda, R. Manmatha, Vijay Mahadevan
We present a new formulation for structured information extraction (SIE) from visually rich documents.
Ranked #2 on Entity Linking on FUNSD
1 code implementation • 2 Jun 2023 • Srikar Appalaraju, Peng Tang, Qi Dong, Nishant Sankaran, Yichu Zhou, R. Manmatha
We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding (VDU).
Ranked #9 on Visual Question Answering (VQA) on DocVQA test (using extra training data)
1 code implementation • CVPR 2023 • Jiang Liu, Hui Ding, Zhaowei Cai, Yuting Zhang, Ravi Kumar Satzoda, Vijay Mahadevan, R. Manmatha
In this work, instead of directly predicting the pixel-level segmentation masks, the problem of referring image segmentation is formulated as sequential polygon generation, and the predicted polygons can be later converted into segmentation masks.
Ranked #1 on Referring Expression Segmentation on ReferIt (using extra training data)
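The conversion step mentioned above, turning predicted polygon vertices into a segmentation mask, can be sketched with a plain even-odd rasterization (a minimal numpy illustration, not the paper's code):

```python
import numpy as np

def polygon_to_mask(vertices, height, width):
    """Rasterize a polygon (list of (x, y) vertices) into a binary mask
    using an even-odd crossing test evaluated at pixel centers."""
    ys, xs = np.mgrid[0:height, 0:width]
    px, py = xs + 0.5, ys + 0.5          # sample at pixel centers
    inside = np.zeros((height, width), dtype=bool)
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        crosses = (y0 > py) != (y1 > py)            # edge spans this row?
        x_int = x0 + (py - y0) * (x1 - x0) / (y1 - y0 + 1e-12)
        inside ^= crosses & (px < x_int)            # toggle parity
    return inside

square = [(2, 2), (8, 2), (8, 8), (2, 8)]
mask = polygon_to_mask(square, 10, 10)
```

Because the model only has to emit a short vertex sequence rather than a dense per-pixel map, the mask itself is recovered by this cheap post-processing step.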
no code implementations • 7 Feb 2023 • Yash Patel, Yusheng Xie, Yi Zhu, Srikar Appalaraju, R. Manmatha
Instead of purely relying on the alignment from the noisy data, this paper proposes a novel loss function termed SimCon, which accounts for intra-modal similarities to determine the appropriate set of positive samples to align.
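To illustrate the idea of using intra-modal similarity to pick the positive set (a hedged sketch of the concept, not the exact SimCon loss; the threshold and temperature are assumptions), texts whose paired images are highly similar to the anchor image can be treated as additional positives in a cross-modal contrastive objective:

```python
import numpy as np

def simcon_style_loss(img_emb, txt_emb, tau=0.07, pos_thresh=0.9):
    """Cross-modal contrastive loss where, besides the matched caption,
    captions of images highly similar to the anchor image (intra-modal
    similarity >= pos_thresh) also count as positives."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / tau            # image-to-text similarities
    intra = img @ img.T                   # intra-modal (image-image)
    positives = intra >= pos_thresh       # always includes the diagonal
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # average log-likelihood over each anchor's positive set
    per_anchor = (positives * log_prob).sum(1) / positives.sum(1)
    return -per_anchor.mean()
```

The point of the intra-modal term is robustness to noisy web pairs: a near-duplicate image with a different caption is no longer forced to be a negative.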
1 code implementation • 15 Nov 2022 • Chih-Hui Ho, Srikar Appalaraju, Bhavan Jasani, R. Manmatha, Nuno Vasconcelos
We present YORO - a multi-modal transformer encoder-only architecture for the Visual Grounding (VG) task.
2 code implementations • 5 Aug 2022 • Roi Ronen, Shahar Tsiper, Oron Anschel, Inbal Lavi, Amir Markovitz, R. Manmatha
In recent years, the dominant paradigm for text spotting is to combine the tasks of text detection and recognition into a single end-to-end framework.
Ranked #6 on Text Spotting on Total-Text
no code implementations • CVPR 2022 • Yair Kittenplon, Inbal Lavi, Sharon Fogel, Yarin Bar, R. Manmatha, Pietro Perona
Text spotting end-to-end methods have recently gained attention in the literature due to the benefits of jointly optimizing the text detection and recognition components.
1 code implementation • CVPR 2022 • Ali Furkan Biten, Ron Litman, Yusheng Xie, Srikar Appalaraju, R. Manmatha
Accounting for this, we propose a single objective pre-training scheme that requires only text and spatial cues.
1 code implementation • ICCV 2021 • Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, R. Manmatha
DocFormer uses text, vision and spatial features and combines them using a novel multi-modal self-attention layer.
Ranked #3 on Document Image Classification on RVL-CDIP
no code implementations • 23 Dec 2020 • Ron Slossberg, Oron Anschel, Amir Markovitz, Ron Litman, Aviad Aberdam, Shahar Tsiper, Shai Mazor, Jon Wu, R. Manmatha
Although the topic of confidence calibration has been an active research area for the last several decades, the case of structured and sequence prediction calibration has been scarcely explored.
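One common calibration tool in this setting is temperature scaling applied per decoding step; the sketch below (an assumption for illustration, not this paper's method) scores a predicted sequence as the product of temperature-scaled per-step max probabilities:

```python
import numpy as np

def sequence_confidence(step_logits, T=1.0):
    """Word-level confidence sketch: temperature-scale each decoding
    step's logits and multiply the per-step max probabilities."""
    conf = 1.0
    for logits in step_logits:
        p = np.exp(logits / T)
        p /= p.sum()
        conf *= p.max()
    return conf
```

Raising T flattens each step's distribution, so an overconfident sequence score shrinks toward chance, which is exactly the knob calibration tunes on held-out data.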
2 code implementations • CVPR 2021 • Aviad Aberdam, Ron Litman, Shahar Tsiper, Oron Anschel, Ron Slossberg, Shai Mazor, R. Manmatha, Pietro Perona
We propose a framework for sequence-to-sequence contrastive learning (SeqCLR) of visual representations, which we apply to text recognition.
1 code implementation • 11 Dec 2020 • Yi Zhu, Xinyu Li, Chunhui Liu, Mohammadreza Zolfaghari, Yuanjun Xiong, Chongruo Wu, Zhi Zhang, Joseph Tighe, R. Manmatha, Mu Li
Video action recognition is one of the representative tasks for video understanding.
no code implementations • 20 Aug 2020 • Minesh Mathew, Ruben Tito, Dimosthenis Karatzas, R. Manmatha, C. V. Jawahar
For Task 1, a new dataset is introduced comprising 50,000 question-answer pairs defined over 12,767 document images.
no code implementations • 30 Apr 2020 • Yi Zhu, Zhongyue Zhang, Chongruo Wu, Zhi Zhang, Tong He, Hang Zhang, R. Manmatha, Mu Li, Alexander Smola
In the case of semantic segmentation, this means that large amounts of pixelwise annotations are required to learn accurate models.
35 code implementations • 19 Apr 2020 • Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Haibin Lin, Zhi Zhang, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, Mu Li, Alexander Smola
It is well known that feature-map attention and multi-path representation are important for visual recognition.
Ranked #8 on Instance Segmentation on COCO test-dev (APM metric)
2 code implementations • CVPR 2020 • Ron Litman, Oron Anschel, Shahar Tsiper, Roee Litman, Shai Mazor, R. Manmatha
The first attention step re-weights visual features from a CNN backbone together with contextual features computed by a BiLSTM layer.
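A minimal numpy sketch of that first re-weighting step (shapes and projection matrices are illustrative assumptions, not the paper's architecture): score each spatial position from its visual feature together with its BiLSTM contextual feature, then scale the visual features by the resulting attention map:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentional_reweight(visual, context, W_v, W_c, w):
    """Score each of T positions from its visual feature (T, Dv) and its
    contextual feature (T, Dc), then re-weight the visual features by
    the softmax-normalized attention map."""
    scores = np.tanh(visual @ W_v + context @ W_c) @ w   # (T,)
    alpha = softmax(scores)
    return alpha[:, None] * visual, alpha
```

Mixing in the BiLSTM context lets the attention map depend on what the surrounding sequence has already disambiguated, not on local appearance alone.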
no code implementations • 12 Feb 2020 • Yash Patel, Srikar Appalaraju, R. Manmatha
The proposed compression model incorporates the salient regions and optimizes on the proposed perceptual similarity metric.
no code implementations • 9 Aug 2019 • Yash Patel, Srikar Appalaraju, R. Manmatha
Recently, there has been much interest in using deep learning techniques for image compression, and there have been claims that several of these methods produce better results than engineered compression schemes (such as JPEG, JPEG2000, or BPG).
no code implementations • 18 Jul 2019 • Yash Patel, Srikar Appalaraju, R. Manmatha
In several cases, the MS-SSIM for deep-learned techniques is higher than, say, a conventional, non-deep-learned codec such as JPEG2000 or BPG.
no code implementations • 4 Jul 2019 • Son Tran, Ming Du, Sampath Chanda, R. Manmatha, CJ Taylor
In particular, Instagram and Twitter influencers often provide images of themselves wearing different outfits, and their followers are often inspired to buy similar clothes. We propose a system to automatically find the closest visually similar clothes in an online catalog (street-to-shop search).
1 code implementation • CVPR 2018 • Chao-yuan Wu, Manzil Zaheer, Hexiang Hu, R. Manmatha, Alexander J. Smola, Philipp Krähenbühl
We propose to train a deep network directly on the compressed video.
Ranked #46 on Action Classification on Charades (using extra training data)
6 code implementations • ICCV 2017 • Chao-yuan Wu, R. Manmatha, Alexander J. Smola, Philipp Krähenbühl
In addition, we show that a simple margin based loss is sufficient to outperform all other loss functions.
Ranked #5 on Image Retrieval on CARS196
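The simple margin-based pair loss described above has the form max(0, α + y·(D − β)), pulling positives inside β − α and pushing negatives beyond β + α; a minimal numpy sketch (the β and α values here are illustrative, and the paper learns β rather than fixing it):

```python
import numpy as np

def margin_loss(d, is_positive, beta=1.2, alpha=0.2):
    """Margin-based pair loss: y = +1 for positive pairs, -1 for
    negatives; loss is zero once a pair is inside its margin."""
    y = np.where(is_positive, 1.0, -1.0)
    return np.maximum(0.0, alpha + y * (d - beta))

# a positive pair at distance 1.3 and a negative pair at distance 1.2
losses = margin_loss(np.array([1.3, 1.2]), np.array([True, False]))
```

Unlike the triplet loss, the boundary β decouples "how close positives must be" from "how far negatives must be," which is part of why so simple a loss is competitive.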
no code implementations • CVPR 2016 • Venkatesh N. Murthy, Vivek Singh, Terrence Chen, R. Manmatha, Dorin Comaniciu
During the learning phase, starting from the root network node, DDN automatically builds a network that splits the data into disjoint clusters of classes which would be handled by the subsequent expert networks.