Search Results for author: Makarand Tapaswi

Found 35 papers, 20 papers with code

FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos

no code implementations 15 Jan 2024 Darshan Singh S, Zeeshan Khan, Makarand Tapaswi

We use semantic role labels (SRL) and verb information to create rule-based detailed captions, ensuring they capture most of the visual concepts.
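As a loose illustration of how such rule-based captions might be assembled from SRL output (the role names follow standard PropBank conventions, but the template is hypothetical, not FiGCLIP's actual rules):

```python
# Hypothetical sketch: composing a caption from a PropBank-style SRL frame.
# The template itself is illustrative, not the paper's rule set.
def caption_from_srl(frame: dict) -> str:
    parts = [frame.get("ARG0", ""), frame["VERB"], frame.get("ARG1", "")]
    if "ARGM-LOC" in frame:
        parts.append(f"in {frame['ARGM-LOC']}")
    return " ".join(p for p in parts if p)

print(caption_from_srl(
    {"ARG0": "a woman", "VERB": "opens", "ARG1": "a door", "ARGM-LOC": "the kitchen"}
))  # -> "a woman opens a door in the kitchen"
```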

Eye vs. AI: Human Gaze and Model Attention in Video Memorability

no code implementations 26 Nov 2023 Prajneya Kumar, Eshika Khandelwal, Makarand Tapaswi, Vishnu Sreekumar

Understanding the factors that determine video memorability has important applications in areas such as educational technology and advertising.

Panoptic Segmentation

Generalized Cross-domain Multi-label Few-shot Learning for Chest X-rays

no code implementations 8 Sep 2023 Aroof Aimen, Arsh Verma, Makarand Tapaswi, Narayanan C. Krishnan

Real-world application of chest X-ray abnormality classification requires dealing with several challenges: (i) limited training data; (ii) training and evaluation sets that are derived from different domains; and (iii) classes that appear during training may have partial overlap with classes of interest during evaluation.

Few-Shot Learning Transfer Learning

How you feelin'? Learning Emotions and Mental States in Movie Scenes

1 code implementation CVPR 2023 Dhruv Srivastava, Aditya Kumar Singh, Makarand Tapaswi

Towards this goal, we formulate emotion understanding as predicting a diverse and multi-label set of emotions at the level of a movie scene and for each character.

Emotion Recognition Multi-Label Classification
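Predicting a diverse, multi-label set of emotions per scene and per character amounts to attaching an independent sigmoid to each emotion. A minimal sketch of such a head follows (sizes are illustrative assumptions; this is not the authors' full model):

```python
import torch
import torch.nn as nn

NUM_EMOTIONS = 26  # size of the emotion vocabulary (assumed for illustration)
FEAT_DIM = 512     # scene/character feature size (assumed for illustration)

head = nn.Linear(FEAT_DIM, NUM_EMOTIONS)   # one logit per emotion
criterion = nn.BCEWithLogitsLoss()         # independent per-label sigmoids

features = torch.randn(8, FEAT_DIM)                       # batch of features
targets = torch.randint(0, 2, (8, NUM_EMOTIONS)).float()  # multi-hot labels

loss = criterion(head(features), targets)  # multi-label training objective
probs = torch.sigmoid(head(features))      # per-emotion probabilities at test time
```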

GrapeQA: GRaph Augmentation and Pruning to Enhance Question-Answering

no code implementations 22 Mar 2023 Dhaval Taunk, Lakshya Khanna, Pavan Kandru, Vasudeva Varma, Charu Sharma, Makarand Tapaswi

Commonsense question-answering (QA) methods combine the power of pre-trained Language Models (LM) with the reasoning provided by Knowledge Graphs (KG).

Common Sense Reasoning Knowledge Graphs +1

Test of Time: Instilling Video-Language Models with a Sense of Time

1 code implementation CVPR 2023 Piyush Bagad, Makarand Tapaswi, Cees G. M. Snoek

Our work serves as a first step towards probing and instilling a sense of time in existing video-language models without the need for data and compute-intense training from scratch.

Ranked #1 on Video-Text Retrieval on Test-of-Time (using extra training data)

Video-Text Retrieval Video Understanding
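A sense of time can be probed with a simple pairwise test: does the model score a temporally consistent video-caption pair (e.g., "A before B") above its order-swapped counterpart? A minimal sketch of such a probe, with toy scores standing in for model outputs (the paper's exact protocol may differ):

```python
import torch

def time_order_accuracy(sim_correct: torch.Tensor,
                        sim_swapped: torch.Tensor) -> float:
    """Fraction of examples where the temporally correct pairing
    scores higher than the order-swapped one."""
    return (sim_correct > sim_swapped).float().mean().item()

# Toy similarities; in practice these would be video-language model
# scores for "A before B" vs. "B before A" captions of the same video.
sim_correct = torch.tensor([0.8, 0.6, 0.7, 0.2])
sim_swapped = torch.tensor([0.5, 0.7, 0.3, 0.1])
print(time_order_accuracy(sim_correct, sim_swapped))  # 0.75
```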

Sonus Texere! Automated Dense Soundtrack Construction for Books using Movie Adaptations

no code implementations 2 Dec 2022 Jaidev Shriram, Makarand Tapaswi, Vinoo Alluri

Reading, much like music listening, is an immersive experience that transports readers while taking them on an emotional journey.

Can we Adopt Self-supervised Pretraining for Chest X-Rays?

no code implementations 23 Nov 2022 Arsh Verma, Makarand Tapaswi

Chest radiograph (or Chest X-Ray, CXR) is a popular medical imaging modality that is used by radiologists across the world to diagnose heart or lung conditions.

Language Conditioned Spatial Relation Reasoning for 3D Object Grounding

1 code implementation 17 Nov 2022 ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

In this work we propose a language-conditioned transformer model for grounding 3D objects and their spatial relations.

Object Relation

Unsupervised Audio-Visual Lecture Segmentation

no code implementations 29 Oct 2022 Darshan Singh S, Anchit Gupta, C. V. Jawahar, Makarand Tapaswi

We formulate lecture segmentation as an unsupervised task that leverages visual, textual, and OCR cues from the lecture, while clip representations are fine-tuned on a pretext self-supervised task of matching the narration with the temporally aligned visual content.

Navigate Optical Character Recognition (OCR) +1
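The pretext task of matching narration to temporally aligned visual content is naturally contrastive. A minimal sketch, assuming a symmetric InfoNCE objective over a batch of aligned (clip, narration) embedding pairs (not necessarily the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def matching_loss(visual: torch.Tensor, narration: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: the i-th clip and i-th narration are a
    positive pair; all other pairings in the batch are negatives."""
    v = F.normalize(visual, dim=-1)
    t = F.normalize(narration, dim=-1)
    logits = v @ t.T / temperature          # (B, B) similarity matrix
    labels = torch.arange(len(v))           # diagonal entries are positives
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

loss = matching_loss(torch.randn(16, 256), torch.randn(16, 256))
```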

Grounded Video Situation Recognition

no code implementations 19 Oct 2022 Zeeshan Khan, C. V. Jawahar, Makarand Tapaswi

Recently, Video Situation Recognition (VidSitu) has been framed as a task of structured prediction over multiple events, their relationships, and verb-role pairs attached to descriptive entities.

Descriptive Structured Prediction +1

Instruction-driven history-aware policies for robotic manipulations

2 code implementations 11 Sep 2022 Pierre-Louis Guhur, ShiZhe Chen, Ricardo Garcia, Makarand Tapaswi, Ivan Laptev, Cordelia Schmid

In human environments, robots are expected to accomplish a variety of manipulation tasks given simple natural language instructions.

Ranked #2 on Robot Manipulation on RLBench (Succ. Rate (10 tasks, 100 demos/task) metric)

Robot Manipulation

Learning from Unlabeled 3D Environments for Vision-and-Language Navigation

1 code implementation 24 Aug 2022 ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

Our resulting HM3D-AutoVLN dataset is an order of magnitude larger than existing VLN datasets in terms of navigation environments and instructions.

Language Modelling Navigate +3

Learning Object Manipulation Skills from Video via Approximate Differentiable Physics

2 code implementations 3 Aug 2022 Vladimir Petrik, Mohammad Nomaan Qureshi, Josef Sivic, Makarand Tapaswi

We evaluate our approach on a 3D reconstruction task that consists of 54 video demonstrations sourced from 9 actions, such as "pull something from right to left" or "put something in front of something".

3D Reconstruction Friction +1

Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation

1 code implementation CVPR 2022 ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

To balance the complexity of large action space reasoning and fine-grained language grounding, we dynamically combine a fine-scale encoding over local observations and a coarse-scale encoding on a global map via graph transformers.

Efficient Exploration Navigate +2
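One way to picture the dynamic combination of the two scales is a learned gate over a fine-scale local encoding and a coarse-scale global-map encoding. The sketch below illustrates that idea only; it is not the paper's graph-transformer architecture:

```python
import torch
import torch.nn as nn

class DualScaleGate(nn.Module):
    """Illustrative fusion: a sigmoid gate decides, per example, how much
    to trust the fine-scale local encoding vs. the coarse global one."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())

    def forward(self, local_feat: torch.Tensor, global_feat: torch.Tensor):
        w = self.gate(torch.cat([local_feat, global_feat], dim=-1))
        return w * local_feat + (1 - w) * global_feat

fusion = DualScaleGate(dim=256)
fused = fusion(torch.randn(4, 256), torch.randn(4, 256))  # (4, 256)
```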

Airbert: In-domain Pretraining for Vision-and-Language Navigation

2 code implementations ICCV 2021 Pierre-Louis Guhur, Makarand Tapaswi, ShiZhe Chen, Ivan Laptev, Cordelia Schmid

Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging.

Navigate Referring Expression +1

Learning Object Manipulation Skills via Approximate State Estimation from Real Videos

1 code implementation 13 Nov 2020 Vladimír Petrík, Makarand Tapaswi, Ivan Laptev, Josef Sivic

We evaluate our method on simple single- and two-object actions from the Something-Something dataset.

Object

Deep Multimodal Feature Encoding for Video Ordering

1 code implementation 5 Apr 2020 Vivek Sharma, Makarand Tapaswi, Rainer Stiefelhagen

True understanding of a video comes from a joint analysis of all its modalities: the video frames, the audio track, and any accompanying text such as closed captions.

Action Recognition

Learning Interactions and Relationships between Movie Characters

1 code implementation CVPR 2020 Anna Kukleva, Makarand Tapaswi, Ivan Laptev

Localizing the pair of interacting characters in video is a time-consuming process; instead, we train our model to learn from clip-level weak labels.
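Learning from clip-level weak labels is typically cast as multiple instance learning: score every candidate character pair in a clip and pool, so only the clip needs a label. A minimal sketch of max-pooling over pair scores (an illustration, not necessarily the paper's exact mechanism):

```python
import torch

def clip_logits(pair_logits: torch.Tensor) -> torch.Tensor:
    """Weakly supervised pooling: the clip-level interaction logit is the
    max over all candidate character-pair logits, so training only needs
    clip-level labels, never the identity of the interacting pair."""
    return pair_logits.max(dim=-1).values

pair_scores = torch.randn(8, 10)       # 8 clips, 10 candidate pairs each (toy)
print(clip_logits(pair_scores).shape)  # torch.Size([8]) -- one logit per clip
```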

The Shmoop Corpus: A Dataset of Stories with Loosely Aligned Summaries

1 code implementation 30 Dec 2019 Atef Chaudhury, Makarand Tapaswi, Seung Wook Kim, Sanja Fidler

Understanding stories is a challenging reading comprehension problem for machines as it requires reading a large volume of text and following long-range dependencies.

Abstractive Text Summarization Question Answering +1

Video Face Clustering with Unknown Number of Clusters

1 code implementation ICCV 2019 Makarand Tapaswi, Marc T. Law, Sanja Fidler

Understanding videos such as TV series and movies requires analyzing who the characters are and what they are doing.

Clustering Face Clustering +1
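Clustering without knowing the number of clusters can be illustrated with threshold-based agglomerative clustering: merge face tracks until the closest pair exceeds a distance threshold, and let the cluster count fall out. This generic sketch only stands in for the paper's approach, which instead learns the embedding space itself:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
embeddings = rng.random((100, 128))   # toy face-track embeddings

clusterer = AgglomerativeClustering(
    n_clusters=None,          # do not fix the number of clusters
    distance_threshold=1.0,   # stop merging beyond this distance (tunable)
    linkage="average",
)
labels = clusterer.fit_predict(embeddings)
print(labels.max() + 1)       # number of clusters discovered by the threshold
```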

Visual Reasoning by Progressive Module Networks

1 code implementation ICLR 2019 Seung Wook Kim, Makarand Tapaswi, Sanja Fidler

Thus, a module for a new task learns to query existing modules and composes their outputs in order to produce its own output.

Visual Reasoning

Now You Shake Me: Towards Automatic 4D Cinema

no code implementations CVPR 2018 Yuhao Zhou, Makarand Tapaswi, Sanja Fidler

We are interested in enabling automatic 4D cinema by parsing physical and special effects from untrimmed movies.

MovieGraphs: Towards Understanding Human-Centric Situations from Videos

no code implementations CVPR 2018 Paul Vicol, Makarand Tapaswi, Lluis Castrejon, Sanja Fidler

Towards this goal, we introduce a novel dataset called MovieGraphs which provides detailed, graph-based annotations of social situations depicted in movie clips.

Common Sense Reasoning

Relaxed Earth Mover's Distances for Chain- and Tree-connected Spaces and their use as a Loss Function in Deep Learning

no code implementations 22 Nov 2016 Manuel Martinez, Monica Haurilet, Ziad Al-Halah, Makarand Tapaswi, Rainer Stiefelhagen

The Earth Mover's Distance (EMD) computes the optimal cost of transforming one distribution into another, given a known transport metric between them.

Small Data Image Classification
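For a chain-connected space (totally ordered classes with unit ground distance), the EMD reduces to the L1 distance between the two cumulative distributions, which makes it cheap and differentiable as a loss. A minimal sketch of that closed form:

```python
import torch

def emd_1d(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """EMD over a chain with unit ground distance: the L1 distance
    between CDFs. p, q: (batch, num_classes), rows summing to 1."""
    return (torch.cumsum(p, dim=-1) - torch.cumsum(q, dim=-1)).abs().sum(dim=-1)

p = torch.tensor([[1.0, 0.0, 0.0]])
q = torch.tensor([[0.0, 0.0, 1.0]])
print(emd_1d(p, q))  # tensor([2.]) -- all mass travels two steps along the chain
```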

Book2Movie: Aligning Video Scenes With Book Chapters

no code implementations CVPR 2015 Makarand Tapaswi, Martin Bauml, Rainer Stiefelhagen

Such an alignment facilitates finding differences between the adaptation and the original source, and also acts as a basis for deriving rich descriptions from the novel for the video clips.

Video Alignment

StoryGraphs: Visualizing Character Interactions as a Timeline

1 code implementation CVPR 2014 Makarand Tapaswi, Martin Bauml, Rainer Stiefelhagen

We present a novel way to automatically summarize and represent the storyline of a TV episode by visualizing character interactions as a chart.

Person Identification
