Search Results for author: Cordelia Schmid

Found 190 papers, 76 papers with code

MoReVQA: Exploring Modular Reasoning Models for Video Question Answering

no code implementations 9 Apr 2024 Juhong Min, Shyamal Buch, Arsha Nagrani, Minsu Cho, Cordelia Schmid

This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework.

Question Answering Video Question Answering

Learning Correlation Structures for Vision Transformers

no code implementations 5 Apr 2024 Manjin Kim, Paul Hongsuck Seo, Cordelia Schmid, Minsu Cho

We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query interactions of attention.

Action Classification Action Recognition +2

SUGAR: Pre-training 3D Visual Representations for Robotics

no code implementations 1 Apr 2024 ShiZhe Chen, Ricardo Garcia, Ivan Laptev, Cordelia Schmid

SUGAR employs a versatile transformer-based model to jointly address five pre-training tasks, namely cross-modal knowledge distillation for semantic learning, masked point modeling to understand geometry structures, grasping pose synthesis for object affordance, 3D instance segmentation and referring expression grounding to analyze cluttered scenes.

3D Instance Segmentation 3D Object Recognition +5

Streaming Dense Video Captioning

1 code implementation 1 Apr 2024 Xingyi Zhou, Anurag Arnab, Shyamal Buch, Shen Yan, Austin Myers, Xuehan Xiong, Arsha Nagrani, Cordelia Schmid

An ideal model for dense video captioning -- predicting captions localized temporally in a video -- should be able to handle long input videos, predict rich, detailed textual descriptions, and be able to produce outputs before processing the entire video.

Dense Video Captioning

A Generative Approach for Wikipedia-Scale Visual Entity Recognition

2 code implementations 4 Mar 2024 Mathilde Caron, Ahmet Iscen, Alireza Fathi, Cordelia Schmid

In this paper, we address web-scale visual entity recognition, specifically the task of mapping a given query image to one of the 6 million existing entities in Wikipedia.

Time-, Memory- and Parameter-Efficient Visual Adaptation

no code implementations 5 Feb 2024 Otniel-Bogdan Mercea, Alexey Gritsenko, Cordelia Schmid, Anurag Arnab

Here, we outperform a prior adaptor-based method which could only scale to a 1 billion parameter backbone, or fully-finetuning a smaller backbone, with the same GPU and less training time.

Video Classification

RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks

no code implementations 11 Jan 2024 Partha Ghosh, Soubhik Sanyal, Cordelia Schmid, Bernhard Schölkopf

To capture these dependencies, our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks developed for three-dimensional object representation and employs a singular latent code to model an entire video sequence.

Generative Adversarial Network Optical Flow Estimation +1

Pixel Aligned Language Models

no code implementations 14 Dec 2023 Jiarui Xu, Xingyi Zhou, Shen Yan, Xiuye Gu, Anurag Arnab, Chen Sun, Xiaolong Wang, Cordelia Schmid

When taking locations as inputs, the model performs location-conditioned captioning, which generates captions for the indicated object or region.

Language Modelling

Dense Optical Tracking: Connecting the Dots

1 code implementation 1 Dec 2023 Guillaume Le Moing, Jean Ponce, Cordelia Schmid

Code, data, and videos showcasing the capabilities of our approach are available on the project webpage: https://16lemoing.github.io/dot.

Optical Flow Estimation Point Tracking

PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation

1 code implementation 27 Sep 2023 ShiZhe Chen, Ricardo Garcia, Cordelia Schmid, Ivan Laptev

The ability of robots to comprehend and execute manipulation tasks based on natural language instructions is a long-term goal in robotics.

Multi-Task Learning Robot Manipulation

VidChapters-7M: Video Chapters at Scale

no code implementations NeurIPS 2023 Antoine Yang, Arsha Nagrani, Ivan Laptev, Josef Sivic, Cordelia Schmid

To address this issue, we present VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total.

Dense Video Captioning Navigate

CoVR: Learning Composed Video Retrieval from Web Video Captions

1 code implementation 28 Aug 2023 Lucas Ventura, Antoine Yang, Cordelia Schmid, Gül Varol

Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image.

Composed Video Retrieval (CoVR) Language Modelling +3

POCO: 3D Pose and Shape Estimation with Confidence

no code implementations 24 Aug 2023 Sai Kumar Dwivedi, Cordelia Schmid, Hongwei Yi, Michael J. Black, Dimitrios Tzionas

To address this, we develop POCO, a novel framework for training HPS regressors to estimate not only a 3D human body, but also their confidence, in a single feed-forward pass.

Action Recognition Pose Estimation +1

UnLoc: A Unified Framework for Video Localization Tasks

1 code implementation ICCV 2023 Shen Yan, Xuehan Xiong, Arsha Nagrani, Anurag Arnab, Zhonghao Wang, Weina Ge, David Ross, Cordelia Schmid

While large-scale image-text pretrained models such as CLIP have been used for multiple video-level tasks on trimmed videos, their use for temporal localization in untrimmed videos is still a relatively unexplored task.

Action Segmentation Moment Retrieval +5

Object Goal Navigation with Recursive Implicit Maps

no code implementations 10 Aug 2023 ShiZhe Chen, Thomas Chabal, Ivan Laptev, Cordelia Schmid

Object goal navigation aims to navigate an agent to locations of a given object category in unseen environments.

Navigate Object

Robust Visual Sim-to-Real Transfer for Robotic Manipulation

no code implementations 28 Jul 2023 Ricardo Garcia, Robin Strudel, ShiZhe Chen, Etienne Arlaud, Ivan Laptev, Cordelia Schmid

While previous work mainly evaluates DR for disembodied tasks, such as pose estimation and object detection, here we systematically explore visual domain randomization methods and benchmark them on a rich set of challenging robotic manipulation tasks.

object-detection Object Detection +1

Does Visual Pretraining Help End-to-End Reasoning?

no code implementations NeurIPS 2023 Chen Sun, Calvin Luo, Xingyi Zhou, Anurag Arnab, Cordelia Schmid

A positive result would refute the common belief that explicit visual abstraction (e.g., object detection) is essential for compositional generalization on visual reasoning, and confirm the feasibility of a neural network "generalist" to solve visual recognition and reasoning tasks.

Image Classification Object +3

How can objects help action recognition?

1 code implementation CVPR 2023 Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid

In this paper, we investigate how we can use knowledge of objects to design better video models, namely to process fewer tokens and to improve recognition accuracy.

Action Recognition Object

Dense Video Object Captioning from Disjoint Supervision

1 code implementation 20 Jun 2023 Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid

We propose a new task and model for dense video object captioning -- detecting, tracking and captioning trajectories of objects in a video.

Object Sentence +2

Waffling around for Performance: Visual Classification with Random Words and Broad Concepts

1 code implementation ICCV 2023 Karsten Roth, Jae Myung Kim, A. Sophia Koepke, Oriol Vinyals, Cordelia Schmid, Zeynep Akata

The visual classification performance of vision-language models such as CLIP has been shown to benefit from additional semantic knowledge from large language models (LLMs) such as GPT-3.

Classification Language Modelling +1

Learning Video-Conditioned Policies for Unseen Manipulation Tasks

no code implementations 10 May 2023 Elliot Chane-Sane, Cordelia Schmid, Ivan Laptev

To encourage generalization to new tasks, we avoid particular tasks during training and learn our policy from unlabelled robot trajectories and corresponding robot videos.

Action Recognition Robot Manipulation +1

End-to-End Spatio-Temporal Action Localisation with Video Transformers

no code implementations 24 Apr 2023 Alexey Gritsenko, Xuehan Xiong, Josip Djolonga, Mostafa Dehghani, Chen Sun, Mario Lučić, Cordelia Schmid, Anurag Arnab

The most performant spatio-temporal action localisation models use external person proposals and complex external memory banks.

Ranked #1 on Action Recognition on AVA v2.1 (using extra training data)

Action Detection Action Recognition +1

Improving Image Recognition by Retrieving from Web-Scale Image-Text Data

no code implementations CVPR 2023 Ahmet Iscen, Alireza Fathi, Cordelia Schmid

Retrieval augmented models are becoming increasingly popular for computer vision tasks after their recent success in NLP problems.

Ranked #1 on Image Classification on WebVision-1000 (using extra training data)

Learning with noisy labels Long-tail Learning

Exposing and Mitigating Spurious Correlations for Cross-Modal Retrieval

no code implementations 6 Apr 2023 Jae Myung Kim, A. Sophia Koepke, Cordelia Schmid, Zeynep Akata

In this work, we introduce ODmAP@k, an object decorrelation metric that measures a model's robustness to spurious correlations in the training data.

Cross-Modal Retrieval Object +2

Bridging the Gap between Model Explanations in Partially Annotated Multi-label Classification

2 code implementations CVPR 2023 Youngwook Kim, Jae Myung Kim, Jieun Jeong, Cordelia Schmid, Zeynep Akata, Jungwoo Lee

Based on these findings, we propose to boost the attribution scores of the model trained with partial labels to make its explanation resemble that of the model trained with full labels.

Classification Multi-Label Classification

AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR

no code implementations CVPR 2023 Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

(ii) We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively; and finally (iii) we show that our model achieves state of the art zero-shot results on three different AV-ASR benchmarks (How2, VisSpeech and Ego4D), while also crucially preserving decent performance on traditional audio-only speech recognition benchmarks (LibriSpeech).

Automatic Speech Recognition Domain Adaptation +2

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

3 code implementations CVPR 2023 Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, Cordelia Schmid

In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos which are readily-available at scale.

Ranked #1 on Dense Video Captioning on ActivityNet Captions (using extra training data)

Dense Video Captioning Language Modelling +1

Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation

2 code implementations 20 Dec 2022 Matthieu Futeral, Cordelia Schmid, Ivan Laptev, Benoît Sagot, Rachel Bawden

One of the major challenges of machine translation (MT) is ambiguity, which can in some cases be resolved by accompanying context such as images.

Multimodal Machine Translation Translation

Audiovisual Masked Autoencoders

2 code implementations ICCV 2023 Mariana-Iuliana Georgescu, Eduardo Fonseca, Radu Tudor Ionescu, Mario Lucic, Cordelia Schmid, Anurag Arnab

Can we leverage the audiovisual information already present in video to improve self-supervised representation learning?

Ranked #1 on Audio Classification on EPIC-KITCHENS-100 (using extra training data)

Audio Classification Representation Learning

WALDO: Future Video Synthesis using Object Layer Decomposition and Parametric Flow Prediction

1 code implementation ICCV 2023 Guillaume Le Moing, Jean Ponce, Cordelia Schmid

This paper presents WALDO (WArping Layer-Decomposed Objects), a novel approach to the prediction of future video frames from past ones.

SSIM

AVATAR submission to the Ego4D AV Transcription Challenge

no code implementations 18 Nov 2022 Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

In this report, we describe our submission to the Ego4D AudioVisual (AV) Speech Transcription Challenge 2022.

Language Conditioned Spatial Relation Reasoning for 3D Object Grounding

1 code implementation 17 Nov 2022 ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

In this work we propose a language-conditioned transformer model for grounding 3D objects and their spatial relations.

Object Relation

Learning Reward Functions for Robotic Manipulation by Observing Humans

no code implementations 16 Nov 2022 Minttu Alakuijala, Gabriel Dulac-Arnold, Julien Mairal, Jean Ponce, Cordelia Schmid

Unlike prior work on leveraging human videos to teach robots, our method, Human Offline Learned Distances (HOLD), requires neither a priori data from the robot environment, nor a set of task-specific human demonstrations, nor a predefined notion of correspondence across morphologies, yet it is able to accelerate training of several manipulation tasks on a simulated robot arm compared to using only a sparse reward obtained from task completion.

Contrastive Learning

A Memory Transformer Network for Incremental Learning

no code implementations 10 Oct 2022 Ahmet Iscen, Thomas Bird, Mathilde Caron, Alireza Fathi, Cordelia Schmid

We study class-incremental learning, a training setup in which new classes of data are observed over time for the model to learn from.

Class Incremental Learning Incremental Learning

Instruction-driven history-aware policies for robotic manipulations

2 code implementations 11 Sep 2022 Pierre-Louis Guhur, ShiZhe Chen, Ricardo Garcia, Makarand Tapaswi, Ivan Laptev, Cordelia Schmid

In human environments, robots are expected to accomplish a variety of manipulation tasks given simple natural language instructions.

Ranked #2 on Robot Manipulation on RLBench (Succ. Rate (10 tasks, 100 demos/task) metric)

Robot Manipulation

Learning from Unlabeled 3D Environments for Vision-and-Language Navigation

1 code implementation 24 Aug 2022 ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

Our resulting HM3D-AutoVLN dataset is an order of magnitude larger than existing VLN datasets in terms of navigation environments and instructions.

Language Modelling Navigate +3

AlignSDF: Pose-Aligned Signed Distance Fields for Hand-Object Reconstruction

1 code implementation 26 Jul 2022 Zerui Chen, Yana Hasson, Cordelia Schmid, Ivan Laptev

We show that such aligned SDFs better focus on reconstructing shape details and improve reconstruction accuracy both for hands and objects.

hand-object pose Object Reconstruction

M&M Mix: A Multimodal Multiview Transformer Ensemble

no code implementations 20 Jun 2022 Xuehan Xiong, Anurag Arnab, Arsha Nagrani, Cordelia Schmid

This report describes the approach behind our winning solution to the 2022 Epic-Kitchens Action Recognition Challenge.

Ranked #2 on Action Recognition on EPIC-KITCHENS-100 (using extra training data)

Action Recognition Video Recognition

AVATAR: Unconstrained Audiovisual Speech Recognition

1 code implementation 15 Jun 2022 Valentin Gabeur, Paul Hongsuck Seo, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid

Audio-visual automatic speech recognition (AV-ASR) is an extension of ASR that incorporates visual cues, often from the movements of a speaker's mouth.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Learning to Answer Visual Questions from Web Videos

1 code implementation 10 May 2022 Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

We use our method to generate the WebVidVQA3M dataset from the WebVid dataset, i.e., videos with alt-text annotations, and show its benefits for training VideoQA models.

Question Answering Question Generation +4

Weakly-supervised segmentation of referring expressions

no code implementations 10 May 2022 Robin Strudel, Ivan Laptev, Cordelia Schmid

Visual grounding localizes regions (boxes or segments) in the image corresponding to given referring expressions.

Image Segmentation Referring Expression +5

Assembly Planning from Observations under Physical Constraints

no code implementations 20 Apr 2022 Thomas Chabal, Robin Strudel, Etienne Arlaud, Jean Ponce, Cordelia Schmid

This paper addresses the problem of copying an unknown assembly of primitives with known shape and appearance using information extracted from a single photograph by an off-the-shelf procedure for object detection and pose estimation.

Object object-detection +2

Learning Audio-Video Modalities from Image Captions

no code implementations 1 Apr 2022 Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, Cordelia Schmid

To close this gap we propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort.

Image Captioning Retrieval +4

The Right Spin: Learning Object Motion from Rotation-Compensated Flow Fields

no code implementations 28 Feb 2022 Pia Bideau, Erik Learned-Miller, Cordelia Schmid, Karteek Alahari

In this work, we argue that the coupling of camera rotation and camera translation can create complex motion fields that are difficult for a deep network to untangle directly.

Motion Segmentation

Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation

1 code implementation CVPR 2022 ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

To balance the complexity of large action space reasoning and fine-grained language grounding, we dynamically combine a fine-scale encoding over local observations and a coarse-scale encoding on a global map via graph transformers.

Efficient Exploration Navigate +2

Multiview Transformers for Video Recognition

1 code implementation CVPR 2022 Shen Yan, Xuehan Xiong, Anurag Arnab, Zhichao Lu, Mi Zhang, Chen Sun, Cordelia Schmid

Video understanding requires reasoning at multiple spatiotemporal resolutions -- from short fine-grained motions to events taking place over longer durations.

Ranked #5 on Action Recognition on EPIC-KITCHENS-100 (using extra training data)

Action Classification Action Recognition +1

Masking Modalities for Cross-modal Video Retrieval

no code implementations 1 Nov 2021 Valentin Gabeur, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid

Our proposal is to pre-train a video encoder using all the available video modalities as supervision, namely, appearance, sound, and transcribed speech.

Retrieval Video Retrieval

Variational Perturbations for Visual Feature Attribution

no code implementations 29 Sep 2021 Jae Myung Kim, Eunji Kim, Sungroh Yoon, Jungwoo Lee, Cordelia Schmid, Zeynep Akata

Explaining a complex black-box system in a post-hoc manner is important to understand its predictions.

Airbert: In-domain Pretraining for Vision-and-Language Navigation

2 code implementations ICCV 2021 Pierre-Louis Guhur, Makarand Tapaswi, ShiZhe Chen, Ivan Laptev, Cordelia Schmid

Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging.

Navigate Referring Expression +1

CCVS: Context-aware Controllable Video Synthesis

1 code implementation NeurIPS 2021 Guillaume Le Moing, Jean Ponce, Cordelia Schmid

The prediction model is doubly autoregressive, in the latent space of an autoencoder for forecasting, and in image space for updating contextual information, which is also used to enforce spatio-temporal consistency through a learnable optical flow module.

Optical Flow Estimation Self-Supervised Learning +2

Goal-Conditioned Reinforcement Learning with Imagined Subgoals

no code implementations 1 Jul 2021 Elliot Chane-Sane, Cordelia Schmid, Ivan Laptev

Goal-conditioned reinforcement learning endows an agent with a large variety of skills, but it often struggles to solve tasks that require more temporally extended reasoning.

reinforcement-learning Reinforcement Learning (RL)

Attention Bottlenecks for Multimodal Fusion

1 code implementation NeurIPS 2021 Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun

Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio.

Action Classification Action Recognition +2

Residual Reinforcement Learning from Demonstrations

no code implementations 15 Jun 2021 Minttu Alakuijala, Gabriel Dulac-Arnold, Julien Mairal, Jean Ponce, Cordelia Schmid

Residual reinforcement learning (RL) has been proposed as a way to solve challenging robotic tasks by adapting control actions from a conventional feedback controller to maximize a reward signal.

reinforcement-learning Reinforcement Learning (RL)

Large-Scale Unsupervised Object Discovery

1 code implementation NeurIPS 2021 Huy V. Vo, Elena Sizikova, Cordelia Schmid, Patrick Pérez, Jean Ponce

Extensive experiments on COCO and OpenImages show that, in the single-object discovery setting where a single prominent object is sought in each image, the proposed LOD (Large-scale Object Discovery) approach is on par with, or better than, the state of the art for medium-scale datasets (up to 120K images), and over 37% better than the only other algorithms capable of scaling up to 1.7M images.

Multi-object discovery Object +2

Episodic Transformer for Vision-and-Language Navigation

1 code implementation ICCV 2021 Alexander Pashevich, Cordelia Schmid, Chen Sun

We demonstrate that encoding the history with a transformer is critical to solve compositional tasks, and that pretraining and joint training with synthetic instructions further improve the performance.

Vision and Language Navigation

Class-Balanced Distillation for Long-Tailed Visual Recognition

3 code implementations 12 Apr 2021 Ahmet Iscen, André Araujo, Boqing Gong, Cordelia Schmid

An effective and simple approach to long-tailed visual recognition is to learn feature representations and a classifier separately, with instance and class-balanced sampling, respectively.

Image Classification Knowledge Distillation +1

Improving robustness against common corruptions with frequency biased models

no code implementations ICCV 2021 Tonmoy Saikia, Cordelia Schmid, Thomas Brox

CNNs perform remarkably well when the training and test distributions are i.i.d., but unseen image corruptions can cause a surprisingly large drop in performance.

Data Augmentation object-detection +1

Unified Graph Structured Models for Video Understanding

no code implementations ICCV 2021 Anurag Arnab, Chen Sun, Cordelia Schmid

Accurate video understanding involves reasoning about the relationships between actors, objects and their environment, often over long temporal intervals.

Action Detection Graph Classification +3

ViViT: A Video Vision Transformer

8 code implementations ICCV 2021 Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid

We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification.

Ranked #8 on Action Classification on MiT (Top 5 Accuracy metric, using extra training data)

Action Classification Action Recognition +4

Learning Temporal Dynamics from Cycles in Narrated Video

no code implementations ICCV 2021 Dave Epstein, Jiajun Wu, Cordelia Schmid, Chen Sun

Learning to model how the world changes as time elapses has proven a challenging problem for the computer vision community.

Image Matching with Scale Adjustment

no code implementations 10 Dec 2020 Yves Dufournaud, Cordelia Schmid, Radu Horaud

In this paper we address the problem of matching two images with two different resolutions: a high-resolution image and a low-resolution one.

Look Before you Speak: Visually Contextualized Utterances

no code implementations CVPR 2021 Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

Leveraging recent advances in multimodal learning, our model consists of a novel co-attentional multimodal video transformer, and when trained on both textual and visual context, outperforms baselines that use textual inputs alone.

Just Ask: Learning to Answer Questions from Millions of Narrated Videos

1 code implementation ICCV 2021 Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision.

Question Answering Question Generation +4

Learning Obstacle Representations for Neural Motion Planning

1 code implementation 25 Aug 2020 Robin Strudel, Ricardo Garcia, Justin Carpentier, Jean-Paul Laumond, Ivan Laptev, Cordelia Schmid

Motion planning and obstacle avoidance are key challenges in robotics applications.

Robotics

Multi-modal Transformer for Video Retrieval

1 code implementation ECCV 2020 Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid

In this paper, we present a multi-modal transformer to jointly encode the different modalities in video, which allows each of them to attend to the others.

Ranked #1 on Zero-Shot Video Retrieval on MSR-VTT (text-to-video Mean Rank metric, using extra training data)

Natural Language Queries Retrieval +2

Consistency Guided Scene Flow Estimation

no code implementations ECCV 2020 Yuhua Chen, Luc van Gool, Cordelia Schmid, Cristian Sminchisescu

To handle inherent modeling error in the consistency loss (e.g., Lambertian assumptions) and for better generalization, we further introduce a learned output refinement network, which takes the initial predictions, the loss, and the gradient as input, and efficiently predicts a correlated output update.

Scene Flow Estimation

TAO: A Large-Scale Benchmark for Tracking Any Object

no code implementations ECCV 2020 Achal Dave, Tarasha Khurana, Pavel Tokmakov, Cordelia Schmid, Deva Ramanan

To this end, we ask annotators to label objects that move at any point in the video, and give names to them post factum.

Multi-Object Tracking Object +2

What Makes for Good Views for Contrastive Learning?

1 code implementation NeurIPS 2020 Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, Phillip Isola

Contrastive learning between multiple views of the data has recently achieved state of the art performance in the field of self-supervised representation learning.

Contrastive Learning Data Augmentation +8

VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation

3 code implementations CVPR 2020 Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, Cong-Cong Li, Cordelia Schmid

Behavior prediction in dynamic, multi-agent systems is an important problem in the context of self-driving cars, due to the complex representations and interactions of road components, including moving agents (e.g., pedestrians and vehicles) and road context information (e.g., lanes, traffic lights).

Self-Driving Cars

Learning visual policies for building 3D shape categories

no code implementations 15 Apr 2020 Alexander Pashevich, Igor Kalevatykh, Ivan Laptev, Cordelia Schmid

We then show the success of our visual policies for building arches from different primitives.

Object

Memory-Efficient Incremental Learning Through Feature Adaptation

no code implementations ECCV 2020 Ahmet Iscen, Jeffrey Zhang, Svetlana Lazebnik, Cordelia Schmid

We assume that the model is updated incrementally for new classes as new data becomes available sequentially. This requires adapting the previously stored feature vectors to the updated feature space without having access to the corresponding original training images.

Incremental Learning

Speech2Action: Cross-modal Supervision for Action Recognition

no code implementations CVPR 2020 Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman

We train a BERT-based Speech2Action classifier on over a thousand movie screenplays, to predict action labels from transcribed speech segments.

Action Recognition

Selecting Relevant Features from a Multi-domain Representation for Few-shot Classification

1 code implementation ECCV 2020 Nikita Dvornik, Cordelia Schmid, Julien Mairal

Popular approaches for few-shot classification consist of first learning a generic data representation based on a large annotated dataset, before adapting the representation to new classes given only a few labeled samples.

feature selection Few-Shot Image Classification +2

Beyond the Camera: Neural Networks in World Coordinates

no code implementations 12 Mar 2020 Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Karteek Alahari

Eye movement and strategic placement of the visual field onto the retina give animals increased resolution of the scene and suppress distracting information.

Action Recognition Video Stabilization +1

Optimized Generic Feature Learning for Few-shot Classification across Domains

no code implementations 22 Jan 2020 Tonmoy Saikia, Thomas Brox, Cordelia Schmid

To learn models or features that generalize across tasks and domains is one of the grand goals of machine learning.

BIG-bench Machine Learning Classification +3

Synthetic Humans for Action Recognition from Unseen Viewpoints

1 code implementation 9 Dec 2019 Gül Varol, Ivan Laptev, Cordelia Schmid, Andrew Zisserman

Although synthetic training data has been shown to be beneficial for tasks such as human pose estimation, its use for RGB human action recognition is relatively unexplored.

Action Classification Action Recognition +2

Learning to Track Any Object

no code implementations 25 Oct 2019 Achal Dave, Pavel Tokmakov, Cordelia Schmid, Deva Ramanan

Moreover, at test time the same network can be applied to detection and tracking, resulting in a unified approach for the two tasks.

Instance Segmentation Object +5

White-box vs Black-box: Bayes Optimal Strategies for Membership Inference

no code implementations 29 Aug 2019 Alexandre Sablayrolles, Matthijs Douze, Yann Ollivier, Cordelia Schmid, Hervé Jégou

Membership inference determines, given a sample and trained parameters of a machine learning model, whether the sample was part of the training set.

Self-supervised Learning with Geometric Constraints in Monocular Video: Connecting Flow, Depth, and Camera

no code implementations ICCV 2019 Yuhua Chen, Cordelia Schmid, Cristian Sminchisescu

We present GLNet, a self-supervised framework for learning depth, optical flow, camera pose and intrinsic parameters from monocular video - addressing the difficulty of acquiring realistic ground-truth for such tasks.

Monocular Depth Estimation Optical Flow Estimation +3

Learning Video Representations using Contrastive Bidirectional Transformer

no code implementations 13 Jun 2019 Chen Sun, Fabien Baradel, Kevin Murphy, Cordelia Schmid

This paper proposes a self-supervised learning approach for video features that results in significantly improved performance on downstream tasks (such as video classification, captioning and segmentation) compared to existing methods.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

A Study on Action Detection in the Wild

no code implementations 29 Apr 2019 Yubo Zhang, Pavel Tokmakov, Martial Hebert, Cordelia Schmid

In this work we study the problem of action detection in a highly-imbalanced dataset.

Action Detection

Learning to Augment Synthetic Images for Sim2Real Policy Transfer

1 code implementation 18 Mar 2019 Alexander Pashevich, Robin Strudel, Igor Kalevatykh, Ivan Laptev, Cordelia Schmid

Policies learned in simulators, however, do not transfer well to real scenes given the domain gap between real and synthetic data.

Object Localization

Adaptive Density Estimation for Generative Models

no code implementations NeurIPS 2019 Thomas Lucas, Konstantin Shmelkov, Karteek Alahari, Cordelia Schmid, Jakob Verbeek

We show that our model significantly improves over existing hybrid models: offering GAN-like samples, IS and FID scores that are competitive with fully adversarial models, and improved likelihood scores.

Density Estimation

Detecting unseen visual relations using analogies

no code implementations ICCV 2019 Julia Peyre, Ivan Laptev, Cordelia Schmid, Josef Sivic

We seek to detect visual relations in images of the form of triplets t = (subject, predicate, object), such as "person riding dog", where training examples of the individual entities are available but their combinations are unseen at training.

Retrieval

A Structured Model For Action Detection

no code implementations CVPR 2019 Yubo Zhang, Pavel Tokmakov, Martial Hebert, Cordelia Schmid

A dominant paradigm for learning-based approaches in computer vision is training generic models, such as ResNet for image recognition, or I3D for video understanding, on large datasets and allowing them to discover the optimal representation for the problem at hand.

Action Detection Video Understanding

Modulated Policy Hierarchies

no code implementations30 Nov 2018 Alexander Pashevich, Danijar Hafner, James Davidson, Rahul Sukthankar, Cordelia Schmid

To achieve this, we study different modulation signals and exploration for hierarchical controllers.

Reinforcement Learning (RL)

Coverage and Quality Driven Training of Generative Image Models

no code implementations27 Sep 2018 Thomas Lucas, Konstantin Shmelkov, Karteek Alahari, Cordelia Schmid, Jakob Verbeek

First, we propose a model that extends variational autoencoders by using deterministic invertible transformation layers to map samples from the decoder to the image space.

Déjà Vu: an empirical evaluation of the memorization properties of ConvNets

no code implementations ICLR 2019 Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Hervé Jégou

Convolutional neural networks memorize part of their training data, which is why strategies such as data augmentation and drop-out are employed to mitigate overfitting.

Data Augmentation Memorization

On the Importance of Visual Context for Data Augmentation in Scene Understanding

no code implementations6 Sep 2018 Nikita Dvornik, Julien Mairal, Cordelia Schmid

In this work, we consider object detection, semantic and instance segmentation and augment the training images by blending objects in existing scenes, using instance segmentation annotations.

Data Augmentation Instance Segmentation +7

Actor-Centric Relation Network

1 code implementation ECCV 2018 Chen Sun, Abhinav Shrivastava, Carl Vondrick, Kevin Murphy, Rahul Sukthankar, Cordelia Schmid

A visualization of the learned relation features confirms that our approach is able to attend to the relevant relations for each action.

Action Classification Action Detection +5

End-to-End Incremental Learning

5 code implementations ECCV 2018 Francisco M. Castro, Manuel J. Marín-Jiménez, Nicolás Guil, Cordelia Schmid, Karteek Alahari

Although deep learning approaches have stood out in recent years due to their state-of-the-art results, they continue to suffer from catastrophic forgetting, a dramatic decrease in overall performance when training with new classes added incrementally.

Image Classification Incremental Learning

How good is my GAN?

no code implementations ECCV 2018 Konstantin Shmelkov, Cordelia Schmid, Karteek Alahari

Generative adversarial networks (GANs) are one of the most popular methods for generating images today.

General Classification Image Classification

Modeling Visual Context is Key to Augmenting Object Detection Datasets

2 code implementations ECCV 2018 Nikita Dvornik, Julien Mairal, Cordelia Schmid

For this approach to be successful, we show that modeling appropriately the visual context surrounding objects is crucial to place them in the right environment.

Data Augmentation object-detection +1

Modeling Spatio-Temporal Human Track Structure for Action Localization

no code implementations28 Jun 2018 Guilhem Chéron, Anton Osokin, Ivan Laptev, Cordelia Schmid

In order to localize actions in time, we propose a recurrent localization network (RecLNet) designed to model the temporal structure of actions on the level of person tracks.

Human Detection Optical Flow Estimation +3

Spreading vectors for similarity search

2 code implementations ICLR 2019 Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Hervé Jégou

Discretizing multi-dimensional data distributions is a fundamental step of modern indexing methods.

Quantization

Unsupervised Learning of Artistic Styles with Archetypal Style Analysis

no code implementations NeurIPS 2018 Daan Wynen, Cordelia Schmid, Julien Mairal

In this paper, we introduce an unsupervised learning approach to automatically discover, summarize, and manipulate artistic styles from large collections of paintings.

Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos

no code implementations25 Apr 2018 Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, Karteek Alahari

In this paper we describe the egocentric aspect of the dataset and present annotations for Charades-Ego with 68,536 activity instances in 68.8 hours of first and third-person video, making it one of the largest and most diverse egocentric datasets available.

General Classification Video Classification +1

Actor and Observer: Joint Modeling of First and Third-Person Videos

1 code implementation CVPR 2018 Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, Karteek Alahari

Several theories in cognitive neuroscience suggest that when people interact with the world, or simulate interactions, they do so from a first-person egocentric perspective, and seamlessly transfer knowledge between third-person (observer) and first-person (actor).

Action Recognition Temporal Action Localization

Image-based Synthesis for Deep 3D Human Pose Estimation

no code implementations12 Feb 2018 Grégory Rogez, Cordelia Schmid

Here, we propose a solution to generate a large set of photorealistic synthetic images of humans with 3D pose annotations.

3D Human Pose Estimation 3D Pose Estimation +1

Learning to Segment Moving Objects

no code implementations1 Dec 2017 Pavel Tokmakov, Cordelia Schmid, Karteek Alahari

We formulate this as a learning problem and design our framework with three cues: (i) independent object motion between a pair of frames, which complements object recognition, (ii) object appearance, which helps to correct errors in motion estimation, and (iii) temporal consistency, which imposes additional constraints on the segmentation.

Motion Estimation Motion Segmentation +4

Incremental Learning of Object Detectors without Catastrophic Forgetting

3 code implementations ICCV 2017 Konstantin Shmelkov, Cordelia Schmid, Karteek Alahari

Despite their success for object detection, convolutional neural networks are ill-equipped for incremental learning, i.e., adapting the original model trained on a set of classes to additionally detect objects of new classes, in the absence of the initial training data.

Incremental Learning Object +2

Detecting Parts for Action Localization

no code implementations19 Jul 2017 Nicolas Chesneau, Grégory Rogez, Karteek Alahari, Cordelia Schmid

In this paper, we propose a new framework for action localization that tracks people in videos and extracts full-body human tubes, i.e., spatio-temporal regions localizing actions, even in the case of occlusions or truncations.

Action Localization

Developing the Path Signature Methodology and its Application to Landmark-based Human Action Recognition

no code implementations13 Jul 2017 Weixin Yang, Terry Lyons, Hao Ni, Cordelia Schmid, Lianwen Jin

To this end, we regard the evolving landmark data as a high-dimensional path and apply non-linear path signature techniques to provide an expressive, robust, non-linear, and interpretable representation for the sequential events.

Action Classification Action Recognition In Videos +1

AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions

8 code implementations CVPR 2018 Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik

The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently.

Actin Detection Action Detection +3

SCNet: Learning Semantic Correspondence

1 code implementation ICCV 2017 Kai Han, Rafael S. Rezende, Bumsub Ham, Kwan-Yee K. Wong, Minsu Cho, Cordelia Schmid, Jean Ponce

This paper addresses the problem of establishing semantic correspondences between images depicting different instances of the same object or scene category.

Semantic correspondence

Action Tubelet Detector for Spatio-Temporal Action Localization

2 code implementations ICCV 2017 Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, Cordelia Schmid

We propose the ACtion Tubelet detector (ACT-detector) that takes as input a sequence of frames and outputs tubelets, i.e., sequences of bounding boxes with associated scores.

Spatio-Temporal Action Localization Temporal Action Localization

SfM-Net: Learning of Structure and Motion from Video

no code implementations25 Apr 2017 Sudheendra Vijayanarasimhan, Susanna Ricco, Cordelia Schmid, Rahul Sukthankar, Katerina Fragkiadaki

We propose SfM-Net, a geometry-aware neural network for motion estimation in videos that decomposes frame-to-frame pixel motion in terms of scene and object depth, camera motion and 3D object rotations and translations.

Motion Estimation Object +1

Learning Video Object Segmentation with Visual Memory

no code implementations ICCV 2017 Pavel Tokmakov, Karteek Alahari, Cordelia Schmid

The module to build a "visual memory" in video, i.e., a joint representation of all the video frames, is realized with a convolutional recurrent unit learned from a small number of training video sequences.

Motion Segmentation Object +3

Proposal Flow: Semantic Correspondences from Object Proposals

no code implementations21 Mar 2017 Bumsub Ham, Minsu Cho, Cordelia Schmid, Jean Ponce

Finding image correspondences remains a challenging problem in the presence of intra-class variations and large changes in scene layout.

Object

Learning from Synthetic Humans

2 code implementations CVPR 2017 Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev, Cordelia Schmid

In this work we present SURREAL (Synthetic hUmans foR REAL tasks): a new large-scale dataset with synthetically-generated but realistic images of people rendered from 3D sequences of human motion capture data.

2D Human Pose Estimation 3D Human Pose Estimation +2

Learning Motion Patterns in Videos

no code implementations CVPR 2017 Pavel Tokmakov, Karteek Alahari, Cordelia Schmid

The problem of determining whether an object is in motion, irrespective of camera motion, is far from being solved.

Motion Segmentation Optical Flow Estimation +3

MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild

no code implementations NeurIPS 2016 Grégory Rogez, Cordelia Schmid

Here, we propose a solution to generate a large set of photorealistic synthetic images of humans with 3D pose annotations.

Ranked #117 on 3D Human Pose Estimation on Human3.6M (PA-MPJPE metric)

3D Human Pose Estimation 3D Pose Estimation +1

Human Action Localization with Sparse Spatial Supervision

no code implementations17 May 2016 Philippe Weinzaepfel, Xavier Martin, Cordelia Schmid

We introduce an approach for spatio-temporal human action localization using sparse spatial supervision.

Action Localization

Weakly-Supervised Semantic Segmentation using Motion Cues

no code implementations23 Mar 2016 Pavel Tokmakov, Karteek Alahari, Cordelia Schmid

We also demonstrate that the performance of M-CNN learned with 150 weak video annotations is on par with state-of-the-art weakly-supervised methods trained with thousands of images.

Image Segmentation Weakly supervised Semantic Segmentation +1

Convolutional Patch Representations for Image Retrieval: an Unsupervised Approach

no code implementations1 Mar 2016 Mattis Paulin, Julien Mairal, Matthijs Douze, Zaid Harchaoui, Florent Perronnin, Cordelia Schmid

Convolutional neural networks (CNNs) have recently received a lot of attention due to their ability to model local stationary structures in natural images in a multi-scale fashion, when learning all model parameters with supervision.

Image Classification Image Retrieval +1

Proposal Flow

no code implementations CVPR 2016 Bumsub Ham, Minsu Cho, Cordelia Schmid, Jean Ponce

Finding image correspondences remains a challenging problem in the presence of intra-class variations and large changes in scene layout. Semantic flow methods are designed to handle images depicting different instances of the same object or scene category.

Object

Approximate Fisher Kernels of non-iid Image Models for Image Categorization

no code implementations3 Oct 2015 Ramazan Gokberk Cinbis, Jakob Verbeek, Cordelia Schmid

It has been experimentally observed that the performance of BoW and FV representations can be improved by employing discounting transformations such as power normalization.

Image Categorization

Expanded Parts Model for Semantic Description of Humans in Still Images

no code implementations14 Sep 2015 Gaurav Sharma, Frederic Jurie, Cordelia Schmid

We validate our method on three recent challenging datasets of human attributes and actions.

Beat-Event Detection in Action Movie Franchises

no code implementations15 Aug 2015 Danila Potapov, Matthijs Douze, Jerome Revaud, Zaid Harchaoui, Cordelia Schmid

While important advances were recently made towards temporally localizing and recognizing specific human actions or activities in videos, efficient detection and classification of long video chunks belonging to semantically defined categories such as "pursuit" or "romance" remains challenging. We introduce a new dataset, Action Movie Franchises, consisting of a collection of Hollywood action movie franchises.

Classification Event Detection +1

Learning to track for spatio-temporal action localization

no code implementations ICCV 2015 Philippe Weinzaepfel, Zaid Harchaoui, Cordelia Schmid

We present experimental results for spatio-temporal localization on the UCF-Sports, J-HMDB and UCF-101 action localization datasets, where our approach outperforms the state of the art with a margin of 15%, 7% and 12% respectively in mAP.

Spatio-Temporal Action Localization Temporal Action Localization +1

Learning to Detect Motion Boundaries

no code implementations CVPR 2015 Philippe Weinzaepfel, Jerome Revaud, Zaid Harchaoui, Cordelia Schmid

We compare the results obtained with several state-of-the-art optical flow approaches and study the impact of the different cues used in the random forest. Furthermore, we introduce a new dataset, the YouTube Motion Boundaries dataset (YMB), that comprises 60 sequences taken from real-world videos with manually annotated motion boundaries.

Boundary Detection Optical Flow Estimation

Weakly-Supervised Alignment of Video With Text

no code implementations ICCV 2015 Piotr Bojanowski, Rémi Lajugie, Edouard Grave, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid

Given vectorial features for both video and text, we propose to cast this task as a temporal assignment problem, with an implicit linear mapping between the two feature modalities.

Sentence

Unsupervised Object Discovery and Tracking in Video Collections

no code implementations ICCV 2015 Suha Kwak, Minsu Cho, Ivan Laptev, Jean Ponce, Cordelia Schmid

This paper addresses the problem of automatically localizing dominant objects as spatio-temporal tubes in a noisy collection of videos with minimal or even no supervision.

Object Object Discovery +1

Label-Embedding for Image Classification

2 code implementations30 Mar 2015 Zeynep Akata, Florent Perronnin, Zaid Harchaoui, Cordelia Schmid

Attributes act as intermediate representations that enable parameter sharing between classes, a must when training data is scarce.

Attribute Classification +4

Weakly Supervised Object Localization with Multi-fold Multiple Instance Learning

no code implementations3 Mar 2015 Ramazan Gokberk Cinbis, Jakob Verbeek, Cordelia Schmid

In this case, the supervised information is restricted to binary labels that indicate the absence/presence of object instances in the image, without their locations.

Multiple Instance Learning Object +2

Unsupervised Object Discovery and Localization in the Wild: Part-based Matching with Bottom-up Region Proposals

no code implementations CVPR 2015 Minsu Cho, Suha Kwak, Cordelia Schmid, Jean Ponce

This paper addresses unsupervised discovery and localization of dominant objects from a noisy image collection with multiple object classes.

Object Object Discovery

Weakly Supervised Action Labeling in Videos Under Ordering Constraints

no code implementations4 Jul 2014 Piotr Bojanowski, Rémi Lajugie, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid, Josef Sivic

We are given a set of video clips, each one annotated with an ordered list of actions, such as "walk" then "sit" then "answer phone" extracted from, for example, the associated text script.

Convolutional Kernel Networks

no code implementations NeurIPS 2014 Julien Mairal, Piotr Koniusz, Zaid Harchaoui, Cordelia Schmid

An important goal in visual recognition is to devise image representations that are invariant to particular transformations.

Image Classification

Transformation Pursuit for Image Classification

no code implementations CVPR 2014 Mattis Paulin, Jerome Revaud, Zaid Harchaoui, Florent Perronnin, Cordelia Schmid

We propose a principled algorithm – Image Transformation Pursuit (ITP) – for the automatic selection of a compact set of transformations.

Classification General Classification +1

Multi-fold MIL Training for Weakly Supervised Object Localization

no code implementations CVPR 2014 Ramazan Gokberk Cinbis, Jakob Verbeek, Cordelia Schmid

In this case, the supervised information is restricted to binary labels that indicate the absence/presence of object instances in the image, without their locations.

Multiple Instance Learning Object +2

Efficient Action Localization with Approximately Normalized Fisher Vectors

no code implementations CVPR 2014 Dan Oneata, Jakob Verbeek, Cordelia Schmid

Transformation of the FV by power and L2 normalizations has been shown to significantly improve its performance, and has led to state-of-the-art results for a range of image and video classification and retrieval tasks.

Action Recognition General Classification +4

Expanded Parts Model for Human Attribute and Action Recognition in Still Images

no code implementations CVPR 2013 Gaurav Sharma, Frederic Jurie, Cordelia Schmid

We propose a new model for recognizing human attributes (e.g. wearing a suit, sitting, short hair) and actions (e.g. running, riding a horse) in still images.

Action Recognition In Still Images Attribute

Label-Embedding for Attribute-Based Classification

no code implementations CVPR 2013 Zeynep Akata, Florent Perronnin, Zaid Harchaoui, Cordelia Schmid

The label embedding framework offers other advantages such as the ability to leverage alternative sources of information in addition to attributes (e.g. class hierarchies) or to transition smoothly from zero-shot learning to learning with large quantities of data.

Attribute Classification +3

Event Retrieval in Large Video Collections with Circulant Temporal Encoding

no code implementations CVPR 2013 Jerome Revaud, Matthijs Douze, Cordelia Schmid, Herve Jegou

Furthermore, we extend product quantization to complex vectors in order to compress our descriptors, and to compare them in the compressed domain.

Copy Detection Quantization +1
