Search Results for author: Cordelia Schmid

Found 190 papers, 76 papers with code

MoReVQA: Exploring Modular Reasoning Models for Video Question Answering

no code implementations • 9 Apr 2024 • Juhong Min, Shyamal Buch, Arsha Nagrani, Minsu Cho, Cordelia Schmid

This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework.

Question Answering Video Question Answering

Learning Correlation Structures for Vision Transformers

no code implementations • 5 Apr 2024 • Manjin Kim, Paul Hongsuck Seo, Cordelia Schmid, Minsu Cho

We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query interactions of attention.

Ranked #4 on Action Recognition on Diving-48

Action Classification Action Recognition +2

SUGAR: Pre-training 3D Visual Representations for Robotics

no code implementations • 1 Apr 2024 • ShiZhe Chen, Ricardo Garcia, Ivan Laptev, Cordelia Schmid

SUGAR employs a versatile transformer-based model to jointly address five pre-training tasks, namely cross-modal knowledge distillation for semantic learning, masked point modeling to understand geometry structures, grasping pose synthesis for object affordance, 3D instance segmentation and referring expression grounding to analyze cluttered scenes.

3D Instance Segmentation 3D Object Recognition +5

Streaming Dense Video Captioning

1 code implementation • 1 Apr 2024 • Xingyi Zhou, Anurag Arnab, Shyamal Buch, Shen Yan, Austin Myers, Xuehan Xiong, Arsha Nagrani, Cordelia Schmid

An ideal model for dense video captioning -- predicting captions localized temporally in a video -- should be able to handle long input videos, predict rich, detailed textual descriptions, and be able to produce outputs before processing the entire video.

Dense Video Captioning

2,983

A Generative Approach for Wikipedia-Scale Visual Entity Recognition

2 code implementations • 4 Mar 2024 • Mathilde Caron, Ahmet Iscen, Alireza Fathi, Cordelia Schmid

In this paper, we address web-scale visual entity recognition, specifically the task of mapping a given query image to one of the 6 million existing entities in Wikipedia.

2,983

SceneCraft: An LLM Agent for Synthesizing 3D Scene as Blender Code

no code implementations • 2 Mar 2024 • Ziniu Hu, Ahmet Iscen, Aashi Jain, Thomas Kipf, Yisong Yue, David A. Ross, Cordelia Schmid, Alireza Fathi

SceneCraft first models a scene graph as a blueprint, detailing the spatial relationships among assets in the scene.

Language Modelling Large Language Model

Time-, Memory- and Parameter-Efficient Visual Adaptation

no code implementations • 5 Feb 2024 • Otniel-Bogdan Mercea, Alexey Gritsenko, Cordelia Schmid, Anurag Arnab

Here, we outperform a prior adaptor-based method which could only scale to a 1 billion parameter backbone, or fully-finetuning a smaller backbone, with the same GPU and less training time.

Video Classification

RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks

no code implementations • 11 Jan 2024 • Partha Ghosh, Soubhik Sanyal, Cordelia Schmid, Bernhard Schölkopf

To capture these dependencies, our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks developed for three-dimensional object representation and employs a singular latent code to model an entire video sequence.

Generative Adversarial Network Optical Flow Estimation +1

Pixel Aligned Language Models

no code implementations • 14 Dec 2023 • Jiarui Xu, Xingyi Zhou, Shen Yan, Xiuye Gu, Anurag Arnab, Chen Sun, Xiaolong Wang, Cordelia Schmid

When taking locations as inputs, the model performs location-conditioned captioning, which generates captions for the indicated object or region.

Language Modelling

Dense Optical Tracking: Connecting the Dots

1 code implementation • 1 Dec 2023 • Guillaume Le Moing, Jean Ponce, Cordelia Schmid

Code, data, and videos showcasing the capabilities of our approach are available on the project webpage: https://16lemoing.github.io/dot

Optical Flow Estimation Point Tracking

179

PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation

1 code implementation • 27 Sep 2023 • ShiZhe Chen, Ricardo Garcia, Cordelia Schmid, Ivan Laptev

The ability for robots to comprehend and execute manipulation tasks based on natural language instructions is a long-term goal in robotics.

Ranked #5 on Robot Manipulation on RLBench

Multi-Task Learning Robot Manipulation

VidChapters-7M: Video Chapters at Scale

no code implementations • NeurIPS 2023 • Antoine Yang, Arsha Nagrani, Ivan Laptev, Josef Sivic, Cordelia Schmid

To address this issue, we present VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total.

Dense Video Captioning Navigate

CoVR: Learning Composed Video Retrieval from Web Video Captions

1 code implementation • 28 Aug 2023 • Lucas Ventura, Antoine Yang, Cordelia Schmid, Gül Varol

Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image.

Ranked #1 on Composed Video Retrieval (CoVR) on WebVid-CoVR

Composed Video Retrieval (CoVR) Language Modelling +3

POCO: 3D Pose and Shape Estimation with Confidence

no code implementations • 24 Aug 2023 • Sai Kumar Dwivedi, Cordelia Schmid, Hongwei Yi, Michael J. Black, Dimitrios Tzionas

To address this, we develop POCO, a novel framework for training HPS regressors to estimate not only a 3D human body, but also their confidence, in a single feed-forward pass.

Action Recognition Pose Estimation +1

UnLoc: A Unified Framework for Video Localization Tasks

1 code implementation • ICCV 2023 • Shen Yan, Xuehan Xiong, Arsha Nagrani, Anurag Arnab, Zhonghao Wang, Weina Ge, David Ross, Cordelia Schmid

While large-scale image-text pretrained models such as CLIP have been used for multiple video-level tasks on trimmed videos, their use for temporal localization in untrimmed videos is still a relatively unexplored task.

Ranked #1 on Action Segmentation on COIN

Action Segmentation Moment Retrieval +5

2,983

Object Goal Navigation with Recursive Implicit Maps

no code implementations • 10 Aug 2023 • ShiZhe Chen, Thomas Chabal, Ivan Laptev, Cordelia Schmid

Object goal navigation aims to navigate an agent to locations of a given object category in unseen environments.

Navigate Object

Robust Visual Sim-to-Real Transfer for Robotic Manipulation

no code implementations • 28 Jul 2023 • Ricardo Garcia, Robin Strudel, ShiZhe Chen, Etienne Arlaud, Ivan Laptev, Cordelia Schmid

While previous work mainly evaluates DR for disembodied tasks, such as pose estimation and object detection, here we systematically explore visual domain randomization methods and benchmark them on a rich set of challenging robotic manipulation tasks.

object-detection Object Detection +1

Does Visual Pretraining Help End-to-End Reasoning?

no code implementations • NeurIPS 2023 • Chen Sun, Calvin Luo, Xingyi Zhou, Anurag Arnab, Cordelia Schmid

A positive result would refute the common belief that explicit visual abstraction (e.g., object detection) is essential for compositional generalization on visual reasoning, and confirm the feasibility of a neural network "generalist" to solve visual recognition and reasoning tasks.

Image Classification Object +3

How can objects help action recognition?

1 code implementation • CVPR 2023 • Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid

In this paper, we investigate how we can use knowledge of objects to design better video models, namely to process fewer tokens and to improve recognition accuracy.

Action Recognition Object

2,983

Dense Video Object Captioning from Disjoint Supervision

1 code implementation • 20 Jun 2023 • Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid

We propose a new task and model for dense video object captioning -- detecting, tracking and captioning trajectories of objects in a video.

Object Sentence +2

2,983

Retrieval-Enhanced Contrastive Vision-Text Models

no code implementations • 12 Jun 2023 • Ahmet Iscen, Mathilde Caron, Alireza Fathi, Cordelia Schmid

Contrastive image-text models such as CLIP form the building blocks of many state-of-the-art systems.

Ranked #3 on Fine-Grained Image Recognition on OVEN

Fine-Grained Image Recognition Retrieval

Waffling around for Performance: Visual Classification with Random Words and Broad Concepts

1 code implementation • ICCV 2023 • Karsten Roth, Jae Myung Kim, A. Sophia Koepke, Oriol Vinyals, Cordelia Schmid, Zeynep Akata

The visual classification performance of vision-language models such as CLIP has been shown to benefit from additional semantic knowledge from large language models (LLMs) such as GPT-3.

Classification Language Modelling +1

Modular Visual Question Answering via Code Generation

1 code implementation • 8 Jun 2023 • Sanjay Subramanian, Medhini Narasimhan, Kushal Khangaonkar, Kevin Yang, Arsha Nagrani, Cordelia Schmid, Andy Zeng, Trevor Darrell, Dan Klein

We present a framework that formulates visual question answering as modular code generation.

Code Generation In-Context Learning +2

Learning Video-Conditioned Policies for Unseen Manipulation Tasks

no code implementations • 10 May 2023 • Elliot Chane-Sane, Cordelia Schmid, Ivan Laptev

To encourage generalization to new tasks, we avoid particular tasks during training and learn our policy from unlabelled robot trajectories and corresponding robot videos.

Action Recognition Robot Manipulation +1

End-to-End Spatio-Temporal Action Localisation with Video Transformers

no code implementations • 24 Apr 2023 • Alexey Gritsenko, Xuehan Xiong, Josip Djolonga, Mostafa Dehghani, Chen Sun, Mario Lučić, Cordelia Schmid, Anurag Arnab

The most performant spatio-temporal action localisation models use external person proposals and complex external memory banks.

Ranked #1 on Action Recognition on AVA v2.1 (using extra training data)

Action Detection Action Recognition +1

gSDF: Geometry-Driven Signed Distance Functions for 3D Hand-Object Reconstruction

no code implementations • CVPR 2023 • Zerui Chen, ShiZhe Chen, Cordelia Schmid, Ivan Laptev

In particular, we address reconstruction of hands and manipulated objects from monocular RGB images.

Ranked #5 on hand-object pose on DexYCB

3D Reconstruction 3D Shape Reconstruction +2

Verbs in Action: Improving verb understanding in video-language models

1 code implementation • ICCV 2023 • Liliane Momeni, Mathilde Caron, Arsha Nagrani, Andrew Zisserman, Cordelia Schmid

Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time.

Ranked #11 on Zero-Shot Video Question Answer on NExT-QA

Contrastive Learning Text Matching +2

2,982

Improving Image Recognition by Retrieving from Web-Scale Image-Text Data

no code implementations • CVPR 2023 • Ahmet Iscen, Alireza Fathi, Cordelia Schmid

Retrieval augmented models are becoming increasingly popular for computer vision tasks after their recent success in NLP problems.

Ranked #1 on Image Classification on WebVision-1000 (using extra training data)

Learning with noisy labels Long-tail Learning

Exposing and Mitigating Spurious Correlations for Cross-Modal Retrieval

no code implementations • 6 Apr 2023 • Jae Myung Kim, A. Sophia Koepke, Cordelia Schmid, Zeynep Akata

In this work, we introduce ODmAP@k, an object decorrelation metric that measures a model's robustness to spurious correlations in the training data.

Cross-Modal Retrieval Object +2

Bridging the Gap between Model Explanations in Partially Annotated Multi-label Classification

2 code implementations • CVPR 2023 • Youngwook Kim, Jae Myung Kim, Jieun Jeong, Cordelia Schmid, Zeynep Akata, Jungwoo Lee

Based on these findings, we propose to boost the attribution scores of the model trained with partial labels to make its explanation resemble that of the model trained with full labels.

Classification Multi-Label Classification

AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR

no code implementations • CVPR 2023 • Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

(ii) We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively; and finally (iii) we show that our model achieves state of the art zero-shot results on three different AV-ASR benchmarks (How2, VisSpeech and Ego4D), while also crucially preserving decent performance on traditional audio-only speech recognition benchmarks (LibriSpeech).

Automatic Speech Recognition Domain Adaptation +2

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

3 code implementations • CVPR 2023 • Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, Cordelia Schmid

In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos which are readily-available at scale.

Ranked #1 on Dense Video Captioning on ActivityNet Captions (using extra training data)

Dense Video Captioning Language Modelling +1

2,982

Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation

2 code implementations • 20 Dec 2022 • Matthieu Futeral, Cordelia Schmid, Ivan Laptev, Benoît Sagot, Rachel Bawden

One of the major challenges of machine translation (MT) is ambiguity, which can in some cases be resolved by accompanying context such as images.

Multimodal Machine Translation Translation

REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory

1 code implementation • CVPR 2023 • Ziniu Hu, Ahmet Iscen, Chen Sun, ZiRui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David A. Ross, Alireza Fathi

REVEAL consists of four key components: the memory, the encoder, the retriever and the generator.

Ranked #9 on Visual Question Answering (VQA) on OK-VQA

Image Captioning Language Modelling +4

2,982

Audiovisual Masked Autoencoders

2 code implementations • ICCV 2023 • Mariana-Iuliana Georgescu, Eduardo Fonseca, Radu Tudor Ionescu, Mario Lucic, Cordelia Schmid, Anurag Arnab

Can we leverage the audiovisual information already present in video to improve self-supervised representation learning?

Ranked #1 on Audio Classification on EPIC-KITCHENS-100 (using extra training data)

Audio Classification Representation Learning

2,983

Location-Aware Self-Supervised Transformers for Semantic Segmentation

1 code implementation • 5 Dec 2022 • Mathilde Caron, Neil Houlsby, Cordelia Schmid

Pixel-level labels are particularly expensive to acquire.

Contrastive Learning Image Classification +2

2,983

WALDO: Future Video Synthesis using Object Layer Decomposition and Parametric Flow Prediction

1 code implementation • ICCV 2023 • Guillaume Le Moing, Jean Ponce, Cordelia Schmid

This paper presents WALDO (WArping Layer-Decomposed Objects), a novel approach to the prediction of future video frames from past ones.

SSIM

AVATAR submission to the Ego4D AV Transcription Challenge

no code implementations • 18 Nov 2022 • Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

In this report, we describe our submission to the Ego4D AudioVisual (AV) Speech Transcription Challenge 2022.

Language Conditioned Spatial Relation Reasoning for 3D Object Grounding

1 code implementation • 17 Nov 2022 • ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

In this work we propose a language-conditioned transformer model for grounding 3D objects and their spatial relations.

Object Relation

Learning Reward Functions for Robotic Manipulation by Observing Humans

no code implementations • 16 Nov 2022 • Minttu Alakuijala, Gabriel Dulac-Arnold, Julien Mairal, Jean Ponce, Cordelia Schmid

Unlike prior work on leveraging human videos to teach robots, our method, Human Offline Learned Distances (HOLD) requires neither a priori data from the robot environment, nor a set of task-specific human demonstrations, nor a predefined notion of correspondence across morphologies, yet it is able to accelerate training of several manipulation tasks on a simulated robot arm compared to using only a sparse reward obtained from task completion.

Contrastive Learning

A Memory Transformer Network for Incremental Learning

no code implementations • 10 Oct 2022 • Ahmet Iscen, Thomas Bird, Mathilde Caron, Alireza Fathi, Cordelia Schmid

We study class-incremental learning, a training setup in which new classes of data are observed over time for the model to learn from.

Class Incremental Learning Incremental Learning

Enforcing the consensus between Trajectory Optimization and Policy Learning for precise robot control

no code implementations • 19 Sep 2022 • Quentin Le Lidec, Wilson Jallet, Ivan Laptev, Cordelia Schmid, Justin Carpentier

Reinforcement learning (RL) and trajectory optimization (TO) present strong complementary advantages.

Reinforcement Learning (RL) valid

Instruction-driven history-aware policies for robotic manipulations

2 code implementations • 11 Sep 2022 • Pierre-Louis Guhur, ShiZhe Chen, Ricardo Garcia, Makarand Tapaswi, Ivan Laptev, Cordelia Schmid

In human environments, robots are expected to accomplish a variety of manipulation tasks given simple natural language instructions.

Ranked #2 on Robot Manipulation on RLBench (Succ. Rate (10 tasks, 100 demos/task) metric)

Robot Manipulation

Learning from Unlabeled 3D Environments for Vision-and-Language Navigation

1 code implementation • 24 Aug 2022 • ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

Our resulting HM3D-AutoVLN dataset is an order of magnitude larger than existing VLN datasets in terms of navigation environments and instructions.

Ranked #1 on Visual Navigation on SOON Test

Language Modelling Navigate +3

TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency

no code implementations • 14 Aug 2022 • Medhini Narasimhan, Arsha Nagrani, Chen Sun, Michael Rubinstein, Trevor Darrell, Anna Rohrbach, Cordelia Schmid

In this work, we focus on summarizing instructional videos, an under-explored area of video summarization.

Video Summarization

AlignSDF: Pose-Aligned Signed Distance Fields for Hand-Object Reconstruction

1 code implementation • 26 Jul 2022 • Zerui Chen, Yana Hasson, Cordelia Schmid, Ivan Laptev

We show that such aligned SDFs better focus on reconstructing shape details and improve reconstruction accuracy both for hands and objects.

Ranked #9 on hand-object pose on DexYCB

hand-object pose Object Reconstruction

Beyond Transfer Learning: Co-finetuning for Action Localisation

no code implementations • 8 Jul 2022 • Anurag Arnab, Xuehan Xiong, Alexey Gritsenko, Rob Romijnders, Josip Djolonga, Mostafa Dehghani, Chen Sun, Mario Lučić, Cordelia Schmid

Transfer learning is the predominant paradigm for training deep networks on small target datasets.

Transfer Learning Video Classification

M&M Mix: A Multimodal Multiview Transformer Ensemble

no code implementations • 20 Jun 2022 • Xuehan Xiong, Anurag Arnab, Arsha Nagrani, Cordelia Schmid

This report describes the approach behind our winning solution to the 2022 Epic-Kitchens Action Recognition Challenge.

Ranked #2 on Action Recognition on EPIC-KITCHENS-100 (using extra training data)

Action Recognition Video Recognition

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

2 code implementations • 16 Jun 2022 • Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

Manual annotation of question and answers for videos, however, is tedious and prohibits scalability.

Ranked #1 on Zero-Shot Video Question Answer on TVQA

Fill Mask Language Modelling +6

142

AVATAR: Unconstrained Audiovisual Speech Recognition

1 code implementation • 15 Jun 2022 • Valentin Gabeur, Paul Hongsuck Seo, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid

Audio-visual automatic speech recognition (AV-ASR) is an extension of ASR that incorporates visual cues, often from the movements of a speaker's mouth.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

119

Learning to Answer Visual Questions from Web Videos

1 code implementation • 10 May 2022 • Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

We use our method to generate the WebVidVQA3M dataset from the WebVid dataset, i.e., videos with alt-text annotations, and show its benefits for training VideoQA models.

Question Answering Question Generation +4

113

Weakly-supervised segmentation of referring expressions

no code implementations • 10 May 2022 • Robin Strudel, Ivan Laptev, Cordelia Schmid

Visual grounding localizes regions (boxes or segments) in the image corresponding to given referring expressions.

Image Segmentation Referring Expression +5

Assembly Planning from Observations under Physical Constraints

no code implementations • 20 Apr 2022 • Thomas Chabal, Robin Strudel, Etienne Arlaud, Jean Ponce, Cordelia Schmid

This paper addresses the problem of copying an unknown assembly of primitives with known shape and appearance using information extracted from a single photograph by an off-the-shelf procedure for object detection and pose estimation.

Object object-detection +2

Learning Audio-Video Modalities from Image Captions

no code implementations • 1 Apr 2022 • Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, Cordelia Schmid

To close this gap we propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort.

Ranked #6 on Zero-shot Text to Audio Retrieval on AudioCaps

Image Captioning Retrieval +4

TubeDETR: Spatio-Temporal Video Grounding with Transformers

1 code implementation • CVPR 2022 • Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query.

Ranked #2 on Spatio-Temporal Video Grounding on VidSTG

Language-Based Temporal Localization Natural Language Visual Grounding +5

155

The Right Spin: Learning Object Motion from Rotation-Compensated Flow Fields

no code implementations • 28 Feb 2022 • Pia Bideau, Erik Learned-Miller, Cordelia Schmid, Karteek Alahari

In this work, we argue that the coupling of camera rotation and camera translation can create complex motion fields that are difficult for a deep network to untangle directly.

Motion Segmentation

Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation

1 code implementation • CVPR 2022 • ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

To balance the complexity of large action space reasoning and fine-grained language grounding, we dynamically combine a fine-scale encoding over local observations and a coarse-scale encoding on a global map via graph transformers.

Ranked #4 on Visual Navigation on SOON Test

Efficient Exploration Navigate +2

Learning with Neighbor Consistency for Noisy Labels

1 code implementation • CVPR 2022 • Ahmet Iscen, Jack Valmadre, Anurag Arnab, Cordelia Schmid

Recent advances in deep learning have relied on large, labelled datasets to train high-capacity models.

Ranked #1 on Image Classification on Red MiniImageNet 80% label noise

Learning with noisy labels

2,983

End-to-end Generative Pretraining for Multimodal Video Captioning

no code implementations • CVPR 2022 • Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, Cordelia Schmid

Recent video and language pretraining frameworks lack the ability to generate sentences.

Ranked #13 on Video Captioning on MSR-VTT (using extra training data)

Action Classification Retrieval +4

Multiview Transformers for Video Recognition

1 code implementation • CVPR 2022 • Shen Yan, Xuehan Xiong, Anurag Arnab, Zhichao Lu, Mi Zhang, Chen Sun, Cordelia Schmid

Video understanding requires reasoning at multiple spatiotemporal resolutions -- from short fine-grained motions to events taking place over longer durations.

Ranked #5 on Action Recognition on EPIC-KITCHENS-100 (using extra training data)

Action Classification Action Recognition +1

2,983

Masking Modalities for Cross-modal Video Retrieval

no code implementations • 1 Nov 2021 • Valentin Gabeur, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid

Our proposal is to pre-train a video encoder using all the available video modalities as supervision, namely, appearance, sound, and transcribed speech.

Retrieval Video Retrieval

History Aware Multimodal Transformer for Vision-and-Language Navigation

1 code implementation • NeurIPS 2021 • ShiZhe Chen, Pierre-Louis Guhur, Cordelia Schmid, Ivan Laptev

Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes.

Ranked #3 on Vision and Language Navigation on RxR

Decision Making Navigate +2

Differentiable Rendering with Perturbed Optimizers

no code implementations • NeurIPS 2021 • Quentin Le Lidec, Ivan Laptev, Cordelia Schmid, Justin Carpentier

Notably, images depend both on the properties of observed scenes and on the process of image formation.

3D Scene Reconstruction 6D Pose Estimation

Variational Perturbations for Visual Feature Attribution

no code implementations • 29 Sep 2021 • Jae Myung Kim, Eunji Kim, Sungroh Yoon, Jungwoo Lee, Cordelia Schmid, Zeynep Akata

Explaining a complex black-box system in a post-hoc manner is important to understand its predictions.

Airbert: In-domain Pretraining for Vision-and-Language Navigation

2 code implementations • ICCV 2021 • Pierre-Louis Guhur, Makarand Tapaswi, ShiZhe Chen, Ivan Laptev, Cordelia Schmid

Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging.

Ranked #3 on Vision and Language Navigation on VLN Challenge

Navigate Referring Expression +1

Towards unconstrained joint hand-object reconstruction from RGB videos

1 code implementation • 16 Aug 2021 • Yana Hasson, Gül Varol, Ivan Laptev, Cordelia Schmid

Our work aims to obtain 3D reconstruction of hands and manipulated objects from monocular videos.

Ranked #5 on hand-object pose on HO-3D

3D Reconstruction hand-object pose +6

CCVS: Context-aware Controllable Video Synthesis

1 code implementation • NeurIPS 2021 • Guillaume Le Moing, Jean Ponce, Cordelia Schmid

The prediction model is doubly autoregressive, in the latent space of an autoencoder for forecasting, and in image space for updating contextual information, which is also used to enforce spatio-temporal consistency through a learnable optical flow module.

Ranked #8 on Video Generation on BAIR Robot Pushing

Optical Flow Estimation Self-Supervised Learning +2

Goal-Conditioned Reinforcement Learning with Imagined Subgoals

no code implementations • 1 Jul 2021 • Elliot Chane-Sane, Cordelia Schmid, Ivan Laptev

Goal-conditioned reinforcement learning endows an agent with a large variety of skills, but it often struggles to solve tasks that require more temporally extended reasoning.

reinforcement-learning Reinforcement Learning (RL)

Attention Bottlenecks for Multimodal Fusion

1 code implementation • NeurIPS 2021 • Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun

Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio.

Ranked #2 on Action Classification on Kinetics-Sounds

Action Classification Action Recognition +2

2,982

HDMapGen: A Hierarchical Graph Generative Model of High Definition Maps

no code implementations • CVPR 2021 • Lu Mi, Hang Zhao, Charlie Nash, Xiaohan Jin, Jiyang Gao, Chen Sun, Cordelia Schmid, Nir Shavit, Yuning Chai, Dragomir Anguelov

To address this issue, we introduce a new challenging task to generate HD maps.

Graph Generation Motion Forecasting +1

Residual Reinforcement Learning from Demonstrations

no code implementations • 15 Jun 2021 • Minttu Alakuijala, Gabriel Dulac-Arnold, Julien Mairal, Jean Ponce, Cordelia Schmid

Residual reinforcement learning (RL) has been proposed as a way to solve challenging robotic tasks by adapting control actions from a conventional feedback controller to maximize a reward signal.

reinforcement-learning Reinforcement Learning (RL)

Large-Scale Unsupervised Object Discovery

1 code implementation • NeurIPS 2021 • Huy V. Vo, Elena Sizikova, Cordelia Schmid, Patrick Pérez, Jean Ponce

Extensive experiments on COCO and OpenImages show that, in the single-object discovery setting where a single prominent object is sought in each image, the proposed LOD (Large-scale Object Discovery) approach is on par with, or better than the state of the art for medium-scale datasets (up to 120K images), and over 37% better than the only other algorithms capable of scaling up to 1.7M images.

Multi-object discovery Object +2

Episodic Transformer for Vision-and-Language Navigation

1 code implementation • ICCV 2021 • Alexander Pashevich, Cordelia Schmid, Chen Sun

We demonstrate that encoding the history with a transformer is critical to solve compositional tasks, and that pretraining and joint training with synthetic instructions further improve the performance.

Vision and Language Navigation

Segmenter: Transformer for Semantic Segmentation

7 code implementations • ICCV 2021 • Robin Strudel, Ricardo Garcia, Ivan Laptev, Cordelia Schmid

In this paper we introduce Segmenter, a transformer model for semantic segmentation.

Ranked #15 on Semantic Segmentation on PASCAL Context

Image Classification Image Segmentation +3

8,218

Class-Balanced Distillation for Long-Tailed Visual Recognition

3 code implementations • 12 Apr 2021 • Ahmet Iscen, André Araujo, Boqing Gong, Cordelia Schmid

An effective and simple approach to long-tailed visual recognition is to learn feature representations and a classifier separately, with instance and class-balanced sampling, respectively.

Ranked #11 on Long-tail Learning on iNaturalist 2018

Image Classification Knowledge Distillation +1

32,736

Local Metrics for Multi-Object Tracking

1 code implementation • 6 Apr 2021 • Jack Valmadre, Alex Bewley, Jonathan Huang, Chen Sun, Cristian Sminchisescu, Cordelia Schmid

This paper introduces temporally local metrics for Multi-Object Tracking.

Multi-Object Tracking Object

Composable Augmentation Encoding for Video Representation Learning

no code implementations • ICCV 2021 • Chen Sun, Arsha Nagrani, Yonglong Tian, Cordelia Schmid

We focus on contrastive methods for self-supervised video representation learning.

Action Recognition Contrastive Learning +2

Paper
Add Code

Improving robustness against common corruptions with frequency biased models

no code implementations • ICCV 2021 • Tonmoy Saikia, Cordelia Schmid, Thomas Brox

CNNs perform remarkably well when the training and test distributions are i.i.d., but unseen image corruptions can cause a surprisingly large drop in performance.

Data Augmentation object-detection +1

Paper
Add Code

Unified Graph Structured Models for Video Understanding

no code implementations • ICCV 2021 • Anurag Arnab, Chen Sun, Cordelia Schmid

Accurate video understanding involves reasoning about the relationships between actors, objects and their environment, often over long temporal intervals.

Action Detection Graph Classification +3

Paper
Add Code

ViViT: A Video Vision Transformer

8 code implementations • ICCV 2021 • Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid

We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification.

Ranked #8 on Action Classification on MiT (Top 5 Accuracy metric, using extra training data)

Action Classification Action Recognition +4

2,983

Paper
Code

Learning Temporal Dynamics from Cycles in Narrated Video

no code implementations • ICCV 2021 • Dave Epstein, Jiajun Wu, Cordelia Schmid, Chen Sun

Learning to model how the world changes as time elapses has proven a challenging problem for the computer vision community.

Paper
Add Code

Image Matching with Scale Adjustment

no code implementations • 10 Dec 2020 • Yves Dufournaud, Cordelia Schmid, Radu Horaud

In this paper we address the problem of matching two images with two different resolutions: a high-resolution image and a low-resolution one.

Paper
Add Code

Look Before you Speak: Visually Contextualized Utterances

no code implementations • CVPR 2021 • Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

Leveraging recent advances in multimodal learning, our model consists of a novel co-attentional multimodal video transformer, and when trained on both textual and visual context, outperforms baselines that use textual inputs alone.

Paper
Add Code

Just Ask: Learning to Answer Questions from Millions of Narrated Videos

1 code implementation • ICCV 2021 • Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision.

Ranked #1 on Video Question Answering on VideoQA

Question Answering Question Generation +4

113

Paper
Code

Learning Obstacle Representations for Neural Motion Planning

1 code implementation • 25 Aug 2020 • Robin Strudel, Ricardo Garcia, Justin Carpentier, Jean-Paul Laumond, Ivan Laptev, Cordelia Schmid

Motion planning and obstacle avoidance is a key challenge in robotics applications.

Robotics

Paper
Code

TNT: Target-driveN Trajectory Prediction

4 code implementations • 19 Aug 2020 • Hang Zhao, Jiyang Gao, Tian Lan, Chen Sun, Benjamin Sapp, Balakrishnan Varadarajan, Yue Shen, Yi Shen, Yuning Chai, Cordelia Schmid, Cong-Cong Li, Dragomir Anguelov

Our key insight is that for prediction within a moderate time horizon, the future modes can be effectively captured by a set of target states.

Ranked #2 on Trajectory Prediction on INTERACTION Dataset - Validation

Motion Forecasting Trajectory Prediction

459

Paper
Code

The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020)

1 code implementation • 3 Aug 2020 • Samuel Albanie, Yang Liu, Arsha Nagrani, Antoine Miech, Ernesto Coto, Ivan Laptev, Rahul Sukthankar, Bernard Ghanem, Andrew Zisserman, Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid, Shi-Zhe Chen, Yida Zhao, Qin Jin, Kaixu Cui, Hui Liu, Chen Wang, Yudong Jiang, Xiaoshuai Hao

This report summarizes the results of the first edition of the challenge together with the findings of the participants.

Natural Language Queries Retrieval +3

327

Paper
Code

Learning Video Representations from Textual Web Supervision

no code implementations • 29 Jul 2020 • Jonathan C. Stroud, Zhichao Lu, Chen Sun, Jia Deng, Rahul Sukthankar, Cordelia Schmid, David A. Ross

Based on this observation, we propose to use text as a method for learning video representations.

Action Recognition Representation Learning

Paper
Add Code

Multi-modal Transformer for Video Retrieval

1 code implementation • ECCV 2020 • Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid

In this paper, we present a multi-modal transformer to jointly encode the different modalities in video, which allows each of them to attend to the others.

Ranked #1 on Zero-Shot Video Retrieval on MSR-VTT (text-to-video Mean Rank metric, using extra training data)

Natural Language Queries Retrieval +2

249

Paper
Code

Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos

no code implementations • ECCV 2020 • Anurag Arnab, Chen Sun, Arsha Nagrani, Cordelia Schmid

Despite the recent advances in video classification, progress in spatio-temporal action recognition has lagged behind.

Action Detection Action Recognition +2

Paper
Add Code

Unsupervised Learning of Video Representations via Dense Trajectory Clustering

1 code implementation • 28 Jun 2020 • Pavel Tokmakov, Martial Hebert, Cordelia Schmid

This paper addresses the task of unsupervised learning of representations for action recognition in videos.

Action Recognition In Videos Clustering +3

Paper
Code

Consistency Guided Scene Flow Estimation

no code implementations • ECCV 2020 • Yuhua Chen, Luc van Gool, Cordelia Schmid, Cristian Sminchisescu

To handle inherent modeling error in the consistency loss (e.g. Lambertian assumptions) and for better generalization, we further introduce a learned, output refinement network, which takes the initial predictions, the loss, and the gradient as input, and efficiently predicts a correlated output update.

Scene Flow Estimation

Paper
Add Code

TAO: A Large-Scale Benchmark for Tracking Any Object

no code implementations • ECCV 2020 • Achal Dave, Tarasha Khurana, Pavel Tokmakov, Cordelia Schmid, Deva Ramanan

To this end, we ask annotators to label objects that move at any point in the video, and give names to them post factum.

Multi-Object Tracking Object +2

Paper
Add Code

What Makes for Good Views for Contrastive Learning?

1 code implementation • NeurIPS 2020 • Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, Phillip Isola

Contrastive learning between multiple views of the data has recently achieved state of the art performance in the field of self-supervised representation learning.

Ranked #2 on Contrastive Learning on imagenet-1k

Contrastive Learning Data Augmentation +8

1,908

Paper
Code

VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation

3 code implementations • CVPR 2020 • Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, Cong-Cong Li, Cordelia Schmid

Behavior prediction in dynamic, multi-agent systems is an important problem in the context of self-driving cars, due to the complex representations and interactions of road components, including moving agents (e.g. pedestrians and vehicles) and road context information (e.g. lanes, traffic lights).

Self-Driving Cars

459

Paper
Code

Leveraging Photometric Consistency over Time for Sparsely Supervised Hand-Object Reconstruction

no code implementations • CVPR 2020 • Yana Hasson, Bugra Tekin, Federica Bogo, Ivan Laptev, Marc Pollefeys, Cordelia Schmid

Modeling hand-object manipulations is essential for understanding how humans interact with their environment.

Ranked #9 on hand-object pose on HO-3D

hand-object pose Object +3

Paper
Add Code

Learning visual policies for building 3D shape categories

no code implementations • 15 Apr 2020 • Alexander Pashevich, Igor Kalevatykh, Ivan Laptev, Cordelia Schmid

We then show the success of our visual policies for building arches from different primitives.

Object

Paper
Add Code

Memory-Efficient Incremental Learning Through Feature Adaptation

no code implementations • ECCV 2020 • Ahmet Iscen, Jeffrey Zhang, Svetlana Lazebnik, Cordelia Schmid

We assume that the model is updated incrementally for new classes as new data becomes available sequentially. This requires adapting the previously stored feature vectors to the updated feature space without having access to the corresponding original training images.

Incremental Learning

Paper
Add Code

Speech2Action: Cross-modal Supervision for Action Recognition

no code implementations • CVPR 2020 • Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman

We train a BERT-based Speech2Action classifier on over a thousand movie screenplays, to predict action labels from transcribed speech segments.

Action Recognition

Paper
Add Code

Selecting Relevant Features from a Multi-domain Representation for Few-shot Classification

1 code implementation • ECCV 2020 • Nikita Dvornik, Cordelia Schmid, Julien Mairal

Popular approaches for few-shot classification consist of first learning a generic data representation based on a large annotated dataset, before adapting the representation to new classes given only a few labeled samples.

Ranked #4 on Few-Shot Image Classification on Meta-Dataset Rank

feature selection Few-Shot Image Classification +2

Paper
Code

Beyond the Camera: Neural Networks in World Coordinates

no code implementations • 12 Mar 2020 • Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Karteek Alahari

Eye movement and strategic placement of the visual field onto the retina give animals increased resolution of the scene and suppress distracting information.

Action Recognition Video Stabilization +1

Paper
Add Code

Radioactive data: tracing through training

2 code implementations • ICML 2020 • Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Hervé Jégou

The mark is robust to strong variations such as different architectures or optimization methods.

Data Augmentation Data Poisoning

Paper
Code

Optimized Generic Feature Learning for Few-shot Classification across Domains

no code implementations • 22 Jan 2020 • Tonmoy Saikia, Thomas Brox, Cordelia Schmid

To learn models or features that generalize across tasks and domains is one of the grand goals of machine learning.

BIG-bench Machine Learning Classification +3

Paper
Add Code

Synthetic Humans for Action Recognition from Unseen Viewpoints

1 code implementation • 9 Dec 2019 • Gül Varol, Ivan Laptev, Cordelia Schmid, Andrew Zisserman

Although synthetic training data has been shown to be beneficial for tasks such as human pose estimation, its use for RGB human action recognition is relatively unexplored.

Action Classification Action Recognition +2

Paper
Code

Learning to Track Any Object

no code implementations • 25 Oct 2019 • Achal Dave, Pavel Tokmakov, Cordelia Schmid, Deva Ramanan

Moreover, at test time the same network can be applied to detection and tracking, resulting in a unified approach for the two tasks.

Instance Segmentation Object +5

Paper
Add Code

Graph convolutional networks for learning with few clean and many noisy labels

1 code implementation • ECCV 2020 • Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, Ondrej Chum, Cordelia Schmid

In this work we consider the problem of learning a classifier from noisy labels when a few clean labeled examples are given.

Few-Shot Learning General Classification

Paper
Code

White-box vs Black-box: Bayes Optimal Strategies for Membership Inference

no code implementations • 29 Aug 2019 • Alexandre Sablayrolles, Matthijs Douze, Yann Ollivier, Cordelia Schmid, Hervé Jégou

Membership inference determines, given a sample and trained parameters of a machine learning model, whether the sample was part of the training set.

Paper
Add Code

Learning to combine primitive skills: A step towards versatile robotic manipulation

1 code implementation • 2 Aug 2019 • Robin Strudel, Alexander Pashevich, Igor Kalevatykh, Ivan Laptev, Josef Sivic, Cordelia Schmid

Manipulation tasks such as preparing a meal or assembling furniture remain highly challenging for robotics and vision.

Data Augmentation Imitation Learning +4

Paper
Code

Moulding Humans: Non-parametric 3D Human Shape Estimation from Single Images

no code implementations • ICCV 2019 • Valentin Gabeur, Jean-Sebastien Franco, Xavier Martin, Cordelia Schmid, Gregory Rogez

In this paper, we tackle the problem of 3D human shape estimation from single RGB images.

3D Human Pose Estimation 3D Human Shape Estimation

Paper
Add Code

Self-supervised Learning with Geometric Constraints in Monocular Video: Connecting Flow, Depth, and Camera

no code implementations • ICCV 2019 • Yuhua Chen, Cordelia Schmid, Cristian Sminchisescu

We present GLNet, a self-supervised framework for learning depth, optical flow, camera pose and intrinsic parameters from monocular video - addressing the difficulty of acquiring realistic ground-truth for such tasks.

Monocular Depth Estimation Optical Flow Estimation +3

Paper
Add Code

Learning Video Representations using Contrastive Bidirectional Transformer

no code implementations • 13 Jun 2019 • Chen Sun, Fabien Baradel, Kevin Murphy, Cordelia Schmid

This paper proposes a self-supervised learning approach for video features that results in significantly improved performance on downstream tasks (such as video classification, captioning and segmentation) compared to existing methods.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

Paper
Add Code

A Study on Action Detection in the Wild

no code implementations • 29 Apr 2019 • Yubo Zhang, Pavel Tokmakov, Martial Hebert, Cordelia Schmid

In this work we study the problem of action detection in a highly-imbalanced dataset.

Action Detection

Paper
Add Code

Learning joint reconstruction of hands and manipulated objects

3 code implementations • CVPR 2019 • Yana Hasson, Gül Varol, Dimitrios Tzionas, Igor Kalevatykh, Michael J. Black, Ivan Laptev, Cordelia Schmid

Previous work has made significant progress towards reconstruction of hand poses and object shapes in isolation.

Ranked #7 on hand-object pose on DexYCB

Hand Joint Reconstruction hand-object pose +2

559

Paper
Code

Relational Action Forecasting

no code implementations • CVPR 2019 • Chen Sun, Abhinav Shrivastava, Carl Vondrick, Rahul Sukthankar, Kevin Murphy, Cordelia Schmid

This paper focuses on multi-person action forecasting in videos.

Action Classification Action Recognition +1

Paper
Add Code

VideoBERT: A Joint Model for Video and Language Representation Learning

3 code implementations • ICCV 2019 • Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, Cordelia Schmid

Self-supervised learning has become increasingly important to leverage the abundance of unlabeled data available on platforms like YouTube.

Ranked #1 on Action Classification on YouCook2

Action Classification General Classification +7

113

Paper
Code

Diversity with Cooperation: Ensemble Methods for Few-Shot Classification

1 code implementation • ICCV 2019 • Nikita Dvornik, Cordelia Schmid, Julien Mairal

Few-shot classification consists of learning a predictive model that is able to effectively adapt to a new class, given only a few annotated samples.

Ranked #13 on Few-Shot Image Classification on Mini-ImageNet - 1-Shot Learning

Classification Few-Shot Image Classification +2

Paper
Code

Learning to Augment Synthetic Images for Sim2Real Policy Transfer

1 code implementation • 18 Mar 2019 • Alexander Pashevich, Robin Strudel, Igor Kalevatykh, Ivan Laptev, Cordelia Schmid

Policies learned in simulators, however, do not transfer well to real scenes given the domain gap between real and synthetic data.

Object Localization

Paper
Code

AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection

1 code implementation • 5 Jan 2019 • Joseph Roth, Sourish Chaudhuri, Ondrej Klejch, Radhika Marvin, Andrew Gallagher, Liat Kaver, Sharadh Ramaswamy, Arkadiusz Stopczynski, Cordelia Schmid, Zhonghua Xi, Caroline Pantofaru

The dataset contains temporally labeled face tracks in video, where each face instance is labeled as speaking or not, and whether the speech is audible.

Audio-Visual Active Speaker Detection speaker-diarization +2

Paper
Code

Adaptive Density Estimation for Generative Models

no code implementations • NeurIPS 2019 • Thomas Lucas, Konstantin Shmelkov, Karteek Alahari, Cordelia Schmid, Jakob Verbeek

We show that our model significantly improves over existing hybrid models: offering GAN-like samples, IS and FID scores that are competitive with fully adversarial models, and improved likelihood scores.

Density Estimation

Paper
Add Code

Detecting unseen visual relations using analogies

no code implementations • ICCV 2019 • Julia Peyre, Ivan Laptev, Cordelia Schmid, Josef Sivic

We seek to detect visual relations in images of the form of triplets t = (subject, predicate, object), such as "person riding dog", where training examples of the individual entities are available but their combinations are unseen at training.

Retrieval

Paper
Add Code

A Structured Model For Action Detection

no code implementations • CVPR 2019 • Yubo Zhang, Pavel Tokmakov, Martial Hebert, Cordelia Schmid

A dominant paradigm for learning-based approaches in computer vision is training generic models, such as ResNet for image recognition, or I3D for video understanding, on large datasets and allowing them to discover the optimal representation for the problem at hand.

Action Detection Video Understanding

Paper
Add Code

Modulated Policy Hierarchies

no code implementations • 30 Nov 2018 • Alexander Pashevich, Danijar Hafner, James Davidson, Rahul Sukthankar, Cordelia Schmid

To achieve this, we study different modulation signals and exploration for hierarchical controllers.

Reinforcement Learning (RL)

Paper
Add Code

Coverage and Quality Driven Training of Generative Image Models

no code implementations • 27 Sep 2018 • Thomas Lucas, Konstantin Shmelkov, Karteek Alahari, Cordelia Schmid, Jakob Verbeek

First, we propose a model that extends variational autoencoders by using deterministic invertible transformation layers to map samples from the decoder to the image space.

Paper
Add Code

Déjà Vu: an empirical evaluation of the memorization properties of ConvNets

no code implementations • ICLR 2019 • Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Hervé Jégou

Convolutional neural networks memorize part of their training data, which is why strategies such as data augmentation and drop-out are employed to mitigate overfitting.

Data Augmentation Memorization

Paper
Add Code

On the Importance of Visual Context for Data Augmentation in Scene Understanding

no code implementations • 6 Sep 2018 • Nikita Dvornik, Julien Mairal, Cordelia Schmid

In this work, we consider object detection, semantic and instance segmentation and augment the training images by blending objects in existing scenes, using instance segmentation annotations.

Data Augmentation Instance Segmentation +7

Paper
Add Code

Actor-Centric Relation Network

1 code implementation • ECCV 2018 • Chen Sun, Abhinav Shrivastava, Carl Vondrick, Kevin Murphy, Rahul Sukthankar, Cordelia Schmid

A visualization of the learned relation features confirms that our approach is able to attend to the relevant relations for each action.

Ranked #15 on Action Recognition on AVA v2.1

Action Classification Action Detection +5

3,865

Paper
Code

End-to-End Incremental Learning

5 code implementations • ECCV 2018 • Francisco M. Castro, Manuel J. Marín-Jiménez, Nicolás Guil, Cordelia Schmid, Karteek Alahari

Although deep learning approaches have stood out in recent years due to their state-of-the-art results, they continue to suffer from catastrophic forgetting, a dramatic decrease in overall performance when training with new classes added incrementally.

Ranked #2 on Incremental Learning on ImageNet100 - 10 steps (# M Params metric)

Image Classification Incremental Learning

494

Paper
Code

How good is my GAN?

no code implementations • ECCV 2018 • Konstantin Shmelkov, Cordelia Schmid, Karteek Alahari

Generative adversarial networks (GANs) are one of the most popular methods for generating images today.

General Classification Image Classification

Paper
Add Code

Modeling Visual Context is Key to Augmenting Object Detection Datasets

2 code implementations • ECCV 2018 • Nikita Dvornik, Julien Mairal, Cordelia Schmid

For this approach to be successful, we show that modeling appropriately the visual context surrounding objects is crucial to place them in the right environment.

Data Augmentation object-detection +1

117

Paper
Code

A flexible model for training action localization with varying levels of supervision

1 code implementation • NeurIPS 2018 • Guilhem Chéron, Jean-Baptiste Alayrac, Ivan Laptev, Cordelia Schmid

Our model is based on discriminative clustering and integrates different types of supervision as constraints on the optimization.

Action Detection Action Localization +1

Paper
Code

Modeling Spatio-Temporal Human Track Structure for Action Localization

no code implementations • 28 Jun 2018 • Guilhem Chéron, Anton Osokin, Ivan Laptev, Cordelia Schmid

In order to localize actions in time, we propose a recurrent localization network (RecLNet) designed to model the temporal structure of actions on the level of person tracks.

Human Detection Optical Flow Estimation +3

Paper
Add Code

Spreading vectors for similarity search

2 code implementations • ICLR 2019 • Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Hervé Jégou

Discretizing multi-dimensional data distributions is a fundamental step of modern indexing methods.

Quantization

317

Paper
Code

PoTion: Pose MoTion Representation for Action Recognition

We use the human joints as these keypoints and term our Pose moTion representation PoTion.

Ranked #1 on Skeleton Based Action Recognition on J-HMDB

Action Recognition Skeleton Based Action Recognition +1

Paper
Add Code

Unsupervised Learning of Artistic Styles with Archetypal Style Analysis

no code implementations • NeurIPS 2018 • Daan Wynen, Cordelia Schmid, Julien Mairal

In this paper, we introduce an unsupervised learning approach to automatically discover, summarize, and manipulate artistic styles from large collections of paintings.

Paper
Add Code

Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos

no code implementations • 25 Apr 2018 • Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, Karteek Alahari

In this paper we describe the egocentric aspect of the dataset and present annotations for Charades-Ego with 68,536 activity instances in 68.8 hours of first and third-person video, making it one of the largest and most diverse egocentric datasets available.

General Classification Video Classification +1

Paper
Add Code

Actor and Observer: Joint Modeling of First and Third-Person Videos

1 code implementation • CVPR 2018 • Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, Karteek Alahari

Several theories in cognitive neuroscience suggest that when people interact with the world, or simulate interactions, they do so from a first-person egocentric perspective, and seamlessly transfer knowledge between third-person (observer) and first-person (actor).

Action Recognition Temporal Action Localization

Paper
Code

BodyNet: Volumetric Inference of 3D Human Body Shapes

2 code implementations • ECCV 2018 • Gül Varol, Duygu Ceylan, Bryan Russell, Jimei Yang, Ersin Yumer, Ivan Laptev, Cordelia Schmid

Human shape estimation is an important task for video editing, animation and the fashion industry.

Ranked #3 on 3D Human Pose Estimation on Surreal (using extra training data)

3D Human Pose Estimation Segmentation +1

261

Paper
Code

LCR-Net++: Multi-person 2D and 3D Pose Detection in Natural Images

no code implementations • 1 Mar 2018 • Gregory Rogez, Philippe Weinzaepfel, Cordelia Schmid

We propose an end-to-end architecture for joint 2D and 3D human pose estimation in natural images.

3D Human Pose Estimation 3D Multi-Person Pose Estimation (absolute) +1

Paper
Add Code

Image-based Synthesis for Deep 3D Human Pose Estimation

no code implementations • 12 Feb 2018 • Grégory Rogez, Cordelia Schmid

Here, we propose a solution to generate a large set of photorealistic synthetic images of humans with 3D pose annotations.

3D Human Pose Estimation 3D Pose Estimation +1

Paper
Add Code

Learning to Segment Moving Objects

no code implementations • 1 Dec 2017 • Pavel Tokmakov, Cordelia Schmid, Karteek Alahari

We formulate this as a learning problem and design our framework with three cues: (i) independent object motion between a pair of frames, which complements object recognition, (ii) object appearance, which helps to correct errors in motion estimation, and (iii) temporal consistency, which imposes additional constraints on the segmentation.

Motion Estimation Motion Segmentation +4

Paper
Add Code

Joint Learning of Object and Action Detectors

no code implementations • ICCV 2017 • Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, Cordelia Schmid

dog and cat jumping, enabling to detect actions of an object without training with these object-actions pairs.

Action Detection Object +1

Paper
Add Code

Incremental Learning of Object Detectors without Catastrophic Forgetting

3 code implementations • ICCV 2017 • Konstantin Shmelkov, Cordelia Schmid, Karteek Alahari

Despite their success for object detection, convolutional neural networks are ill-equipped for incremental learning, i.e., adapting the original model trained on a set of classes to additionally detect objects of new classes, in the absence of the initial training data.

Incremental Learning Object +2

127

Paper
Code

BlitzNet: A Real-Time Deep Network for Scene Understanding

2 code implementations • ICCV 2017 • Nikita Dvornik, Konstantin Shmelkov, Julien Mairal, Cordelia Schmid

Real-time scene understanding has become crucial in many applications such as autonomous driving.

Ranked #2 on Real-Time Object Detection on PASCAL VOC 2007

Autonomous Driving Object +5

310

Paper
Code

Weakly-supervised learning of visual relations

no code implementations • ICCV 2017 • Julia Peyre, Ivan Laptev, Cordelia Schmid, Josef Sivic

This paper introduces a novel approach for modeling visual relations between pairs of objects.

Ranked #5 on Visual Relationship Detection on VRD Predicate Detection

Clustering Relation +3

Paper
Add Code

Detecting Parts for Action Localization

no code implementations • 19 Jul 2017 • Nicolas Chesneau, Grégory Rogez, Karteek Alahari, Cordelia Schmid

In this paper, we propose a new framework for action localization that tracks people in videos and extracts full-body human tubes, i.e., spatio-temporal regions localizing actions, even in the case of occlusions or truncations.

Action Localization

Paper
Add Code

Developing the Path Signature Methodology and its Application to Landmark-based Human Action Recognition

no code implementations • 13 Jul 2017 • Weixin Yang, Terry Lyons, Hao Ni, Cordelia Schmid, Lianwen Jin

To this end, we regard the evolving landmark data as a high-dimensional path and apply non-linear path signature techniques to provide an expressive, robust, non-linear, and interpretable representation for the sequential events.

Action Classification Action Recognition In Videos +1

Paper
Add Code

LCR-Net: Localization-Classification-Regression for Human Pose

no code implementations • CVPR 2017 • Gregory Rogez, Philippe Weinzaepfel, Cordelia Schmid

We propose an end-to-end architecture for joint 2D and 3D human pose estimation in natural images.

Ranked #4 on 3D Multi-Person Pose Estimation (root-relative) on MuPoTS-3D (MPJPE metric)

3D Human Pose Estimation 3D Multi-Person Pose Estimation (absolute) +4

Paper
Add Code

AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions

8 code implementations • CVPR 2018 • Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik

The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently.

Ranked #6 on Action Detection on UCF101-24

Actin Detection Action Detection +3

76,571

Paper
Code

SCNet: Learning Semantic Correspondence

1 code implementation • ICCV 2017 • Kai Han, Rafael S. Rezende, Bumsub Ham, Kwan-Yee K. Wong, Minsu Cho, Cordelia Schmid, Jean Ponce

This paper addresses the problem of establishing semantic correspondences between images depicting different instances of the same object or scene category.

Semantic correspondence

Paper
Code

Action Tubelet Detector for Spatio-Temporal Action Localization

2 code implementations • ICCV 2017 • Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, Cordelia Schmid

We propose the ACtion Tubelet detector (ACT-detector) that takes as input a sequence of frames and outputs tubelets, i.e., sequences of bounding boxes with associated scores.

Spatio-Temporal Action Localization Temporal Action Localization

104

Paper
Code

SfM-Net: Learning of Structure and Motion from Video

no code implementations • 25 Apr 2017 • Sudheendra Vijayanarasimhan, Susanna Ricco, Cordelia Schmid, Rahul Sukthankar, Katerina Fragkiadaki

We propose SfM-Net, a geometry-aware neural network for motion estimation in videos that decomposes frame-to-frame pixel motion in terms of scene and object depth, camera motion and 3D object rotations and translations.

Motion Estimation Object +1

Paper
Add Code

Learning Video Object Segmentation with Visual Memory

no code implementations • ICCV 2017 • Pavel Tokmakov, Karteek Alahari, Cordelia Schmid

The module to build a "visual memory" in video, i.e., a joint representation of all the video frames, is realized with a convolutional recurrent unit learned from a small number of training video sequences.

Ranked #3 on Unsupervised Video Object Segmentation on SegTrack v2

Motion Segmentation Object +3

Paper
Add Code

Proposal Flow: Semantic Correspondences from Object Proposals

no code implementations • 21 Mar 2017 • Bumsub Ham, Minsu Cho, Cordelia Schmid, Jean Ponce

Finding image correspondences remains a challenging problem in the presence of intra-class variations and large changes in scene layout.

Object

Paper
Add Code

Learning from Synthetic Humans

2 code implementations • CVPR 2017 • Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev, Cordelia Schmid

In this work we present SURREAL (Synthetic hUmans foR REAL tasks): a new large-scale dataset with synthetically-generated but realistic images of people rendered from 3D sequences of human motion capture data.

2D Human Pose Estimation 3D Human Pose Estimation +2

576

Paper
Code

Learning Motion Patterns in Videos

no code implementations • CVPR 2017 • Pavel Tokmakov, Karteek Alahari, Cordelia Schmid

The problem of determining whether an object is in motion, irrespective of camera motion, is far from being solved.

Motion Segmentation Optical Flow Estimation +3

Paper
Add Code

Areas of Attention for Image Captioning

no code implementations • ICCV 2017 • Marco Pedersoli, Thomas Lucas, Cordelia Schmid, Jakob Verbeek

We propose "Areas of Attention", a novel attention-based model for automatic image captioning.

Image Captioning Language Modelling

Paper
Add Code

Multi-region two-stream R-CNN for action detection

no code implementations • European Conference on Computer Vision (ECCV 2016) 2016 • Xiaojiang Peng, Cordelia Schmid

We propose a multi-region two-stream R-CNN model for action detection in realistic videos.

Ranked #2 on Action Detection on UCF Sports

Action Recognition Region Proposal +1

Paper
Add Code

MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild

no code implementations • NeurIPS 2016 • Grégory Rogez, Cordelia Schmid

Here, we propose a solution to generate a large set of photorealistic synthetic images of humans with 3D pose annotations.

Ranked #117 on 3D Human Pose Estimation on Human3.6M (PA-MPJPE metric)

3D Human Pose Estimation 3D Pose Estimation +1

Paper
Add Code

Human Action Localization with Sparse Spatial Supervision

no code implementations • 17 May 2016 • Philippe Weinzaepfel, Xavier Martin, Cordelia Schmid

We introduce an approach for spatio-temporal human action localization using sparse spatial supervision.

Action Localization

Paper
Add Code

Long-term Temporal Convolutions for Action Recognition

1 code implementation • 15 Apr 2016 • Gül Varol, Ivan Laptev, Cordelia Schmid

Typical human actions last several seconds and exhibit characteristic spatio-temporal structure.

Ranked #63 on Action Recognition on HMDB-51

Action Recognition Optical Flow Estimation +1

Paper
Code

Weakly-Supervised Semantic Segmentation using Motion Cues

no code implementations • 23 Mar 2016 • Pavel Tokmakov, Karteek Alahari, Cordelia Schmid

We also demonstrate that the performance of M-CNN learned with 150 weak video annotations is on par with state-of-the-art weakly-supervised methods trained with thousands of images.

Image Segmentation Weakly supervised Semantic Segmentation +1

Paper
Add Code

Convolutional Patch Representations for Image Retrieval: an Unsupervised Approach

no code implementations • 1 Mar 2016 • Mattis Paulin, Julien Mairal, Matthijs Douze, Zaid Harchaoui, Florent Perronnin, Cordelia Schmid

Convolutional neural networks (CNNs) have recently received a lot of attention due to their ability to model local stationary structures in natural images in a multi-scale fashion, when learning all model parameters with supervision.

Image Classification Image Retrieval +1

Paper
Add Code

Local Convolutional Features With Unsupervised Training for Image Retrieval

no code implementations • ICCV 2015 • Mattis Paulin, Matthijs Douze, Zaid Harchaoui, Julien Mairal, Florent Perronnin, Cordelia Schmid

Patch-level descriptors underlie several important computer vision tasks, such as stereo-matching or content-based image retrieval.

Content-Based Image Retrieval Retrieval +2

Paper
Add Code

Proposal Flow

no code implementations • CVPR 2016 • Bumsub Ham, Minsu Cho, Cordelia Schmid, Jean Ponce

Finding image correspondences remains a challenging problem in the presence of intra-class variations and large changes in scene layout. Semantic flow methods are designed to handle images depicting different instances of the same object or scene category.

Object

Paper
Add Code

Approximate Fisher Kernels of non-iid Image Models for Image Categorization

no code implementations • 3 Oct 2015 • Ramazan Gokberk Cinbis, Jakob Verbeek, Cordelia Schmid

It has been experimentally observed that the performance of BoW and FV representations can be improved by employing discounting transformations such as power normalization.

Image Categorization

Paper
Add Code

Online Object Tracking with Proposal Selection

no code implementations • ICCV 2015 • Yang Hua, Karteek Alahari, Cordelia Schmid

Tracking-by-detection approaches are some of the most successful object trackers in recent years.

Object Visual Object Tracking

Paper
Add Code

Expanded Parts Model for Semantic Description of Humans in Still Images

no code implementations • 14 Sep 2015 • Gaurav Sharma, Frederic Jurie, Cordelia Schmid

We validate our method on three recent challenging datasets of human attributes and actions.

Paper
Add Code

Beat-Event Detection in Action Movie Franchises

no code implementations • 15 Aug 2015 • Danila Potapov, Matthijs Douze, Jerome Revaud, Zaid Harchaoui, Cordelia Schmid

While important advances were recently made towards temporally localizing and recognizing specific human actions or activities in videos, efficient detection and classification of long video chunks belonging to semantically defined categories such as "pursuit" or "romance" remains challenging. We introduce a new dataset, Action Movie Franchises, consisting of a collection of Hollywood action movie franchises.

Classification Event Detection +1

Paper
Add Code

DeepMatching: Hierarchical Deformable Dense Matching

1 code implementation • 25 Jun 2015 • Jerome Revaud, Philippe Weinzaepfel, Zaid Harchaoui, Cordelia Schmid

We introduce a novel matching algorithm, called DeepMatching, to compute dense correspondences between images.

Ranked #4 on Dense Pixel Correspondence Estimation on HPatches

Dense Pixel Correspondence Estimation Optical Flow Estimation

Paper
Code

P-CNN: Pose-based CNN Features for Action Recognition

no code implementations • ICCV 2015 • Guilhem Chéron, Ivan Laptev, Cordelia Schmid

This work targets human action recognition in video.

Action Recognition Temporal Action Localization

Paper
Add Code

Circulant temporal encoding for video retrieval and temporal alignment

1 code implementation • 8 Jun 2015 • Matthijs Douze, Jérôme Revaud, Jakob Verbeek, Hervé Jégou, Cordelia Schmid

We address the problem of specific video event retrieval.

Retrieval Video Retrieval

132

Paper
Code

Learning to track for spatio-temporal action localization

no code implementations • ICCV 2015 • Philippe Weinzaepfel, Zaid Harchaoui, Cordelia Schmid

We present experimental results for spatio-temporal localization on the UCF-Sports, J-HMDB and UCF-101 action localization datasets, where our approach outperforms the state of the art with a margin of 15%, 7% and 12% respectively in mAP.

Spatio-Temporal Action Localization Temporal Action Localization +1

Paper
Add Code

Learning to Detect Motion Boundaries

no code implementations • CVPR 2015 • Philippe Weinzaepfel, Jerome Revaud, Zaid Harchaoui, Cordelia Schmid

We compare the results obtained with several state-of-the-art optical flow approaches and study the impact of the different cues used in the random forest. Furthermore, we introduce a new dataset, the YouTube Motion Boundaries dataset (YMB), that comprises 60 sequences taken from real-world videos with manually annotated motion boundaries.

Boundary Detection Optical Flow Estimation

Paper
Add Code

Weakly-Supervised Alignment of Video With Text

no code implementations • ICCV 2015 • Piotr Bojanowski, Rémi Lajugie, Edouard Grave, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid

Given vectorial features for both video and text, we propose to cast this task as a temporal assignment problem, with an implicit linear mapping between the two feature modalities.

Sentence

Paper
Add Code

Unsupervised Object Discovery and Tracking in Video Collections

no code implementations • ICCV 2015 • Suha Kwak, Minsu Cho, Ivan Laptev, Jean Ponce, Cordelia Schmid

This paper addresses the problem of automatically localizing dominant objects as spatio-temporal tubes in a noisy collection of videos with minimal or even no supervision.

Object Object Discovery +1

Paper
Add Code

A robust and efficient video representation for action recognition

no code implementations • 21 Apr 2015 • Heng Wang, Dan Oneata, Jakob Verbeek, Cordelia Schmid

We also use the homography to cancel out camera motion from the optical flow.

Action Recognition Homography Estimation +4

Paper
Add Code

Label-Embedding for Image Classification

2 code implementations • 30 Mar 2015 • Zeynep Akata, Florent Perronnin, Zaid Harchaoui, Cordelia Schmid

Attributes act as intermediate representations that enable parameter sharing between classes, a must when training data is scarce.

Ranked #7 on Multi-label zero-shot learning on Open Images V4

Attribute Classification +4

Paper
Code

Weakly Supervised Object Localization with Multi-fold Multiple Instance Learning

no code implementations • 3 Mar 2015 • Ramazan Gokberk Cinbis, Jakob Verbeek, Cordelia Schmid

In this case, the supervised information is restricted to binary labels that indicate the absence/presence of object instances in the image, without their locations.

Multiple Instance Learning Object +2

Paper
Add Code

Unsupervised Object Discovery and Localization in the Wild: Part-based Matching with Bottom-up Region Proposals

no code implementations • CVPR 2015 • Minsu Cho, Suha Kwak, Cordelia Schmid, Jean Ponce

This paper addresses unsupervised discovery and localization of dominant objects from a noisy image collection with multiple object classes.

Object Object Discovery

Paper
Add Code

EpicFlow: Edge-Preserving Interpolation of Correspondences for Optical Flow

no code implementations • CVPR 2015 • Jerome Revaud, Philippe Weinzaepfel, Zaid Harchaoui, Cordelia Schmid

We propose a novel approach for optical flow estimation, targeted at large displacements with significant occlusions.

Optical Flow Estimation

Paper
Add Code

Analysing domain shift factors between videos and images for object detection

1 code implementation • 6 Jan 2015 • Vicky Kalogeiton, Vittorio Ferrari, Cordelia Schmid

Object detection is one of the most important challenges in computer vision.

Object object-detection +1

Paper
Code

Weakly Supervised Action Labeling in Videos Under Ordering Constraints

no code implementations • 4 Jul 2014 • Piotr Bojanowski, Rémi Lajugie, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid, Josef Sivic

We are given a set of video clips, each one annotated with an ordered list of actions, such as "walk" then "sit" then "answer phone" extracted from, for example, the associated text script.

Paper
Add Code

Convolutional Kernel Networks

no code implementations • NeurIPS 2014 • Julien Mairal, Piotr Koniusz, Zaid Harchaoui, Cordelia Schmid

An important goal in visual recognition is to devise image representations that are invariant to particular transformations.

Ranked #23 on Image Classification on MNIST

Image Classification

Paper
Add Code

Transformation Pursuit for Image Classification

no code implementations • CVPR 2014 • Mattis Paulin, Jerome Revaud, Zaid Harchaoui, Florent Perronnin, Cordelia Schmid

We propose a principled algorithm Image Transformation Pursuit (ITP) for the automatic selection of a compact set of transformations.

Classification General Classification +1

Paper
Add Code

Multi-fold MIL Training for Weakly Supervised Object Localization

no code implementations • CVPR 2014 • Ramazan Gokberk Cinbis, Jakob Verbeek, Cordelia Schmid

In this case, the supervised information is restricted to binary labels that indicate the absence/presence of object instances in the image, without their locations.

Multiple Instance Learning Object +2

Paper
Add Code

Mixing Body-Part Sequences for Human Pose Estimation

no code implementations • CVPR 2014 • Anoop Cherian, Julien Mairal, Karteek Alahari, Cordelia Schmid

In this paper, we present a method for estimating articulated human poses in videos.

Pose Estimation

Paper
Add Code

Efficient Action Localization with Approximately Normalized Fisher Vectors

no code implementations • CVPR 2014 • Dan Oneata, Jakob Verbeek, Cordelia Schmid

Transformation of the FV by power and L2 normalizations has been shown to significantly improve its performance, and has led to state-of-the-art results for a range of image and video classification and retrieval tasks.

Action Recognition General Classification +4

Paper
Add Code

Expanded Parts Model for Human Attribute and Action Recognition in Still Images

no code implementations • CVPR 2013 • Gaurav Sharma, Frederic Jurie, Cordelia Schmid

We propose a new model for recognizing human attributes (e.g. wearing a suit, sitting, short hair) and actions (e.g. running, riding a horse) in still images.

Action Recognition In Still Images Attribute

Paper
Add Code

Label-Embedding for Attribute-Based Classification

no code implementations • CVPR 2013 • Zeynep Akata, Florent Perronnin, Zaid Harchaoui, Cordelia Schmid

The label embedding framework offers other advantages such as the ability to leverage alternative sources of information in addition to attributes (e.g. class hierarchies) or to transition smoothly from zero-shot learning to learning with large quantities of data.

Ranked #5 on Few-Shot Image Classification on CUB-200-2011 - 0-Shot

Attribute Classification +3

Paper
Add Code

Event Retrieval in Large Video Collections with Circulant Temporal Encoding

no code implementations • CVPR 2013 • Jerome Revaud, Matthijs Douze, Cordelia Schmid, Herve Jegou

Furthermore, we extend product quantization to complex vectors in order to compress our descriptors, and to compare them in the compressed domain.

Copy Detection Quantization +1

Paper
Add Code
