Search Results for author: Ivan Laptev

Found 87 papers, 49 papers with code

Learning Actionness via Long-range Temporal Order Verification

no code implementations ECCV 2020 Dimitri Zhukov, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic

The annotation is particularly difficult for temporal action localization where large parts of the video present no action, or background.

Action Recognition Temporal Action Localization

SUGAR: Pre-training 3D Visual Representations for Robotics

no code implementations 1 Apr 2024 ShiZhe Chen, Ricardo Garcia, Ivan Laptev, Cordelia Schmid

SUGAR employs a versatile transformer-based model to jointly address five pre-training tasks, namely cross-modal knowledge distillation for semantic learning, masked point modeling to understand geometry structures, grasping pose synthesis for object affordance, 3D instance segmentation and referring expression grounding to analyze cluttered scenes.

3D Instance Segmentation 3D Object Recognition +5

GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos

1 code implementation 12 Dec 2023 Tomáš Souček, Dima Damen, Michael Wray, Ivan Laptev, Josef Sivic

We address the task of generating temporally consistent and physically plausible images of actions and object state transformations.

Object

PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation

1 code implementation 27 Sep 2023 ShiZhe Chen, Ricardo Garcia, Cordelia Schmid, Ivan Laptev

The ability for robots to comprehend and execute manipulation tasks based on natural language instructions is a long-term goal in robotics.

Multi-Task Learning Robot Manipulation

VidChapters-7M: Video Chapters at Scale

no code implementations NeurIPS 2023 Antoine Yang, Arsha Nagrani, Ivan Laptev, Josef Sivic, Cordelia Schmid

To address this issue, we present VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total.

Dense Video Captioning Navigate

Object Goal Navigation with Recursive Implicit Maps

no code implementations 10 Aug 2023 ShiZhe Chen, Thomas Chabal, Ivan Laptev, Cordelia Schmid

Object goal navigation aims to navigate an agent to locations of a given object category in unseen environments.

Navigate Object

Robust Visual Sim-to-Real Transfer for Robotic Manipulation

no code implementations 28 Jul 2023 Ricardo Garcia, Robin Strudel, ShiZhe Chen, Etienne Arlaud, Ivan Laptev, Cordelia Schmid

While previous work mainly evaluates DR for disembodied tasks, such as pose estimation and object detection, here we systematically explore visual domain randomization methods and benchmark them on a rich set of challenging robotic manipulation tasks.

object-detection Object Detection +1

Learning Video-Conditioned Policies for Unseen Manipulation Tasks

no code implementations 10 May 2023 Elliot Chane-Sane, Cordelia Schmid, Ivan Laptev

To encourage generalization to new tasks, we avoid particular tasks during training and learn our policy from unlabelled robot trajectories and corresponding robot videos.

Action Recognition Robot Manipulation +1

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

3 code implementations CVPR 2023 Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, Cordelia Schmid

In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos which are readily-available at scale.

Ranked #1 on Dense Video Captioning on ActivityNet Captions (using extra training data)

Dense Video Captioning Language Modelling +1

Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation

2 code implementations 20 Dec 2022 Matthieu Futeral, Cordelia Schmid, Ivan Laptev, Benoît Sagot, Rachel Bawden

One of the major challenges of machine translation (MT) is ambiguity, which can in some cases be resolved by accompanying context such as images.

Multimodal Machine Translation Translation

Multi-Task Learning of Object State Changes from Uncurated Videos

1 code implementation 24 Nov 2022 Tomáš Souček, Jean-Baptiste Alayrac, Antoine Miech, Ivan Laptev, Josef Sivic

We aim to learn to temporally localize object state changes and the corresponding state-modifying actions by observing people interacting with objects in long uncurated web videos.

Multi-Task Learning Object +2

Language Conditioned Spatial Relation Reasoning for 3D Object Grounding

1 code implementation 17 Nov 2022 ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

In this work we propose a language-conditioned transformer model for grounding 3D objects and their spatial relations.

Object Relation

Instruction-driven history-aware policies for robotic manipulations

2 code implementations 11 Sep 2022 Pierre-Louis Guhur, ShiZhe Chen, Ricardo Garcia, Makarand Tapaswi, Ivan Laptev, Cordelia Schmid

In human environments, robots are expected to accomplish a variety of manipulation tasks given simple natural language instructions.

Ranked #2 on Robot Manipulation on RLBench (Succ. Rate (10 tasks, 100 demos/task) metric)

Robot Manipulation

Learning from Unlabeled 3D Environments for Vision-and-Language Navigation

1 code implementation 24 Aug 2022 ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

Our resulting HM3D-AutoVLN dataset is an order of magnitude larger than existing VLN datasets in terms of navigation environments and instructions.

Language Modelling Navigate +3

AlignSDF: Pose-Aligned Signed Distance Fields for Hand-Object Reconstruction

1 code implementation 26 Jul 2022 Zerui Chen, Yana Hasson, Cordelia Schmid, Ivan Laptev

We show that such aligned SDFs better focus on reconstructing shape details and improve reconstruction accuracy both for hands and objects.

hand-object pose Object Reconstruction

Weakly-supervised segmentation of referring expressions

no code implementations 10 May 2022 Robin Strudel, Ivan Laptev, Cordelia Schmid

Visual grounding localizes regions (boxes or segments) in the image corresponding to given referring expressions.

Image Segmentation Referring Expression +5

Learning to Answer Visual Questions from Web Videos

1 code implementation 10 May 2022 Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

We use our method to generate the WebVidVQA3M dataset from the WebVid dataset, i.e., videos with alt-text annotations, and show its benefits for training VideoQA models.

Question Answering Question Generation +4

Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos

1 code implementation CVPR 2022 Tomáš Souček, Jean-Baptiste Alayrac, Antoine Miech, Ivan Laptev, Josef Sivic

In this paper, we seek to temporally localize object states (e.g. "empty" and "full" cup) together with the corresponding state-modifying actions ("pouring coffee") in long uncurated videos with minimal supervision.

Object

Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation

1 code implementation CVPR 2022 ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

To balance the complexity of large action space reasoning and fine-grained language grounding, we dynamically combine a fine-scale encoding over local observations and a coarse-scale encoding on a global map via graph transformers.

Efficient Exploration Navigate +2

Are Large-scale Datasets Necessary for Self-Supervised Pre-training?

no code implementations 20 Dec 2021 Alaaeldin El-Nouby, Gautier Izacard, Hugo Touvron, Ivan Laptev, Hervé Jégou, Edouard Grave

Our study shows that denoising autoencoders, such as BEiT or a variant that we introduce in this paper, are more robust to the type and size of the pre-training data than popular self-supervised methods trained by comparing image embeddings. We obtain competitive performance compared to ImageNet pre-training on a variety of classification datasets, from different domains.

Denoising Instance Segmentation +1

Estimating 3D Motion and Forces of Human-Object Interactions from Internet Videos

no code implementations 2 Nov 2021 Zongmian Li, Jiri Sedlar, Justin Carpentier, Ivan Laptev, Nicolas Mansard, Josef Sivic

First, we introduce an approach to jointly estimate the motion and the actuation forces of the person on the manipulated object by modeling contacts and the dynamics of the interactions.

Human-Object Interaction Detection Object

Reconstructing and grounding narrated instructional videos in 3D

no code implementations 9 Sep 2021 Dimitri Zhukov, Ignacio Rocco, Ivan Laptev, Josef Sivic, Johannes L. Schönberger, Bugra Tekin, Marc Pollefeys

Contrary to the standard scenario of instance-level 3D reconstruction, where identical objects or scenes are present in all views, objects in different instructional videos may have large appearance variations given varying conditions and versions of the same product.

3D Reconstruction

Airbert: In-domain Pretraining for Vision-and-Language Navigation

2 code implementations ICCV 2021 Pierre-Louis Guhur, Makarand Tapaswi, ShiZhe Chen, Ivan Laptev, Cordelia Schmid

Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging.

Navigate Referring Expression +1

Goal-Conditioned Reinforcement Learning with Imagined Subgoals

no code implementations 1 Jul 2021 Elliot Chane-Sane, Cordelia Schmid, Ivan Laptev

Goal-conditioned reinforcement learning endows an agent with a large variety of skills, but it often struggles to solve tasks that require more temporally extended reasoning.

reinforcement-learning Reinforcement Learning (RL)

XCiT: Cross-Covariance Image Transformers

11 code implementations NeurIPS 2021 Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, Hervé Jégou

We propose a "transposed" version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries.

Instance Segmentation object-detection +3

Training Vision Transformers for Image Retrieval

1 code implementation 10 Feb 2021 Alaaeldin El-Nouby, Natalia Neverova, Ivan Laptev, Hervé Jégou

Transformers have shown outstanding results for natural language understanding and, more recently, for image classification.

Image Classification Image Retrieval +3

Just Ask: Learning to Answer Questions from Millions of Narrated Videos

1 code implementation ICCV 2021 Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision.

Question Answering Question Generation +4

Learning Object Manipulation Skills via Approximate State Estimation from Real Videos

1 code implementation 13 Nov 2020 Vladimír Petrík, Makarand Tapaswi, Ivan Laptev, Josef Sivic

We evaluate our method on simple single- and two-object actions from the Something-Something dataset.

Object

Learning Obstacle Representations for Neural Motion Planning

1 code implementation 25 Aug 2020 Robin Strudel, Ricardo Garcia, Justin Carpentier, Jean-Paul Laumond, Ivan Laptev, Cordelia Schmid

Motion planning and obstacle avoidance is a key challenge in robotics applications.

Robotics

RareAct: A video dataset of unusual interactions

1 code implementation 3 Aug 2020 Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, Andrew Zisserman

This paper introduces a manually annotated video dataset of unusual actions, namely RareAct, including actions such as "blend phone", "cut keyboard" and "microwave shoes".

Action Recognition

Occlusion resistant learning of intuitive physics from videos

no code implementations 30 Apr 2020 Ronan Riochet, Josef Sivic, Ivan Laptev, Emmanuel Dupoux

In this work we propose a probabilistic formulation of learning intuitive physics in 3D scenes with significant inter-object occlusions.

Object

Learning visual policies for building 3D shape categories

no code implementations 15 Apr 2020 Alexander Pashevich, Igor Kalevatykh, Ivan Laptev, Cordelia Schmid

We then show the success of our visual policies for building arches from different primitives.

Object

Learning Interactions and Relationships between Movie Characters

1 code implementation CVPR 2020 Anna Kukleva, Makarand Tapaswi, Ivan Laptev

Localizing the pair of interacting characters in video is a time-consuming process; instead, we train our model to learn from clip-level weak labels.

Action Modifiers: Learning from Adverbs in Instructional Videos

1 code implementation CVPR 2020 Hazel Doughty, Ivan Laptev, Walterio Mayol-Cuevas, Dima Damen

We present a method to learn a representation for adverbs from instructional videos using weak supervision from the accompanying narrations.

Video-Adverb Retrieval

Synthetic Humans for Action Recognition from Unseen Viewpoints

1 code implementation 9 Dec 2019 Gül Varol, Ivan Laptev, Cordelia Schmid, Andrew Zisserman

Although synthetic training data has been shown to be beneficial for tasks such as human pose estimation, its use for RGB human action recognition is relatively unexplored.

Action Classification Action Recognition +2

Monte-Carlo Tree Search for Efficient Visually Guided Rearrangement Planning

2 code implementations 23 Apr 2019 Yann Labbé, Sergey Zagoruyko, Igor Kalevatykh, Ivan Laptev, Justin Carpentier, Mathieu Aubry, Josef Sivic

We address the problem of visually guided rearrangement planning with many movable objects, i.e., finding a sequence of actions to move a set of objects from an initial arrangement to a desired one, while relying on visual inputs coming from an RGB camera.

Deep Metric Learning Beyond Binary Supervision

1 code implementation CVPR 2019 Sungyeon Kim, Minkyo Seo, Ivan Laptev, Minsu Cho, Suha Kwak

Metric Learning for visual similarity has mostly adopted binary supervision indicating whether a pair of images are of the same class or not.

Image Captioning Image Retrieval +4

Estimating 3D Motion and Forces of Person-Object Interactions from Monocular Video

1 code implementation CVPR 2019 Zongmian Li, Jiri Sedlar, Justin Carpentier, Ivan Laptev, Nicolas Mansard, Josef Sivic

First, we introduce an approach to jointly estimate the motion and the actuation forces of the person on the manipulated object by modeling contacts and the dynamics of their interactions.

Object

Cross-task weakly supervised learning from instructional videos

2 code implementations CVPR 2019 Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, Josef Sivic

In this paper we investigate learning visual models for the steps of ordinary tasks using weak supervision via instructional narrations and an ordered list of steps instead of strong supervision via temporal annotations.

Weakly-supervised Learning

Learning to Augment Synthetic Images for Sim2Real Policy Transfer

1 code implementation 18 Mar 2019 Alexander Pashevich, Robin Strudel, Igor Kalevatykh, Ivan Laptev, Cordelia Schmid

Policies learned in simulators, however, do not transfer well to real scenes given the domain gap between real and synthetic data.

Object Localization

Detecting unseen visual relations using analogies

no code implementations ICCV 2019 Julia Peyre, Ivan Laptev, Cordelia Schmid, Josef Sivic

We seek to detect visual relations in images of the form of triplets t = (subject, predicate, object), such as "person riding dog", where training examples of the individual entities are available but their combinations are unseen at training.

Retrieval

Tube-CNN: Modeling temporal evolution of appearance for object detection in video

no code implementations 6 Dec 2018 Tuan-Hung Vu, Anton Osokin, Ivan Laptev

Our goal in this paper is to learn discriminative models for the temporal evolution of object appearance and to use such models for object detection.

Object object-detection +2

Learning to Localize and Align Fine-Grained Actions to Sparse Instructions

no code implementations 22 Sep 2018 Meera Hahn, Nataniel Ruiz, Jean-Baptiste Alayrac, Ivan Laptev, James M. Rehg

Automatic generation of textual video descriptions that are time-aligned with video content is a long-standing goal in computer vision.

Object Object Recognition

Modeling Spatio-Temporal Human Track Structure for Action Localization

no code implementations 28 Jun 2018 Guilhem Chéron, Anton Osokin, Ivan Laptev, Cordelia Schmid

In order to localize actions in time, we propose a recurrent localization network (RecLNet) designed to model the temporal structure of actions on the level of person tracks.

Human Detection Optical Flow Estimation +3

Learning a Text-Video Embedding from Incomplete and Heterogeneous Data

5 code implementations 7 Apr 2018 Antoine Miech, Ivan Laptev, Josef Sivic

We evaluate our method on the task of video retrieval and report results for the MPII Movie Description and MSR-VTT datasets.

Ranked #33 on Video Retrieval on LSMDC (using extra training data)

Retrieval Text Retrieval +2

Learnable pooling with Context Gating for video classification

5 code implementations 21 Jun 2017 Antoine Miech, Ivan Laptev, Josef Sivic

In particular, we evaluate our method on the large-scale multi-modal YouTube-8M v2 dataset and outperform all other methods in the YouTube-8M Large-Scale Video Understanding challenge.

Classification Clustering +3

Joint Discovery of Object States and Manipulation Actions

1 code implementation ICCV 2017 Jean-Baptiste Alayrac, Josef Sivic, Ivan Laptev, Simon Lacoste-Julien

We assume a consistent temporal order for the changes in object states and manipulation actions, and introduce new optimization techniques to learn model parameters without additional supervision.

Action Recognition Clustering +2

Learning from Synthetic Humans

2 code implementations CVPR 2017 Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev, Cordelia Schmid

In this work we present SURREAL (Synthetic hUmans foR REAL tasks): a new large-scale dataset with synthetically-generated but realistic images of people rendered from 3D sequences of human motion capture data.

2D Human Pose Estimation 3D Human Pose Estimation +2

Much Ado About Time: Exhaustive Annotation of Temporal Data

no code implementations 25 Jul 2016 Gunnar A. Sigurdsson, Olga Russakovsky, Ali Farhadi, Ivan Laptev, Abhinav Gupta

We conclude that the optimal strategy is to ask as many questions as possible in a HIT (up to 52 binary questions after watching a 30-second video clip in our experiments).

Thin-Slicing for Pose: Learning to Understand Pose Without Explicit Pose Estimation

no code implementations CVPR 2016 Suha Kwak, Minsu Cho, Ivan Laptev

We address the problem of learning a pose-aware, compact embedding that projects images with similar human poses to be placed close-by in the embedding space.

Action Recognition Image Retrieval +3

The THUMOS Challenge on Action Recognition for Videos "in the Wild"

no code implementations 21 Apr 2016 Haroon Idrees, Amir R. Zamir, Yu-Gang Jiang, Alex Gorban, Ivan Laptev, Rahul Sukthankar, Mubarak Shah

Additionally, we include a comprehensive empirical study evaluating the differences in action recognition between trimmed and untrimmed videos, and how well methods trained on trimmed videos generalize to untrimmed videos.

Action Classification Action Recognition +3

Context-aware CNNs for person head detection

1 code implementation ICCV 2015 Tuan-Hung Vu, Anton Osokin, Ivan Laptev

First, we leverage person-scene relations and propose a Global CNN model trained to predict positions and scales of heads directly from the full image.

Face Detection Head Detection +1

Unsupervised Learning from Narrated Instruction Videos

no code implementations CVPR 2016 Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, Simon Lacoste-Julien

Third, we experimentally demonstrate that the proposed method can automatically discover, in an unsupervised manner, the main steps to achieve the task and locate the steps in the input videos.

Clustering

Weakly-Supervised Alignment of Video With Text

no code implementations ICCV 2015 Piotr Bojanowski, Rémi Lajugie, Edouard Grave, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid

Given vectorial features for both video and text, we propose to cast this task as a temporal assignment problem, with an implicit linear mapping between the two feature modalities.

Sentence

Unsupervised Object Discovery and Tracking in Video Collections

no code implementations ICCV 2015 Suha Kwak, Minsu Cho, Ivan Laptev, Jean Ponce, Cordelia Schmid

This paper addresses the problem of automatically localizing dominant objects as spatio-temporal tubes in a noisy collection of videos with minimal or even no supervision.

Object Object Discovery +1

Weakly Supervised Action Labeling in Videos Under Ordering Constraints

no code implementations 4 Jul 2014 Piotr Bojanowski, Rémi Lajugie, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid, Josef Sivic

We are given a set of video clips, each one annotated with an ordered list of actions, such as "walk" then "sit" then "answer phone" extracted from, for example, the associated text script.

Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks

1 code implementation CVPR 2014 Maxime Oquab, Leon Bottou, Ivan Laptev, Josef Sivic

We show that despite differences in image statistics and tasks in the two datasets, the transferred representation leads to significantly improved results for object and action classification, outperforming the current state of the art on Pascal VOC 2007 and 2012 datasets.

Action Classification Action Localization +4

Learning person-object interactions for action recognition in still images

no code implementations NeurIPS 2011 Vincent Delaitre, Josef Sivic, Ivan Laptev

First, we replace the standard quantized local HOG/SIFT features with stronger discriminatively trained body part and object detectors.

Action Recognition In Still Images Object
