Human Activity Recognition is the problem of identifying events performed by humans given a video input. It is formulated as a binary (or multiclass) classification problem of outputting activity class labels. Activity Recognition is an important problem with many societal applications including smart surveillance, video search/retrieval, intelligent robots, and other monitoring systems.
We create a family of powerful video models which are able to: (i) learn interactions between semantic object information and raw appearance and motion features, and (ii) deploy attention in order to better learn the importance of features at each convolutional block of the network.
Ranked #1 on Action Classification on Charades
Second, frame-based models perform quite well on action recognition; is pre-training for good image features sufficient or is pre-training for spatio-temporal features valuable for optimal transfer learning?
Ranked #1 on Egocentric Activity Recognition on EPIC-Kitchens (Actions Top-1 (S2) metric)
Human activity recognition (HAR) has become a popular topic in research because of its wide application.
Temporal relational reasoning, the ability to link meaningful transformations of objects or entities over time, is a fundamental property of intelligent species.
Ranked #2 on Hand Gesture Recognition on Jester test
We demonstrate that using both RNNs (using LSTMs) and Temporal-ConvNets on spatiotemporal feature matrices are able to exploit spatiotemporal dynamics to improve the overall performance.
Ranked #18 on Action Recognition on UCF101
In this context, deep learning has emerged in recent years as one of the most effective methods for tackling the supervised classification task, particularly in the field of computer vision.
However, by exploiting a simple technique that removes motion information, we show that it is not the case that this technique is effective as-is for representing relevance in non-image tasks.
Research on depth-based human activity analysis achieved outstanding performance and demonstrated the effectiveness of 3D representation for action recognition.
Ranked #2 on One-Shot 3D Action Recognition on NTU RGB+D 120
To avoid annotating the anomalous segments or clips in training videos, which is very time consuming, we propose to learn anomaly through the deep multiple instance ranking framework by leveraging weakly labeled training videos, i. e. the training labels (anomalous or normal) are at video-level instead of clip-level.