With the proposed inter-intra contrastive framework, we can train spatio-temporal convolutional networks to learn video representations.
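The core of such a contrastive framework is an InfoNCE-style loss that pulls two views of the same clip together and pushes other clips away. Below is a minimal numpy sketch of that loss for a single anchor; the function name, shapes, and temperature are illustrative assumptions, not the paper's actual interface.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Toy InfoNCE loss for one anchor embedding.

    anchor, positive: (d,) L2-normalized embeddings of two views of the
    same clip (the "intra" pair); negatives: (n, d) embeddings of other
    clips (the "inter" samples). Illustrative sketch only.
    """
    pos = np.dot(anchor, positive) / temperature
    neg = negatives @ anchor / temperature
    logits = np.concatenate([[pos], neg])
    # Cross-entropy with the positive as the target class (index 0).
    return -pos + np.log(np.sum(np.exp(logits)))

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
a = l2norm(rng.normal(size=8))                # anchor view of a clip
p = l2norm(a + 0.05 * rng.normal(size=8))     # a nearby "positive" view
n = l2norm(rng.normal(size=(16, 8)))          # unrelated clips as negatives
loss = info_nce(a, p, n)
```

The loss shrinks as the anchor-positive similarity grows relative to the negatives, which is what drives the learned representation to be clip-discriminative.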
To further enhance model capacity and test the robustness of the proposed architecture on difficult transfer tasks, we extend our model to a semi-supervised setting using an additional video-level bipartite graph.
This work introduces pyramidal convolution (PyConv), which is capable of processing the input at multiple filter scales.
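The idea can be sketched as running the same input through filters of several spatial sizes in parallel and stacking the responses. The following toy numpy version handles a single input channel and uses random kernels; the real PyConv also varies filter depth via grouped convolutions, which is omitted here.

```python
import numpy as np

def conv2d_same(x, k):
    """Naive single-channel 2D convolution with zero 'same' padding."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    H, W = x.shape
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def pyconv(x, kernel_sizes=(3, 5, 7)):
    """Pyramidal-convolution sketch: one conv per filter scale, outputs
    stacked as channels. Kernels are random placeholders for learned ones."""
    rng = np.random.default_rng(0)
    outs = [conv2d_same(x, rng.normal(size=(ks, ks)) / ks**2)
            for ks in kernel_sizes]
    return np.stack(outs)  # (num_scales, H, W)

x = np.arange(64, dtype=float).reshape(8, 8)
y = pyconv(x)
```

Each scale sees a different receptive field over the same input, so the stacked output mixes fine and coarse detail at every spatial position.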
Egocentric gestures are the most natural form of communication for humans to interact with wearable devices such as VR/AR helmets and glasses.
Therefore, in the present paper, we conduct an exploratory study aimed at improving spatiotemporal 3D CNNs, as follows: (i) recently proposed large-scale video datasets help improve spatiotemporal 3D CNNs in terms of video classification accuracy.
We successfully train a 404-layer deep CNN on the ImageNet dataset and a 3002-layer network on CIFAR-10 and CIFAR-100, while the baseline is not able to converge at such extreme depths.
We propose using a universal adversarial trigger as the backdoor trigger for attacking video recognition models, a setting in which backdoor attacks are likely to be challenged by the four strict conditions above.
In this work we present a manipulation scheme for fooling video classifiers by introducing a flickering temporal perturbation that is practically unnoticeable by human observers and is implementable in the real world.
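The key property of such a perturbation is that it is spatially uniform within each frame and varies only over time. A minimal numpy sketch, assuming a (T, H, W, 3) video in [0, 1]; the random per-frame RGB offsets stand in for offsets the actual attack would optimize against the classifier.

```python
import numpy as np

def flickering_perturbation(video, amplitude=0.02, seed=0):
    """Add a spatially-uniform, per-frame RGB offset to a video.

    Each frame t gets a single RGB offset delta[t] broadcast over all
    pixels, so the perturbation carries no spatial pattern, only a
    temporal "flicker". amplitude and the random offsets are
    illustrative placeholders, not optimized attack values.
    """
    T = video.shape[0]
    rng = np.random.default_rng(seed)
    delta = amplitude * rng.uniform(-1, 1, size=(T, 1, 1, 3))
    return np.clip(video + delta, 0.0, 1.0)

video = np.full((16, 32, 32, 3), 0.5)   # a toy constant-gray clip
adv = flickering_perturbation(video)
```

Because the offset is constant within a frame and small in amplitude, it is hard for a human to notice, yet it can be realized physically, e.g. by modulating scene lighting over time.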
To overcome this challenge, we propose a heuristic black-box attack model that generates adversarial perturbations only on selected frames and regions.
In this work, we argue that aggregating features at the full-sequence level leads to more discriminative and robust features for video object detection.
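Full-sequence aggregation can be sketched as letting a target frame attend over per-frame features from the entire video rather than a local window. The attention-style weighting below is a generic illustration under that assumption, not the paper's exact aggregation module.

```python
import numpy as np

def aggregate_full_sequence(frame_feats, target_idx, temperature=1.0):
    """Aggregate per-frame features across the whole sequence.

    frame_feats: (T, d) features, one per frame. The target frame attends
    to every frame in the video: weights come from a softmax over cosine
    similarities with the target frame, and the output is the weighted
    average of the raw features. Illustrative sketch only.
    """
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    sims = f @ f[target_idx] / temperature   # (T,) similarities to target
    w = np.exp(sims - sims.max())            # stable softmax weights
    w /= w.sum()
    return w @ frame_feats                   # (d,) aggregated feature

feats = np.random.default_rng(0).normal(size=(30, 16))  # 30 frames, d=16
agg = aggregate_full_sequence(feats, target_idx=5)
```

Frames similar to the target (e.g. clean views of the same object) dominate the weights, so occlusion or blur in the target frame is compensated by evidence from anywhere in the sequence.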