Therefore, in the present paper, we conduct an exploratory study to improve spatiotemporal 3D CNNs, as follows: (i) Recently proposed large-scale video datasets help improve spatiotemporal 3D CNNs in terms of video classification accuracy.
Similarly, the output feature maps of a convolution layer can also be seen as a mixture of information at different frequencies.
#34 best model for Image Classification on ImageNet (using extra training data)
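The idea that a feature map mixes information at different frequencies can be illustrated with a small sketch. It assumes a simple split (2x2 average pooling for the low-frequency part, the residual for the high-frequency part), which is not necessarily the decomposition the quoted paper uses; the function names and the toy `feature_map` are hypothetical.

```python
def avg_pool_2x2(fmap):
    """Downsample a 2D map (even height and width) by averaging 2x2 blocks."""
    h, w = len(fmap), len(fmap[0])
    return [
        [
            (fmap[i][j] + fmap[i][j + 1] + fmap[i + 1][j] + fmap[i + 1][j + 1]) / 4.0
            for j in range(0, w, 2)
        ]
        for i in range(0, h, 2)
    ]


def upsample_nearest_2x(fmap):
    """Upsample a 2D map by repeating each value over a 2x2 block."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def split_frequencies(fmap):
    """Return (low, high) such that low + high reconstructs fmap."""
    # Assumption: "low frequency" = smooth, block-averaged content.
    low = upsample_nearest_2x(avg_pool_2x2(fmap))
    # "High frequency" = the fine detail left after removing the smooth part.
    high = [[x - l for x, l in zip(row, low_row)] for row, low_row in zip(fmap, low)]
    return low, high


# Hypothetical 4x4 single-channel feature map.
feature_map = [
    [1.0, 3.0, 2.0, 2.0],
    [5.0, 7.0, 2.0, 2.0],
    [0.0, 0.0, 4.0, 8.0],
    [0.0, 0.0, 8.0, 4.0],
]
low, high = split_frequencies(feature_map)
```

In this toy split, the low-frequency part could be stored at half resolution without loss, which is the kind of redundancy a frequency-aware layer might exploit.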
The explosive growth in video streaming gives rise to challenges in performing video understanding at high accuracy and low computation cost.
The accuracy of detection suffers from degenerated object appearances in videos, e.g., motion blur, video defocus, rare poses, etc.
#4 best model for Video Object Detection on ImageNet VID
The micro-batch training setting is hard because small batch sizes are not enough for training networks with Batch Normalization (BN), while other normalization methods that do not rely on batch knowledge still have difficulty matching the performance of BN in large-batch training.
#12 best model for Instance Segmentation on COCO minival
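The dependence of BN on batch statistics can be seen in a minimal sketch. It compares plain batch normalization with per-sample normalization (layer normalization is used here as one example of a batch-independent method; the source does not say which methods it has in mind). Learned scale and shift parameters are omitted, and all names and toy values are hypothetical.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalize each feature with the mean/variance computed across the batch."""
    n, n_feat = len(batch), len(batch[0])
    out = [[0.0] * n_feat for _ in range(n)]
    for f in range(n_feat):
        col = [sample[f] for sample in batch]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        for i in range(n):
            out[i][f] = (batch[i][f] - mean) / math.sqrt(var + eps)
    return out


def layer_norm(batch, eps=1e-5):
    """Normalize each sample with the mean/variance of its own features."""
    out = []
    for sample in batch:
        mean = sum(sample) / len(sample)
        var = sum((v - mean) ** 2 for v in sample) / len(sample)
        out.append([(v - mean) / math.sqrt(var + eps) for v in sample])
    return out


sample = [1.0, 2.0, 3.0, 4.0]
# The same sample placed in two different micro-batches of size 2.
batch_a = [sample, [0.0, 0.0, 0.0, 0.0]]
batch_b = [sample, [10.0, 10.0, 10.0, 10.0]]

bn_a = batch_norm(batch_a)[0]  # every feature is above its batch mean
bn_b = batch_norm(batch_b)[0]  # every feature is below its batch mean
ln_a = layer_norm(batch_a)[0]
ln_b = layer_norm(batch_b)[0]  # identical to ln_a: no batch dependence
```

With a batch of two, the BN output for `sample` flips sign depending on its batch partner, while the per-sample normalization is unchanged. This only illustrates why batch statistics become unreliable at small batch sizes; it says nothing about which method trains better.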
We successfully train a 404-layer deep CNN on the ImageNet dataset and a 3002-layer network on CIFAR-10 and CIFAR-100, while the baseline is not able to converge at such extreme depths.
Recent years have seen tremendous progress in still-image segmentation; however, the naïve application of these state-of-the-art algorithms to every video frame requires considerable computation and ignores the temporal continuity inherent in video.
This work introduces pyramidal convolution (PyConv), which is capable of processing the input at multiple filter scales.
#5 best model for Semantic Segmentation on ADE20K
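Processing the same input at multiple filter scales can be sketched in one dimension as follows. This is only an illustration of the general idea under simplifying assumptions (fixed box kernels of odd size, zero padding, a single input channel); it is not the PyConv layer itself, and the helper names are hypothetical.

```python
def conv1d_same(signal, kernel):
    """1D cross-correlation with zero padding; output length equals input length.

    Assumes an odd kernel size so the padding is symmetric.
    """
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(padded[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal))
    ]


def multi_scale_conv(signal, kernels):
    """Apply every kernel to the same input and stack the results as channels."""
    return [conv1d_same(signal, kernel) for kernel in kernels]


# A unit impulse makes the receptive field of each kernel visible.
signal = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
kernels = [
    [1.0],            # size 1: fine detail
    [1.0, 1.0, 1.0],  # size 3: local context
    [1.0] * 5,        # size 5: wider context
]
channels = multi_scale_conv(signal, kernels)
```

Each output channel responds to the impulse over a span equal to its kernel size, so the stacked channels carry information from several spatial extents at once.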