Vi-MIX FOR SELF-SUPERVISED VIDEO REPRESENTATION

29 Sep 2021  ·  Srijan Das, Michael S Ryoo ·

Contrastive representation learning for videos relies heavily on exhaustive data augmentation strategies. Therefore, toward designing video augmentation for self-supervised learning, we first analyze the best strategy to mix videos to create a new augmented video sample. Then the question remains: can we make use of the other modalities in videos for data mixing? To this end, we propose Cross-Modal Manifold Cutmix (CMMC), which inserts a video tesseract into another video tesseract in the feature space across two different modalities. We find that our video mixing strategy, Vi-Mix, i.e. preliminary mixing of videos followed by CMMC across different modalities in a video, improves the quality of learned video representations. We conduct exhaustive experiments on two downstream tasks, action recognition and video retrieval, using three popular video datasets: UCF101, HMDB51, and NTU-60. We show that the performance of Vi-Mix on both downstream tasks is on par with other self-supervised approaches while requiring less training data.
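To make the CMMC idea concrete, the following is a minimal sketch of a CutMix-style mixing operation on two feature "tesseracts" (channel × time × height × width tensors, e.g. one from an RGB stream and one from an optical-flow stream). The function name, the NumPy implementation, and the cube-root block sizing are illustrative assumptions, not the paper's exact implementation, which operates inside a network's feature space.

```python
import numpy as np

def manifold_cutmix(feat_a, feat_b, lam, rng=None):
    """CutMix-style mixing of two feature tesseracts (illustrative sketch).

    feat_a, feat_b : arrays of shape (C, T, H, W), e.g. features from
                     two different modalities of the same video.
    lam            : target fraction of volume kept from feat_a.
    Returns the mixed feature and the realized mixing ratio.
    """
    rng = np.random.default_rng() if rng is None else rng
    C, T, H, W = feat_a.shape
    # Side length of the cut block so its volume is ~ (1 - lam)
    # of the spatiotemporal tesseract (assumption: cubic scaling).
    cut = (1.0 - lam) ** (1.0 / 3.0)
    t = max(1, int(T * cut))
    h = max(1, int(H * cut))
    w = max(1, int(W * cut))
    # Random placement of the block inside the tesseract.
    t0 = rng.integers(0, T - t + 1)
    h0 = rng.integers(0, H - h + 1)
    w0 = rng.integers(0, W - w + 1)
    # Paste the block from feat_b into a copy of feat_a.
    mixed = feat_a.copy()
    mixed[:, t0:t0 + t, h0:h0 + h, w0:w0 + w] = \
        feat_b[:, t0:t0 + t, h0:h0 + h, w0:w0 + w]
    # Realized fraction of the volume still coming from feat_a.
    lam_real = 1.0 - (t * h * w) / (T * H * W)
    return mixed, lam_real
```

A contrastive pipeline would apply this between intermediate features of two modality branches, then continue the forward pass on the mixed tensor; the realized ratio `lam_real` can be used if the training objective weights the two sources.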
