Contrastive Multiview Coding

ICLR 2020. Yonglong Tian, Dilip Krishnan, Phillip Isola

Humans view the world through many sensory channels, e.g., the long-wavelength light channel, viewed by the left eye, or the high-frequency vibrations channel, heard by the right ear. Each view is noisy and incomplete, but important factors, such as physics, geometry, and semantics, tend to be shared between all views (e.g., a "dog" can be seen, heard, and felt)...

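TASK | DATASET | MODEL | METRIC NAME | METRIC VALUE | GLOBAL RANK
Self-Supervised Image Classification | ImageNet | CMC (ResNet-50 x2) | Top 1 Accuracy | 70.6% | #17
Self-Supervised Image Classification | ImageNet | CMC (ResNet-50 x2) | Top 5 Accuracy | 89.7% | #12
Self-Supervised Image Classification | ImageNet | CMC (ResNet-50 x2) | Number of Params | 188M | #6
Self-Supervised Image Classification | ImageNet | CMC (ResNet-101)-deprecated | Top 1 Accuracy | 65.0% | #26
Self-Supervised Image Classification | ImageNet | CMC (ResNet-101)-deprecated | Top 5 Accuracy | 86.0% | #15
Self-Supervised Image Classification | ImageNet | CMC (ResNet-50) | Top 1 Accuracy | 66.2% | #22
Self-Supervised Image Classification | ImageNet | CMC (ResNet-50) | Top 5 Accuracy | 87.0% | #14
Self-Supervised Image Classification | ImageNet | CMC (ResNet-50) | Number of Params | 47M | #10
Self-Supervised Action Recognition | UCF101 | Contrastive Multiview Coding (CaffeNet x2) | 3-fold Accuracy | 59.1 | #11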

Methods used in the Paper