MARLIN: Masked Autoencoder for facial video Representation LearnINg

This paper proposes a self-supervised approach to learning universal facial representations from videos that transfer across a variety of facial analysis tasks, such as Facial Attribute Recognition (FAR), Facial Expression Recognition (FER), DeepFake Detection (DFD), and Lip Synchronization (LS). Our proposed framework, named MARLIN, is a facial video masked autoencoder that learns highly robust and generic facial embeddings from abundantly available, non-annotated, web-crawled facial videos. As a challenging auxiliary task, MARLIN reconstructs the spatio-temporal details of the face from densely masked facial regions, which mainly include the eyes, nose, mouth, lips, and skin. This lets the model capture both local and global aspects of the face, which in turn helps it encode generic and transferable features. Through experiments on diverse downstream tasks, we demonstrate that MARLIN is an excellent facial video encoder and feature extractor that performs consistently well on FAR (1.13% gain over the supervised benchmark), FER (2.64% gain over the unsupervised benchmark), DFD (1.86% gain over the unsupervised benchmark), and LS (29.36% gain in Fréchet Inception Distance), even in low-data regimes. Our code and models are available at https://github.com/ControlNet/MARLIN.
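The masked-reconstruction objective described in the abstract can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper or its repository: the helper names (`patchify`, `random_mask`, `masked_mse`), the 2x16x16 patch size, the 0.9 mask ratio, and above all the uniform random masking, which stands in for the facial-region masking the abstract describes. The decoder output is a zero-filled placeholder, not a real model.

```python
import numpy as np

def patchify(video, t=2, p=16):
    """Split a (T, H, W, C) clip into flattened (t, p, p) spatio-temporal patches."""
    T, H, W, C = video.shape
    assert T % t == 0 and H % p == 0 and W % p == 0
    x = video.reshape(T // t, t, H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # (nT, nH, nW, t, p, p, C)
    return x.reshape(-1, t * p * p * C)

def random_mask(num_patches, mask_ratio, rng):
    """Boolean mask over patches; True means hidden from the encoder.

    Assumption: uniform random masking, NOT the region-based masking
    of eyes, nose, mouth, lips, and skin described in the abstract.
    """
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

def masked_mse(pred, target, mask):
    """Mean squared error computed over the masked patches only."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

rng = np.random.default_rng(0)
clip = rng.random((16, 224, 224, 3), dtype=np.float32)  # synthetic stand-in clip
patches = patchify(clip)                       # one row per spatio-temporal patch
mask = random_mask(len(patches), 0.9, rng)     # assumed mask ratio
visible = patches[~mask]                       # what an encoder would receive
pred = np.zeros_like(patches)                  # placeholder for a decoder's output
loss = masked_mse(pred, patches, mask)         # reconstruction loss on hidden patches
```

The loss is evaluated only where `mask` is True, so the model is scored on the regions it never saw; the visible patches serve purely as context.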

CVPR 2023

Results from the Paper


Task | Dataset | Model | Metric | Value | Global Rank
--- | --- | --- | --- | --- | ---
Action Classification | CelebV-HQ | MARLIN | Accuracy | 95.48 | #1
Action Classification | CelebV-HQ | MARLIN | AUC | 0.9406 | #1
Facial Attribute Classification | CelebV-HQ | MARLIN | Accuracy | 93.9 | #1
Facial Attribute Classification | CelebV-HQ | MARLIN | AUC | 0.9561 | #1
Emotion Classification | CMU-MOSEI | MARLIN (ViT-S) | Accuracy | 80.38 | #3
Multimodal Sentiment Analysis | CMU-MOSEI | MARLIN (ViT-L) | Accuracy | 74.83 | #12
Multimodal Sentiment Analysis | CMU-MOSEI | MARLIN (ViT-B) | Accuracy | 73.7 | #13
Multimodal Sentiment Analysis | CMU-MOSEI | MARLIN (ViT-S) | Accuracy | 72.69 | #14
Emotion Classification | CMU-MOSEI | MARLIN (ViT-L) | Accuracy | 80.63 | #1
Emotion Classification | CMU-MOSEI | MARLIN (ViT-B) | Accuracy | 80.6 | #2
DeepFake Detection | FaceForensics++ | MARLIN (ViT-L) | AUC | 0.9377 | #2
DeepFake Detection | FaceForensics++ | MARLIN (ViT-S) | AUC | 0.8863 | #4
DeepFake Detection | FaceForensics++ | MARLIN (ViT-B) | AUC | 0.9305 | #3
Unconstrained Lip-synchronization | LRS2 | Wav2Lip + ViT + MARLIN | LSE-D | 7.127 | #1
Unconstrained Lip-synchronization | LRS2 | Wav2Lip + ViT + MARLIN | LSE-C | 5.528 | #2
Unconstrained Lip-synchronization | LRS2 | Wav2Lip + ViT + MARLIN | FID | 3.452 | #1

Methods


MAE • MARLIN