Glitch in the Matrix: A Large Scale Benchmark for Content Driven Audio-Visual Forgery Detection and Localization

3 May 2023 · Zhixi Cai, Shreya Ghosh, Abhinav Dhall, Tom Gedeon, Kalin Stefanov, Munawar Hayat

Most deepfake detection methods focus on detecting spatial and/or spatio-temporal changes in facial attributes and are centered around the binary classification task of detecting whether a video is real or fake. This is because available benchmark datasets contain mostly visual-only modifications present in the entirety of the video. However, a sophisticated deepfake may include small segments of audio or audio-visual manipulations that can completely change the meaning of the video content. To address this gap, we propose and benchmark a new dataset, Localized Audio Visual DeepFake (LAV-DF), consisting of strategic content-driven audio, visual and audio-visual manipulations. The proposed baseline method, Boundary Aware Temporal Forgery Detection (BA-TFD), is a 3D Convolutional Neural Network-based architecture which effectively captures multimodal manipulations. We further improve the baseline method (yielding BA-TFD+) by replacing the backbone with a Multiscale Vision Transformer and guiding the training process with contrastive, frame classification, boundary matching and multimodal boundary matching loss functions. The quantitative analysis demonstrates the superiority of BA-TFD+ on temporal forgery localization and deepfake detection tasks using several benchmark datasets including our newly proposed dataset. The dataset, models and code are available at https://github.com/ControlNet/LAV-DF.

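Task                           Dataset     Model    Metric Name  Metric Value  Global Rank
Temporal Forgery Localization  ForgeryNet  BA-TFD+  AP@0.5       93.13         # 1
Temporal Forgery Localization  ForgeryNet  BA-TFD+  AP@0.75      89.14         # 1
Temporal Forgery Localization  ForgeryNet  BA-TFD+  AP@0.95      81.09         # 1
Temporal Forgery Localization  ForgeryNet  BA-TFD+  AR@5         95.69         # 1
Temporal Forgery Localization  ForgeryNet  BA-TFD+  AR@2         90.63         # 1
Temporal Forgery Localization  LAV-DF      BA-TFD+  AR@100       81.62         # 2
Temporal Forgery Localization  LAV-DF      BA-TFD+  AR@50        80.48         # 2
Temporal Forgery Localization  LAV-DF      BA-TFD+  AR@20        79.4          # 2
Temporal Forgery Localization  LAV-DF      BA-TFD+  AR@10        78.75         # 2
Temporal Forgery Localization  LAV-DF      BA-TFD+  AP@0.5       96.3          # 2
Temporal Forgery Localization  LAV-DF      BA-TFD+  AP@0.75      84.96         # 2
Temporal Forgery Localization  LAV-DF      BA-TFD+  AP@0.95      4.44          # 2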
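
The metric names above follow the usual temporal localization convention, which this listing does not spell out: the number after AP@ is assumed to be a temporal intersection-over-union (IoU) threshold, and the number after AR@ is assumed to be how many top-scoring proposals are considered. The sketch below illustrates those two ideas only. The function names `temporal_iou` and `recall_at_n` are hypothetical, and the recall here uses a single IoU threshold for simplicity, whereas benchmark implementations commonly average over several thresholds. It is not the evaluation code from the LAV-DF repository.

```python
from typing import List, Tuple

Segment = Tuple[float, float]            # (start_sec, end_sec)
Proposal = Tuple[float, float, float]    # (start_sec, end_sec, confidence)


def temporal_iou(a: Segment, b: Segment) -> float:
    """Overlap duration divided by combined duration of two segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_n(proposals: List[Proposal], truths: List[Segment],
                n: int, iou_threshold: float) -> float:
    """Fraction of ground-truth segments matched by any of the top-n proposals.

    Simplifying assumption: one fixed IoU threshold instead of an average
    over a range of thresholds.
    """
    if not truths:
        return 0.0
    # Keep only the n most confident proposals.
    top = sorted(proposals, key=lambda p: p[2], reverse=True)[:n]
    hits = sum(
        1 for t in truths
        if any(temporal_iou((p[0], p[1]), t) >= iou_threshold for p in top)
    )
    return hits / len(truths)


if __name__ == "__main__":
    fake_segments = [(0.0, 2.0), (5.0, 6.2)]
    predicted = [(0.0, 2.0, 0.9), (5.0, 6.0, 0.8), (1.0, 3.0, 0.1)]
    print(recall_at_n(predicted, fake_segments, n=1, iou_threshold=0.5))
    print(recall_at_n(predicted, fake_segments, n=2, iou_threshold=0.5))
```

Under this reading, a stricter threshold such as 0.95 only counts a prediction when its boundaries almost exactly match the manipulated segment, which would explain why the AP@0.95 values sit well below the AP@0.5 values in both result sets.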

Methods