The Deep-Fake Fingerprint: Verifying the Authenticity of Audio-Visual Content
The explosive growth of online multimedia content, together with advances in artificial intelligence methods for synthesizing and manipulating media, has created a breadth of information security risks concerning the authenticity and veracity of audio-visual information. In particular, the proliferation of synthetic media, or deepfakes, which replace a person's face or voice in a video in ways increasingly imperceptible to humans, raises serious concerns for society and democracy. Consequently, a wealth of literature has emerged in the media forensics space proposing techniques for detecting deepfaked media content.
Computational detection methods for deepfakes are broadly classified into artifact-based and identity-based techniques. Prior work focuses primarily on unimodal methods that examine only the audio or only the visual elements of deepfaked content. For example, computer vision researchers have used advances in Convolutional Neural Networks (CNNs) to detect frame-wise inconsistencies and artifacts in image or video content. Audio researchers have drawn on speaker identification and verification, typically decomposing the audio signal into Mel-spectrograms or Mel-Frequency Cepstral Coefficients (MFCCs) that serve as features in classification models. More recent work has shown that identity-based detection, which does not rely on artifacts produced by specific deepfake generation methods, tends to be more robust and to generalize better.
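To make the audio pipeline concrete, the sketch below computes MFCC features from a raw waveform using only NumPy and SciPy. The frame length, hop size, filterbank size, and number of coefficients are illustrative defaults, not values prescribed by the project; a production system would more likely use a tested library such as librosa.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=26, n_ceps=13):
    """Minimal MFCC pipeline: frame -> window -> power spectrum ->
    mel filterbank -> log -> DCT. Parameter defaults are illustrative."""
    # Slice the signal into overlapping, Hamming-windowed frames
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank mapping FFT bins to mel bands
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log mel energies, then DCT to decorrelate into cepstral coefficients
    mel_energy = np.log(power @ fbank.T + 1e-10)
    return dct(mel_energy, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

The resulting matrix has one row of `n_ceps` coefficients per frame and can feed a downstream classifier directly, or be pooled over time into the vocal component of an identity fingerprint.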
Our project builds on this research by constructing a unique identity fingerprint of a person and using it to authenticate media content in which they appear, combining audio signal processing and computer vision techniques. The identity fingerprint comprises hand-crafted features that characterize the person's idiosyncratic facial, gestural, and vocal mannerisms. At detection time, our model compares a test fingerprint, generated from an in-the-wild video of the person, against the reference identity fingerprint using an appropriate distance metric.
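The detection step above can be sketched as a simple threshold test on a distance between fingerprint vectors. Cosine distance and the threshold value here are assumptions for illustration; the project leaves the choice of metric open ("an appropriate distance metric"), and in practice the threshold would be calibrated on held-out genuine and faked videos.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance between two fingerprint vectors (0 = identical direction)."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def is_authentic(reference_fp, test_fp, threshold=0.3):
    """Accept the test video as authentic when its fingerprint stays
    within `threshold` of the person's reference fingerprint.
    The threshold value is a hypothetical placeholder."""
    return cosine_distance(reference_fp, test_fp) <= threshold
```

A deepfake that reproduces the face pixel-perfectly but misses the person's gestural or vocal mannerisms would drift in fingerprint space and fall outside the threshold.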
The core deliverable for this project will be a research paper aimed at publication in an academic venue such as IEEE Transactions on Signal Processing, the Journal of Online Trust and Safety, or a CVPR workshop. We expect this work to have a positive impact by increasing the trust and safety of multimedia content.
Illustration by Max Fleishman, via Daily Dot.