Reenactment Using pix2pix

Learn about the process of data preparation and how to train pix2pix to generate fake images.

Reenactment is another mode of operation for the deepfakes setup, one that is arguably better at generating believable fake content than the replacement mode. In earlier sections, we discussed different techniques used to perform reenactment, such as those that focus on the gaze, the expressions, or the mouth. We also discussed image-to-image translation architectures.

In this section, we'll leverage the pix2pix GAN to develop a face reenactment setup from scratch. We'll build a network that lets us use our own face, mouth, and expressions to control the face of Barack Obama (former US president). We'll go through each step, starting with preparing the dataset, then defining the pix2pix architecture, and finally generating the reenacted output. Let's get started.

Dataset preparation

We'll use the pix2pix GAN as the backbone network for our current task of reenactment. While pix2pix is a powerful network that can be trained with very few samples, it comes with one restriction: the training samples must be paired. In this section, we'll turn this restriction to our advantage.

Since the aim is to analyze a target face and control it using a source face, we can leverage what is common between faces to develop a dataset for our use case. What different faces share is the presence of facial landmarks and their relative positioning. In the Key Feature Set lesson, we discussed how straightforward it is to build a facial landmark detection module using libraries such as dlib, cv2, and MTCNN.
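As a refresher, the sketch below shows one way such a module could look using dlib. It assumes dlib's pre-trained 68-point model file, shape_predictor_68_face_landmarks.dat (available from dlib's website), is present in the working directory; the helper name get_landmarks is our own and not taken from the course repository.

```python
import cv2
import dlib

# dlib's built-in frontal face detector and the pre-trained
# 68-point shape predictor (the model file path is an assumption).
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def get_landmarks(image):
    """Return a list of 68 (x, y) landmark tuples for the first
    detected face, or None if no face is found."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if len(faces) == 0:
        return None
    shape = predictor(gray, faces[0])
    return [(shape.part(i).x, shape.part(i).y) for i in range(68)]
```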

For our current use case, we'll prepare paired training samples, each consisting of a frame's facial landmarks and the corresponding photograph. To generate reenacted content, we can then simply extract the facial landmarks of the source face (the controlling entity) and use pix2pix to generate a high-quality, photorealistic output of the target person. In our case, the source/controlling personality could be you or any other person, while the target personality is Barack Obama.
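Because pix2pix consumes images on both sides of a pair, the detected landmarks need to be rendered as an image. One simple option, a sketch under our own assumptions rather than the exact approach from the course repository, is to draw each landmark as a small white dot on a blank canvas the same size as the frame; connecting points along the jaw, eyes, and mouth with cv2.polylines would give an even richer conditioning image.

```python
import numpy as np
import cv2

def draw_landmarks(landmarks, height, width):
    """Render (x, y) landmark points onto a black canvas. The rendered
    canvas is the pix2pix input; the original frame is the target."""
    canvas = np.zeros((height, width, 3), dtype=np.uint8)
    for (x, y) in landmarks:
        cv2.circle(canvas, (x, y), 2, (255, 255, 255), -1)  # filled white dot
    return canvas
```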

To prepare our dataset, we'll extract frames from a video along with the facial landmarks of each frame. Since we want to train our network to generate high-quality colored output images from landmark inputs, we need a video of Barack Obama; you can download one from various sources on the internet.
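Extracting the frames themselves is a short loop with OpenCV. The sketch below is illustrative: the file name obama_speech.mp4 and the every-tenth-frame sampling rate are assumptions you would adjust to your own video.

```python
import os
import cv2

os.makedirs("frames", exist_ok=True)

# Read the downloaded speech video and dump every 10th frame to disk
# (sampling reduces near-duplicate consecutive frames).
cap = cv2.VideoCapture("obama_speech.mp4")  # hypothetical file name
frame_idx = saved = 0
while True:
    ok, frame = cap.read()
    if not ok:  # end of video
        break
    if frame_idx % 10 == 0:
        cv2.imwrite(f"frames/frame_{saved:04d}.png", frame)
        saved += 1
    frame_idx += 1
cap.release()
```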

This exercise is, again, for academic and educational purposes only. Please use any videos responsibly and with caution.

Generating a paired dataset of landmarks and video frames is a straightforward application of the code snippets given in the Facial landmarks section, so we leave the full implementation as an exercise for the reader; the complete code is also available in the code repository for this course. We generated close to 400 paired samples from one of Barack Obama's speeches, and the figure below presents a few of them.
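For orientation, here is a rough sketch of how the earlier helpers could be combined; it reuses the hypothetical get_landmarks and draw_landmarks functions defined above and is not the course repository's exact implementation. A common convention, used by the original pix2pix code, is to store each pair as a single side-by-side image with the input on the left and the target on the right.

```python
import os
import cv2
import numpy as np

os.makedirs("pairs", exist_ok=True)

for name in sorted(os.listdir("frames")):
    frame = cv2.imread(os.path.join("frames", name))
    landmarks = get_landmarks(frame)      # dlib helper from the earlier sketch
    if landmarks is None:                 # skip frames with no detected face
        continue
    h, w = frame.shape[:2]
    sketch = draw_landmarks(landmarks, h, w)  # rendering helper from above
    # Store each training pair as one image: [landmark sketch | real frame].
    pair = np.concatenate([sketch, frame], axis=1)
    cv2.imwrite(os.path.join("pairs", name), pair)
```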
