Using Pixel-Wise Labels to Translate Images with pix2pix

Explore how to use pixel-wise labels to translate images with the pix2pix model.

Labels can be assigned to individual pixels; such labels are known as pixel-wise labels. Pixel-wise labels play an increasingly important role in deep learning. For example, one of the most famous online image classification contests, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), has not been hosted since its last event in 2017, whereas object detection and segmentation challenges such as COCO are receiving more attention.

How object detection works

Semantic segmentation

An iconic application of pixel-wise labeling is semantic segmentation. Semantic segmentation (also called image or object segmentation) is a task in which every pixel in the image is assigned to one object class. The most promising application of semantic segmentation is autonomous cars (or self-driving cars). If every pixel captured by the camera mounted on a self-driving car is correctly classified, all of the objects in the scene can be easily recognized. This makes it much easier for the vehicle to analyze its current environment and decide whether it should, for example, turn or slow down to avoid other vehicles and pedestrians.
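To make the idea of pixel-wise labels concrete, here is a minimal PyTorch sketch that treats segmentation as per-pixel classification. The tiny network and the four-class setup are illustrative assumptions for this lesson, not part of pix2pix.

```python
# A minimal sketch of pixel-wise labeling, assuming a toy 4-class segmentation task.
import torch
import torch.nn as nn

num_classes = 4                      # e.g. road, vehicle, pedestrian, background (assumed)
model = nn.Sequential(               # tiny fully-convolutional classifier
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, num_classes, kernel_size=1),      # one logit per class, per pixel
)

image = torch.randn(1, 3, 256, 256)                      # a 3-channel input image
labels = torch.randint(0, num_classes, (1, 256, 256))    # one class index per pixel

logits = model(image)                          # shape: (1, num_classes, 256, 256)
loss = nn.CrossEntropyLoss()(logits, labels)   # cross-entropy evaluated pixel by pixel
print(logits.shape, loss.item())
```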

Transforming a color image into a segmentation map

Transforming the original color image into a segmentation map (as shown in the preceding diagram) can be considered an image-to-image translation problem, which is a much larger field that also includes style transfer, image colorization, and more. Image style transfer is about moving the characteristic textures and colors from one image to another, such as combining a photo with a Vincent van Gogh painting to create a unique artistic portrait. Image colorization is a task where we feed a 1-channel grayscale image to the model and let it predict the color information for each pixel, which leads to a 3-channel color image.
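As a quick illustration of image colorization as an image-to-image mapping, the following toy network maps a 1-channel grayscale input to a 3-channel color output. The architecture here is an assumption for illustration only, not the pix2pix generator.

```python
# A minimal sketch of colorization as image-to-image translation: 1 channel in, 3 channels out.
import torch
import torch.nn as nn

colorizer = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 3, kernel_size=3, padding=1),
    nn.Tanh(),                         # color outputs scaled to [-1, 1]
)

gray = torch.randn(1, 1, 256, 256)     # 1-channel grayscale input
color = colorizer(gray)                # 3-channel color prediction
print(color.shape)                     # torch.Size([1, 3, 256, 256])
```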

GANs can be used for image-to-image translation as well. In this section, we will use a classic image-to-image translation model, pix2pix, to transform images from one domain to another. Pix2pix (Isola, Phillip, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. "Image-to-Image Translation with Conditional Adversarial Networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134, 2017.) was designed to learn the mapping between paired collections of images, for example, transforming an aerial photo taken by a satellite into a regular map or a sketch image into a color image, and vice versa.
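Concretely, pix2pix trains its generator with a conditional adversarial loss plus an L1 reconstruction term (weighted by λ = 100 in the paper). The sketch below uses a placeholder discriminator and dummy tensors rather than the classes from the official repository.

```python
# A hedged sketch of the pix2pix generator objective: conditional GAN loss + lambda * L1.
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()
lambda_l1 = 100.0                       # weighting used in the original paper

def generator_loss(discriminator, real_input, fake_output, real_output):
    # The discriminator sees the input image concatenated with a candidate output image.
    pred_fake = discriminator(torch.cat([real_input, fake_output], dim=1))
    adv = bce(pred_fake, torch.ones_like(pred_fake))    # try to fool the discriminator
    recon = l1(fake_output, real_output)                # stay close to the paired target
    return adv + lambda_l1 * recon

# Quick check with dummy tensors and a trivially small (assumed) discriminator.
disc = nn.Sequential(nn.Conv2d(6, 1, kernel_size=4, stride=4))
x = torch.randn(1, 3, 256, 256)         # input-domain image
y = torch.randn(1, 3, 256, 256)         # target-domain image
y_fake = torch.randn(1, 3, 256, 256)    # stand-in for generator(x)
print(generator_loss(disc, x, y_fake, y).item())
```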

The authors of the paper have kindly provided the full source code (https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix.git) for their work, which is implemented in PyTorch and is well organized. Therefore, we will use the code directly to train and evaluate the pix2pix model and learn how to organize our models in a different way.

Generator architecture

The architecture of the generator network of pix2pix is as follows:

Generator architecture of pix2pix

Here, we assume that both the input and output data are 3-channel 256×256 images. To illustrate the generator structure of pix2pix, feature maps are represented by colored blocks and convolution operations are represented by gray and blue arrows: gray arrows are convolution layers that reduce the feature map sizes, and blue arrows are layers that double the feature map sizes. Identity mapping (including skip connections) is represented by black arrows.
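The two arrow types correspond to strided convolutions (downsampling) and transposed convolutions (upsampling). Below is a minimal sketch assuming the commonly used 4×4 kernels with stride 2; the channel counts in the full generator differ.

```python
# Downsampling halves the feature map size; upsampling doubles it.
import torch
import torch.nn as nn

down = nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1)          # halves H and W
up = nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1)   # doubles H and W

x = torch.randn(1, 3, 256, 256)
h = down(x)            # (1, 64, 128, 128): smaller feature maps, wider channels
y = up(h)              # (1, 3, 256, 256): back to the input resolution
print(h.shape, y.shape)
```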

We can see that the first half of the layers gradually transform the input image into 1×1 feature maps (with wider channels), while the second half transforms these very small feature maps into an output image of the same size as the input image. The network compresses the input data into much lower dimensions and then expands it back to the original dimensions. Therefore, this U-shaped network structure is often known as a U-Net. There are also many skip connections in the U-Net that connect the mirrored layers in order to help information (including details coming from previous layers in the forward pass and gradients coming from the latter layers in the backward pass) ...
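The following is a minimal sketch of the U-Net idea with a single skip connection implemented via channel concatenation. The real pix2pix generator stacks eight such encoder/decoder levels, so treat this as an illustrative assumption rather than the repository's implementation.

```python
# A tiny U-Net-style generator: two downsampling steps, two upsampling steps, one skip connection.
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.down1 = nn.Conv2d(3, 64, 4, stride=2, padding=1)            # 256 -> 128
        self.down2 = nn.Conv2d(64, 128, 4, stride=2, padding=1)          # 128 -> 64
        self.up1 = nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1)   # 64 -> 128
        # the skip connection concatenates mirrored feature maps, so 64 + 64 channels go in
        self.up2 = nn.ConvTranspose2d(128, 3, 4, stride=2, padding=1)    # 128 -> 256

    def forward(self, x):
        d1 = torch.relu(self.down1(x))
        d2 = torch.relu(self.down2(d1))
        u1 = torch.relu(self.up1(d2))
        u1 = torch.cat([u1, d1], dim=1)      # skip connection to the mirrored layer
        return torch.tanh(self.up2(u1))

net = TinyUNet()
out = net(torch.randn(1, 3, 256, 256))
print(out.shape)                             # torch.Size([1, 3, 256, 256])
```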