Mesh R-CNN

Learn about the Mesh R-CNN architecture and how to use it to predict 3D models from real-world images.

Overview

Mesh R-CNN is a landmark model in 3D deep learning and one of the first computer vision models to predict 3D shapes from real-world images. Built on the modular R-CNN design, it reuses much of that architecture but introduces a new mesh prediction branch with several key innovations. We’ll begin by introducing Mesh R-CNN, then examine the mesh prediction branch in detail, and follow up with code examples. Reproducing the entire Mesh R-CNN project would require more time, data, and compute than we have here, so we’ll instead explore several code examples that implement its key components.

Introduction to Mesh R-CNN

First introduced in the paper “Mesh R-CNN” in 2019, the Mesh R-CNN architecture builds upon prior research into R-CNN models like Faster R-CNN and Mask R-CNN. As a result, pretrained Mask R-CNN models can be used as the backbone to incorporate strong priors into Mesh R-CNN’s predictions. Mesh R-CNN introduces a mesh prediction branch, which processes image features through a series of stages to predict a 3D model for each detected object.

Mesh R-CNN has a number of features that make it both innovative for its time and usable today. Some of these features include:

  • Predicts arbitrary (untextured) 3D meshes from a single image

  • Works on real-world images

  • Can use pretrained backbones that have been trained on general computer vision tasks (such as ResNet)

  • Doesn’t require a template mesh

Review of Mask R-CNN

Since Mesh R-CNN relies heavily on Mask R-CNN, we begin with a quick review of the Mask R-CNN architecture. Mask R-CNN is built upon the Faster R-CNN architecture for object detection, which consists of two sequential stages:

  1. A region proposal network (RPN) uses a convolutional neural network to propose candidate bounding boxes.

  2. The RoIPool stage aggregates features from the bounding boxes for classification and regression.
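Both stages are available in off-the-shelf implementations. Here is a minimal sketch using torchvision’s fasterrcnn_resnet50_fpn (our choice for illustration; it is not the implementation used in the Mesh R-CNN paper), showing the two-stage detector running end to end:

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Build an (untrained) Faster R-CNN with a ResNet-50 FPN backbone;
# internally, the RPN proposes boxes and the RoI heads classify/regress them
model = fasterrcnn_resnet50_fpn(weights=None)
model.eval()

# Run inference on a dummy image
images = [torch.rand(3, 256, 256)]
with torch.no_grad():
    predictions = model(images)
print(predictions[0].keys())  # dict with 'boxes', 'labels', and 'scores'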

Mask R-CNN makes a key contribution called RoIAlign, a technique that enables segmentation mask prediction. It is a variation of another technique in Faster R-CNN, called RoIPool, that is used to pool features from bounding boxes. An RoI, or region of interest, is a local region of the image from which relevant features are gathered. Unlike RoIPool, which quantizes the region to the feature-map grid before pooling, RoIAlign applies bilinear interpolation to the underlying feature map to interpolate features for each output point. This enables it to gather local information without losing resolution, which is essential for segmentation tasks.
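To make RoIAlign concrete, here is a minimal sketch using torchvision’s roi_align operator (again, our choice for illustration; the official Mesh R-CNN code builds on Detectron2). Note how the box edges fall between feature-map cells, which RoIPool would have to snap to the grid:

import torch
from torchvision.ops import roi_align

# A 1x1x4x4 feature map with values 0..15 for easy inspection
features = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

# One region of interest in (batch_index, x1, y1, x2, y2) format;
# the box edges intentionally fall between feature-map cells
rois = torch.tensor([[0.0, 0.5, 0.5, 2.5, 2.5]])

# Bilinear interpolation samples features at each output point instead of
# quantizing the box to the grid the way RoIPool does
pooled = roi_align(features, rois, output_size=(2, 2), aligned=True)
print(pooled)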

Depiction of RoIAlign (dotted grid represents the feature map and solid red grid represents the RoI)

Next, we take a deeper look at the mesh prediction branch.

Mesh prediction branch

The mesh prediction branch seeks to map the input features from RoIAlign to a 3D triangle mesh that best approximates the shape of the object. From the Mask R-CNN backbone, we are given a collection of image features in the 2D image space.

The Mesh R-CNN architecture

As the architecture shows, the features from the RoIAlign operation flow into two separate branches: the box/mask branch and the voxel branch. The box/mask branch is identical to the one found in Mask R-CNN. To predict meshes, Mesh R-CNN introduces a new pipeline that first predicts a voxel grid, then converts it into a mesh, and finally refines the predicted mesh for additional detail. We’ll now examine each of these steps in closer detail.

Voxel prediction branch

The voxel prediction branch is essentially a generalization of the mask prediction branch. Instead of predicting a 2D mask of occupancy estimates, it predicts a 3D voxel grid of occupancy estimates. A small fully convolutional network (FCN) is used to maintain correspondence between image features and voxels. An FCN is a neural network that consists only of convolutions (no fully connected layers) and is typically used in encoder-decoder networks.

The following code depicts an example implementation of a voxel prediction branch, based on the official Mesh R-CNN implementation.

import torch
from torch import nn
from torch.nn import functional as F


class VoxelPredictionLayer(nn.Module):
    """
    A voxel head with several conv layers, plus an upsample layer (with `ConvTranspose2d`)
    """
    def __init__(self, input_channels: int):
        """
        Args:
            input_channels: The number of channels in the input tensor
        """
        super(VoxelPredictionLayer, self).__init__()
        # Number of Mesh R-CNN classes
        self.num_classes = 9
        # Dimensions of conv layers
        conv_dims = 256
        # The number of convs in the voxel head and the number of channels
        num_conv = 4
        # The number of depth channels for the predicted voxels
        self.num_depth = 28
        # Define conv layers
        self.conv_norm_relus = []
        for k in range(num_conv):
            conv = nn.Conv2d(
                input_channels if k == 0 else conv_dims,
                conv_dims,
                kernel_size=3,
                stride=1,
                padding=1,
                bias=True,
            )
            self.add_module("voxel_fcn{}".format(k + 1), conv)
            self.conv_norm_relus.append(conv)
        # Define deconv layer
        self.deconv = nn.ConvTranspose2d(
            conv_dims if num_conv > 0 else input_channels,
            conv_dims,
            kernel_size=2,
            stride=2,
            padding=0,
        )
        # Define output (prediction) layer
        self.predictor = nn.Conv2d(
            conv_dims,
            self.num_classes * self.num_depth,
            kernel_size=1,
            stride=1,
            padding=0,
        )
        # Use normal distribution initialization for voxel prediction layer
        nn.init.normal_(self.predictor.weight, std=0.001)
        if self.predictor.bias is not None:
            nn.init.constant_(self.predictor.bias, 0)

    def forward(self, x):
        # Forward pass through conv layers
        for layer in self.conv_norm_relus:
            x = F.relu(layer(x))
        # Apply deconv layer to upsample the feature map
        x = F.relu(self.deconv(x))
        # Apply output layer
        x = self.predictor(x)
        # Reshape from (N, C*D, H, W) to (N, C, D, H, W)
        x = x.reshape(x.size(0), self.num_classes, self.num_depth, x.size(2), x.size(3))
        return x


# Create an instance of a voxel prediction layer
voxel_head = VoxelPredictionLayer(4)
print(voxel_head)

# Test the forward pass
x = torch.randn([1, 4, 8, 8])
output = voxel_head(x)
print(output.shape)

The main objective of this layer is to convert the representation from a 2D grid into a 3D grid, which is implemented in a straightforward fashion with a single reshape call. Otherwise, this layer is a simple FCN consisting of Conv2d and ConvTranspose2d layers. The ConvTranspose2d layer increases the output resolution of the branch to provide more voxels to work with.

Keep in mind the effect of the camera frustum: the 3D area that each pixel covers increases with distance from the camera. Since the camera frustum effect distorts the shape of voxels, the voxel space is transformed by the camera intrinsic matrix K to accommodate the frustum effect. That means that Mesh R-CNN must either know or assume the camera intrinsics.
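To see why, consider a minimal sketch (ours, not from the original implementation) that back-projects the same pixel grid at two depths using a hypothetical pinhole intrinsic matrix K. The metric footprint of the grid doubles when the depth doubles:

import torch

# A hypothetical pinhole intrinsic matrix with focal length 2 and
# a principal point at the origin (purely for illustration)
K = torch.tensor([
    [2.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 1.0],
])

def frustum_voxel_centers(depths, ny=4, nx=4):
    # Back-project a regular pixel grid to camera space at each depth
    K_inv = torch.linalg.inv(K)
    centers = []
    for z in depths:
        ys, xs = torch.meshgrid(
            torch.linspace(-1.0, 1.0, ny),
            torch.linspace(-1.0, 1.0, nx),
            indexing="ij",
        )
        # Homogeneous pixel coordinates (x, y, 1)
        pixels = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)
        centers.append(z * pixels @ K_inv.T)
    return torch.stack(centers)

points = frustum_voxel_centers([1.0, 2.0])
# The same pixel spans twice the metric extent at twice the depth
print(points[0, 0, 0])  # tensor([-0.5000, -0.5000,  1.0000])
print(points[1, 0, 0])  # tensor([-1.0000, -1.0000,  2.0000])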

Voxel2mesh conversion

The voxel2mesh conversion step serves as a necessary bridge between the voxel prediction and mesh refinement branches. After all, to refine a mesh, we first need a mesh. Given the output voxel grid of occupancy estimates from the voxel prediction branch, the voxel2mesh stage simply applies the familiar cubify operator, an operation that converts a voxel grid into a 3D mesh.
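PyTorch3D provides a cubify operator that performs this conversion. Here is a minimal sketch on a toy occupancy grid (the grid values and threshold are arbitrary choices for illustration):

import torch
from pytorch3d.ops import cubify

# A toy 2x2x2 occupancy grid for one example (shape: N x D x H x W)
voxels = torch.zeros(1, 2, 2, 2)
voxels[0, 0, 0, 0] = 0.9  # above the threshold, so it becomes a cube
voxels[0, 1, 1, 1] = 0.3  # below the threshold, so it is discarded

# Each voxel above the threshold is replaced with a triangulated unit cube
meshes = cubify(voxels, thresh=0.5)
print(meshes.verts_list()[0].shape)  # 8 vertices for the single cube
print(meshes.faces_list()[0].shape)  # 12 triangular faces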

Recall that the cubify operator takes a grid of input voxels with occupancy values σ ranging in [0, 1] and applies an input threshold t. Voxels with ...