How to perform instance segmentation using Mask R-CNN
Key takeaways:
Mask R-CNN is a sophisticated computer vision model that integrates object detection and semantic segmentation, enabling precise identification and pixel-level segmentation of objects in images.
The backbone network component of Mask R-CNN extracts features from the input image, generating feature maps that capture essential details such as edges and textures. Popular backbone architectures include ResNet and the feature pyramid network (FPN).
Mask R-CNN predicts segmentation masks for objects in each RoI, providing detailed outlines of object shapes and enhancing traditional detection methods.
Mask R-CNN (Mask Region-based Convolutional Neural Network) is an advanced computer vision model that combines object detection and semantic segmentation. This enables it to identify objects within an image and segment them precisely at the pixel level.
Mask R-CNN components
Mask R-CNN comprises several essential components that together achieve object detection and instance segmentation. The combination of these components allows Mask R-CNN to achieve state-of-the-art results in various computer vision tasks. These components include:
Backbone network
The backbone network is responsible for feature extraction. It takes the entire image as input and generates feature maps that capture critical information for subsequent tasks. These features include edges, corners, shapes, and textures. Popular backbone network architectures, like ResNet and the feature pyramid network (FPN), are commonly used for this purpose.
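As a minimal sketch (not the internals of the Mask R-CNN repository used later), the snippet below shows how a ResNet backbone turns an image into feature maps using tf.keras.applications; the input size and weights=None are illustrative choices:

import numpy as np
import tensorflow as tf

# ResNet50 without its classification head acts as a feature extractor.
# weights=None avoids downloading pretrained weights for this illustration.
backbone = tf.keras.applications.ResNet50(include_top=False, weights=None, input_shape=(512, 512, 3))

image = np.random.rand(1, 512, 512, 3).astype('float32')  # a dummy input image
feature_maps = backbone(image)

# The spatial resolution shrinks while the channel depth grows, encoding
# edges, corners, shapes, and textures.
print(feature_maps.shape)  # (1, 16, 16, 2048)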
Region Proposal Network (RPN)
The feature maps produced by the backbone network are used by the Region Proposal Network (RPN), which is responsible for proposing potential object regions within an image. Instead of scanning the input image, the RPN scans the feature maps to find these regions. They are called regions of interest (RoIs) and are represented as rectangular bounding boxes. Each box is accompanied by a confidence score, which indicates the likelihood that it contains an object.
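A simplified sketch of an RPN head is shown below (the layer sizes and anchor count are hypothetical, not taken from the article's repository). It slides over the feature maps and, for each of several anchor boxes per location, predicts an objectness score and box refinements:

import tensorflow as tf
from tensorflow.keras import layers

num_anchors = 9  # e.g., 3 scales x 3 aspect ratios (an assumed choice)

feature_maps = tf.keras.Input(shape=(None, None, 256))
shared = layers.Conv2D(512, 3, padding='same', activation='relu')(feature_maps)

# One objectness score per anchor per location (object vs. background).
objectness = layers.Conv2D(num_anchors, 1, activation='sigmoid')(shared)
# Four box-delta values (dy, dx, dh, dw) per anchor per location.
box_deltas = layers.Conv2D(num_anchors * 4, 1)(shared)

rpn = tf.keras.Model(feature_maps, [objectness, box_deltas])
rpn.summary()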
RoIAlign
RoIAlign extracts features from RoIs within the feature maps generated by the backbone network. It transforms the irregularly shaped RoIs into fixed-size feature maps, which can then be used for subsequent classification and mask prediction. RoIAlign addresses the need for precise spatial alignment, which is essential for tasks like object detection and instance segmentation.
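RoIAlign samples the feature map bilinearly at exact, non-quantized locations. TensorFlow's built-in tf.image.crop_and_resize offers a close approximation we can use to illustrate the idea: it crops each RoI from the feature map and bilinearly resizes it to a fixed size (the shapes below are illustrative):

import tensorflow as tf

feature_map = tf.random.normal([1, 64, 64, 256])   # stand-in for backbone output

# RoIs in normalized [y1, x1, y2, x2] coordinates.
rois = tf.constant([[0.1, 0.2, 0.5, 0.6],
                    [0.3, 0.3, 0.9, 0.8]])
box_indices = tf.zeros([2], dtype=tf.int32)        # all RoIs come from image 0

# Every RoI, whatever its shape, becomes a fixed 7x7x256 feature map.
aligned = tf.image.crop_and_resize(feature_map, rois, box_indices, crop_size=[7, 7])
print(aligned.shape)  # (2, 7, 7, 256)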
Object detection head
The detection head identifies and localizes objects in the image. This component is responsible for classifying objects and predicting bounding boxes. It takes the fixed-size feature maps from the RoIs, performs object classification, and assigns a class label to each proposed object region, determining what the object is.
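A minimal sketch of such a head is below (hypothetical layer sizes, not the repository's exact architecture): the fixed-size RoI features pass through dense layers and then split into a class prediction and per-class bounding-box refinements:

import tensorflow as tf
from tensorflow.keras import layers

num_classes = 81  # COCO classes, including background

roi_features = tf.keras.Input(shape=(7, 7, 256))   # output of RoIAlign
x = layers.Flatten()(roi_features)
x = layers.Dense(1024, activation='relu')(x)
x = layers.Dense(1024, activation='relu')(x)

class_probs = layers.Dense(num_classes, activation='softmax')(x)  # what the object is
box_deltas = layers.Dense(num_classes * 4)(x)                     # where exactly it is

head = tf.keras.Model(roi_features, [class_probs, box_deltas])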
Instance segmentation
Mask R-CNN accurately outlines the shape of objects by predicting a pixel-level segmentation mask for each RoI. This enables instance segmentation, which goes beyond traditional object detection by providing detailed object segmentation. Mask R-CNN predicts the segmentation masks in parallel with the existing classification and bounding box regression.
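For intuition, here is a minimal sketch of a mask head (layer sizes assumed, loosely following the original paper's design): a small fully convolutional network upsamples each RoI's features and predicts one sigmoid mask per class:

import tensorflow as tf
from tensorflow.keras import layers

num_classes = 81

roi_features = tf.keras.Input(shape=(14, 14, 256))
x = roi_features
for _ in range(4):
    x = layers.Conv2D(256, 3, padding='same', activation='relu')(x)
x = layers.Conv2DTranspose(256, 2, strides=2, activation='relu')(x)  # 14x14 -> 28x28

# One binary mask per class; at inference, the mask for the predicted class is kept.
masks = layers.Conv2D(num_classes, 1, activation='sigmoid')(x)

mask_head = tf.keras.Model(roi_features, masks)  # output shape: (None, 28, 28, 81)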
Implementing Mask R-CNN in Python
To run the code with TensorFlow version 2.13, we replaced deprecated functions, such as tf.log with tf.math.log, and migrated from the keras.engine module to the keras.layers module. This ensures compatibility and adherence to the latest practices in TensorFlow.
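For illustration, the substitutions look like the following (a sketch of the kind of changes involved, not the repository's full diff):

import tensorflow as tf

p = tf.constant(0.5)
# TF 1.x style (now removed):  loss = -tf.log(p)
loss = -tf.math.log(p)  # TF 2.x replacement

# Old-style Keras internals import used by older code:
#   from keras.engine import Layer
# Its TF 2.x equivalent:
from tensorflow.keras.layers import Layer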
We configure the Mask R-CNN model using a custom configuration. We define parameters such as the number of GPUs, the number of images per GPU, and the number of classes. The remaining parameters, such as VALIDATION_STEPS and STEPS_PER_EPOCH, are inherited from the base Config class in Mask_RCNN/mrcnn/config.py.
from mrcnn.config import Config

class MaskRCNN_config(Config):
    # The configuration name
    NAME = 'MaskRCNN_config_inference'
    # Number of GPUs to use. When using only a CPU, this needs to be set to 1.
    GPU_COUNT = 1
    # Number of images to train with on each GPU
    IMAGES_PER_GPU = 1
    # Number of classes (including background)
    NUM_CLASSES = 81

config = MaskRCNN_config()
We initialize the model for inference, indicating that it will be used to make predictions on new, unseen data. We pass the configuration and set up the current directory where the model files and weights will be stored.
from mrcnn import model as model_lib

print('loading weights for Mask R-CNN model…')
model = model_lib.MaskRCNN(mode='inference', config=config, model_dir='./')
We load pretrained weights from the COCO dataset, which is commonly used for training and evaluating object detection and segmentation models. Setting by_name to True ensures that the weights are correctly matched and loaded into the corresponding layers of the model.
model.load_weights('mask_rcnn_coco.h5', by_name=True)
We define a function to visualize an image with bounding boxes drawn around detected objects. We pass the filename of the image and boxes_list, a list of bounding boxes. Each box is represented by a tuple of four values, (y1, x1, y2, x2), which denote the coordinates of the top-left and bottom-right corners of the bounding box. The bounding box colors are selected from a list of predefined colors.
from matplotlib import pyplot
from matplotlib.patches import Rectangle

# Draw an image with detected objects
def draw_image_with_boxes(filename, boxes_list):
    # Load the image
    data = pyplot.imread(filename)
    # Plot the image
    pyplot.imshow(data)
    # Get the context for drawing boxes
    ax = pyplot.gca()
    ax.set_xticks([])
    ax.set_yticks([])
    colors = [
        (1.0, 0.0, 0.0, 1.0),        # Red
        (0.0, 1.0, 0.0, 1.0),        # Green
        (0.0, 0.0, 1.0, 1.0),        # Blue
        (1.0, 1.0, 0.0, 1.0),        # Yellow
        (0.0, 0.0, 0.0, 1.0),        # Black
        (1.0, 1.0, 1.0, 1.0),        # White
        (0.502, 0.0, 0.502, 1.0),    # Purple
        (0.545, 0.271, 0.075, 1.0)   # Brown
    ]
    # Plot each box
    for i, box in enumerate(boxes_list):
        # Get coordinates
        y1, x1, y2, x2 = box
        # Calculate width and height of the box
        width, height = x2 - x1, y2 - y1
        # Create the shape, cycling through the colors if there are many boxes
        color = colors[i % len(colors)]
        rect = Rectangle((x1, y1), width, height, fill=False, color=color, lw=5)
        # Draw the box
        ax.add_patch(rect)
    # Show the plot
    pyplot.show()
We call the detect function of the model object, passing the image (in a list) as input. The result of the detection process is stored in the results variable. The results contain information about the detected objects, such as bounding box coordinates, class labels, and segmentation masks. The results structure contains:
'rois': A NumPy array with the shape (N, 4), where N is the number of detected objects. Each row represents a bounding box for a detected object, defined by its top-left and bottom-right coordinates (in the format [y1, x1, y2, x2]).
'class_ids': A NumPy array of class IDs for each of the detected objects. The class ID represents the class label to which the object belongs.
'scores': A NumPy array with confidence scores or probabilities associated with each of the detected objects. These scores indicate the model's confidence in the correctness of the detection. Higher scores typically represent greater confidence.
'masks': A NumPy array that represents the segmentation masks for each detected object. The format of this array is a binary mask (with True indicating the object and False indicating the background) for each object. The shape of this array is often (N, H, W), where N is the number of objects, and H and W represent the height and width of the mask.
The results structure looks like this:
[{'rois': array([[ 53,  99, 345, 236],
                 [ 24, 246, 245, 433],
                 [133,  82, 203, 178],
                 [275, 154, 454, 477],
                 [ 54,   0, 367, 621],
                 [131, 419, 231, 551],
                 [168, 280, 237, 432],
                 [ 28, 488,  83, 528]], dtype=int32),
  'class_ids': array([ 1,  1, 16, 17, 58, 16, 16, 40], dtype=int32),
  'scores': array([0.9947732 , 0.99246526, 0.9918096 , 0.9914255 , 0.9877041 ,
                   0.9839083 , 0.78973764, 0.70102155], dtype=float32),
  'masks': array([[[False, False, False, ..., False, False, False],
                   [False, False, False, ..., False, False, False],
                   [False, False, False, ..., False, False, False],
                   ...]])}]
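As a sketch of how these fields can be consumed, the snippet below keeps only confident detections (it assumes results from model.detect and a class_names list are already defined):

# Keep only detections above a confidence threshold (illustrative sketch).
r = results[0]
threshold = 0.9
for box, class_id, score in zip(r['rois'], r['class_ids'], r['scores']):
    if score >= threshold:
        y1, x1, y2, x2 = box
        print(f"{class_names[class_id]}: {score:.2f} at [{y1}, {x1}, {y2}, {x2}]")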
After the detection is performed, we visualize the results by drawing bounding boxes on the input image. The expression results[0]['rois'] retrieves the bounding box coordinates (RoIs).
results = model.detect([img], verbose=0)
draw_image_with_boxes('/usr/local/notebooks/cd.jpeg', results[0]['rois'])
The display_instances function uses this information to draw bounding boxes, overlay segmentation masks, label object classes, and display confidence scores on the image.
from mrcnn.visualize import display_instances

# Show photo with bounding boxes, masks, class labels, and scores
display_instances(img, results[0]['rois'], results[0]['masks'], results[0]['class_ids'], class_names, results[0]['scores'])
Try it yourself
You may need to re-run the results = model.detect([img], verbose=0) cell in the .ipynb notebook. The reason is that when running models on a CPU, TensorFlow can struggle to allocate or manage memory during the initial graph execution; graph compilation and execution usually succeed on a retry. Please note you may see warnings, but they do not affect the output.
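One way to automate that retry is a small wrapper like the sketch below (the broad except clause is deliberate, since the exact failure mode on the first CPU run can vary):

# Retry model.detect a few times to work around the flaky first CPU run.
def detect_with_retry(model, images, attempts=3):
    for attempt in range(attempts):
        try:
            return model.detect(images, verbose=0)
        except Exception as exc:  # the first run can fail while the graph warms up
            print(f"detect failed on attempt {attempt + 1}: {exc}")
    raise RuntimeError("model.detect failed after retries")

results = detect_with_retry(model, [img])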
Run the code below to see the implementation:
Press the "Run" button and then click on the link in the widget below.
Conclusion
In conclusion, Mask R-CNN is a powerful model for instance segmentation that effectively combines object detection and pixel-level segmentation. Its architecture, featuring components like the backbone network and the Region Proposal Network, allows for precise identification of objects in images. Mastering this model is essential for IT professionals and researchers tackling complex visual recognition challenges, making it a cornerstone for robust computer vision solutions in a rapidly evolving landscape.
Frequently asked questions
How to create masks for image segmentation?
Masks for image segmentation are created by assigning a label to each pixel in an image, indicating which object or region it belongs to. In practice, this is done by annotating images with binary masks, often using tools like LabelMe or the VGG Image Annotator. These masks are then used to train segmentation models like U-Net or Mask R-CNN.
How do you implement instance segmentation?
What is the difference between Mask R-CNN and Faster R-CNN?