Gesture recognizer in deep learning

Deep learning has paved the way for numerous revolutionary techniques in computer vision. As models improve, we can apply them to classification, recognition, and prediction tasks, which proves immensely useful in many real-world applications that we'll see later on. Gesture recognition in images is one such example, and it's precisely what we'll target in this answer!

Gesture recognition

The process of recognizing what position our hand is in and what gesture it may indicate is known as gesture recognition. We can submit unlabelled images to a gesture recognition application, and a trained model then predicts what gesture each picture depicts.

What's the gesture?

It's a thumbs up!

MediaPipe and deep learning

MediaPipe is an open-source framework that provides various deep learning models that are trained to handle tasks like image classification, face and hand landmark detection, language detection, and more.

MediaPipe logo

Gesture recognizer model

The model we will use for our application is a computer vision model from the MediaPipe framework. We can download the gesture_recognizer.task file from the official documentation. This task file is the trained model for our application, and we can use it directly to recognize gestures in new images.

Note: You can download the model called gesture_recognizer.task here: https://ai.google.dev/edge/mediapipe/solutions/vision/gesture_recognizer and reference it in your code.
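
If you prefer, the task file can also be fetched programmatically. A minimal sketch, assuming the model URL used in MediaPipe's official examples at the time of writing (verify it against the documentation linked above before relying on it):

import urllib.request

# Model URL as listed in MediaPipe's official examples; this is an
# assumption and may change, so check the documentation linked above.
MODEL_URL = ("https://storage.googleapis.com/mediapipe-models/gesture_recognizer/"
             "gesture_recognizer/float16/1/gesture_recognizer.task")

# Save the trained model next to the script so the code below can reference it.
urllib.request.urlretrieve(MODEL_URL, "gesture_recognizer.task")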

Hand landmarks

A crucial first step in gesture recognition is detecting whether a hand is present in the image at all and, if so, where it lies. Hand landmarks are specific points on the hand used for tracking hand gestures.

We can then extract the hand landmarks, such as the fingertips and the palm center, to analyze and interpret various hand gestures accurately.
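
For instance, MediaPipe numbers 21 landmarks per hand and exposes them as a named enum, so specific points can be referenced by name instead of a raw index. A quick sketch:

import mediapipe as mp

mp_hands = mp.solutions.hands

# Each of the 21 landmarks has a named index: the wrist is 0,
# the tip of the index finger is 8, and so on.
print(mp_hands.HandLandmark.WRIST.value)             # 0
print(mp_hands.HandLandmark.INDEX_FINGER_TIP.value)  # 8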

MediaPipe's default landmark numbering

Code walkthrough

import cv2
import mediapipe as mp
from mediapipe.framework.formats import landmark_pb2
from mediapipe.tasks import python
from mediapipe.tasks.python import vision
  • The first step is to import the necessary libraries for our code.

    • cv2 is OpenCV's Python library, used here for reading, annotating, and displaying images

    • mediapipe provides the hand landmark solution and the gesture recognizer model we require; landmark_pb2 holds the protocol buffer types for landmark lists, while python and vision expose the MediaPipe Tasks API

img_file = "path/image.png"
img_to_process = cv2.imread(img_file)
  • img_file holds the path of the image we will predict gestures for. We read the image and store it in the img_to_process variable using OpenCV's imread method.
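
One caveat worth noting: cv2.imread does not raise an error for a missing or unreadable file; it silently returns None. A small defensive check (a sketch reusing the variables above) turns that into a clear failure:

# imread returns None instead of raising when the file cannot be read.
if img_to_process is None:
    raise FileNotFoundError(f"Could not read image at '{img_file}'")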

hands = mp.solutions.hands.Hands(min_detection_confidence=0.5, min_tracking_confidence=0.5)
rgb_format_img = cv2.cvtColor(img_to_process, cv2.COLOR_BGR2RGB)
results = hands.process(rgb_format_img)
  • MediaPipe offers a solution that recognizes hands within an image and generates the respective landmarks, i.e., the coordinates of various points within the hand. We save an instance of this solution in hands and require a confidence of at least 50% for detection and tracking. Since MediaPipe processes images in RGB format, we first convert the image using the cv2.cvtColor method. The results variable stores the resulting landmarks when the hands solution is applied to rgb_format_img.
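
Before going further, it can help to inspect what the solution returned. results.multi_hand_landmarks is None when no hands are found, and results.multi_handedness labels each detected hand; a short sketch continuing from the code above:

if results.multi_hand_landmarks:
    print(f"Detected {len(results.multi_hand_landmarks)} hand(s)")
    for handedness in results.multi_handedness:
        # Each entry carries a 'Left'/'Right' label and a confidence score.
        classification = handedness.classification[0]
        print(classification.label, classification.score)
else:
    print("No hands detected")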

hand_landmarks_list = []
if results.multi_hand_landmarks:
    for hand_landmarks in results.multi_hand_landmarks:
        hand_landmarks_protocol = landmark_pb2.NormalizedLandmarkList()
        hand_landmarks_protocol.landmark.extend([
            landmark_pb2.NormalizedLandmark(x=landmark.x, y=landmark.y, z=landmark.z) for landmark in hand_landmarks.landmark
        ])
        hand_landmarks_list.append(hand_landmarks_protocol)
  • The detected hand landmarks are represented as coordinates of various points within the hand. We extract and store these landmarks in hand_landmarks_list as a list of NormalizedLandmarkList objects from the landmark_pb2 module.
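
These landmarks are normalized: each x and y lies in [0, 1] relative to the image width and height, and z is a relative depth with the wrist as the origin. To recover pixel positions, scale by the image size, as in this sketch:

# Convert the wrist (landmark 0) of the first detected hand to pixel coordinates.
height, width = img_to_process.shape[:2]
wrist = hand_landmarks_list[0].landmark[0]
wrist_px = (int(wrist.x * width), int(wrist.y * height))
print(f"Wrist is at pixel {wrist_px}")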

mp_drawing_styles = mp.solutions.drawing_styles
mp_drawing = mp.solutions.drawing_utils
mp_hands = mp.solutions.hands
  • We define objects from the MediaPipe mp module for drawing and styling the landmarks on the image.
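
Here, mp_hands.HAND_CONNECTIONS is just a set of (start, end) landmark pairs telling draw_landmarks which points to join with lines. A sketch to peek at a few of them:

# Print a few of the landmark pairs that make up the hand skeleton.
for start, end in list(mp_hands.HAND_CONNECTIONS)[:3]:
    print(mp_hands.HandLandmark(start).name, "->", mp_hands.HandLandmark(end).name)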

if hand_landmarks_list:
    copied_image = img_to_process.copy()
    for landmark in hand_landmarks_list:
        mp_drawing.draw_landmarks(
            copied_image,
            landmark,
            mp_hands.HAND_CONNECTIONS,
            mp_drawing_styles.get_default_hand_landmarks_style(),
            mp_drawing_styles.get_default_hand_connections_style()
        )
    base_options = python.BaseOptions(model_asset_path='gesture_recognizer.task')
    options = vision.GestureRecognizerOptions(base_options=base_options)
    recognizer = vision.GestureRecognizer.create_from_options(options)
    image = mp.Image.create_from_file(img_file)
    recognition_result = recognizer.recognize(image)
    top_gesture = recognition_result.gestures[0][0]
    gesture_prediction = f"{top_gesture.category_name} ({top_gesture.score:.2f})"
    cv2.putText(copied_image, gesture_prediction, (10, copied_image.shape[0] - 20), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 255), 2)
    cv2.imshow("Guess the gesture!", copied_image)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
else:
    print("No hands were detected!")
  • If hand_landmarks_list contains landmarks, we proceed with visualizing them on the input image.

  • We create a copy of the input image, copied_image, to draw the landmarks on it. Next, we use the mp_drawing.draw_landmarks function to draw the hand landmarks using hand connections and landmark styles.

  • We specify the path to our model and the required options using python.BaseOptions and vision.GestureRecognizerOptions. Our model is referenced through "gesture_recognizer.task".

  • We initialize a gesture recognition model using the vision.GestureRecognizer class and recognize the gesture based on the hand landmarks.

  • The recognized gesture and its score are formatted into gesture_prediction and drawn onto the image using cv2.putText, showing the gesture's category name and its corresponding score. (See the sketch after this list for inspecting every candidate gesture.)

  • Finally, we display the copied_image with the recognized gesture using cv2.imshow. The user can view the image with the recognized gesture, and it will remain open until a key is pressed.

  • If no hands are detected in the image, we print "No hands were detected!".
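
The recognizer actually returns a ranked list of candidate categories for each detected hand, and each score is the model's confidence between 0 and 1. A sketch, continuing from the code above, for inspecting every candidate rather than only the top one:

# recognition_result.gestures holds one ranked candidate list per detected hand.
for hand_index, candidates in enumerate(recognition_result.gestures):
    for category in candidates:
        print(f"Hand {hand_index}: {category.category_name} ({category.score:.2f})")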

Executable code

Yay, we've completed our code walkthrough and can now see the code in action. You can edit the code window below and click "Run" to see the results.

import cv2
import mediapipe as mp
from mediapipe.framework.formats import landmark_pb2
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

img_file = "image2.png"
img_to_process = cv2.imread(img_file)

hands = mp.solutions.hands.Hands(min_detection_confidence=0.5, min_tracking_confidence=0.5)

rgb_format_img = cv2.cvtColor(img_to_process, cv2.COLOR_BGR2RGB)

results = hands.process(rgb_format_img)

hand_landmarks_list = []

if results.multi_hand_landmarks:
    for hand_landmarks in results.multi_hand_landmarks:
        hand_landmarks_protocol = landmark_pb2.NormalizedLandmarkList()
        hand_landmarks_protocol.landmark.extend([
            landmark_pb2.NormalizedLandmark(x=landmark.x, y=landmark.y, z=landmark.z) for landmark in hand_landmarks.landmark
        ])
        hand_landmarks_list.append(hand_landmarks_protocol)

mp_drawing_styles = mp.solutions.drawing_styles
mp_drawing = mp.solutions.drawing_utils
mp_hands = mp.solutions.hands

if hand_landmarks_list:
    copied_image = img_to_process.copy()

    for landmark in hand_landmarks_list:
        mp_drawing.draw_landmarks(
            copied_image,
            landmark,
            mp_hands.HAND_CONNECTIONS,
            mp_drawing_styles.get_default_hand_landmarks_style(),
            mp_drawing_styles.get_default_hand_connections_style()
        )

    base_options = python.BaseOptions(model_asset_path='gesture_recognizer.task')
    options = vision.GestureRecognizerOptions(base_options=base_options)
    recognizer = vision.GestureRecognizer.create_from_options(options)
    image = mp.Image.create_from_file(img_file)
    recognition_result = recognizer.recognize(image)
    top_gesture = recognition_result.gestures[0][0]

    gesture_prediction = f"{top_gesture.category_name} ({top_gesture.score:.2f})"
    cv2.putText(copied_image, gesture_prediction, (10, copied_image.shape[0] - 20), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 255), 2)

    cv2.imshow("Guess the gesture!", copied_image)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
else:
    print("No hands were detected!")

Gesture recognition demonstration

Real-life applications

A wonderful aspect of such technology is that it powers many revolutionary domains in real life. Let's see where gesture recognition matters around us!

Use cases and explanations:

  • Human-computer interaction: Enables users to interact with computers, mobile phones, or other devices using hand gestures; used for gesture-based navigation and performing tasks.

  • Gaming: Enhances gaming experiences by allowing players to control characters and actions; popular in motion-controlled games.

  • Virtual reality: Enables users to interact with virtual environments using hand gestures; provides a natural and intuitive way to pick up objects, manipulate virtual elements, and navigate.

  • Sign language interpretation: Converts sign language gestures into text or speech, aiding communication.

  • Augmented reality: Allows users to interact with digital content overlaid on the real world using hand gestures.

  • Assistive technology: Helps individuals with physical disabilities control devices.

Note: Here's a list of related MediaPipe and deep learning projects.

  1. Real-time 3D face mesh

  2. Gesture recognizer

  3. Language detection

  4. Pose detection

  5. Emotion detection

  6. Real-time emotion detection

Test your knowledge here!

Question

What does the score in the model represent?

Answer: The score is the model's confidence, a value between 0 and 1, that the detected hand matches the predicted gesture category; higher scores indicate more confident predictions.
