Deep learning has paved the way for numerous revolutionary techniques in the field of computer vision. As models improve each day, we can apply them to tasks such as classification, recognition, and prediction. This is immensely useful in quite a lot of real-world applications that we'll see later on. Gesture recognition in images is one such example and precisely what we'll be targeting in this answer!
Gesture recognition is the process of identifying what position a hand is in and what gesture it indicates. We can submit unlabelled images to a gesture recognition application, and a trained model can then predict what gesture the picture depicts.
MediaPipe is an open-source framework that provides various deep learning models that are trained to handle tasks like image classification, face and hand landmark detection, language detection, and more.
The model we will use for our application is a computer vision model from the MediaPipe framework. We can download the `gesture_recognizer.task` file from their official documentation. This task file serves as the trained model for our application, and we can simply use it to recognize patterns in new images.
Note: You can download the model called `gesture_recognizer.task` from the official documentation at https://ai.google.dev/edge/mediapipe/solutions/vision/gesture_recognizer and reference it in your code.
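If you prefer to fetch the model programmatically, here's a minimal sketch. The download URL below is taken from the MediaPipe documentation at the time of writing; treat it as an assumption and verify it against the page above, since hosted model paths can change.

```python
import urllib.request

# Assumed model URL from the MediaPipe docs; verify it before relying on it.
MODEL_URL = ("https://storage.googleapis.com/mediapipe-models/gesture_recognizer/"
             "gesture_recognizer/float16/1/gesture_recognizer.task")

urllib.request.urlretrieve(MODEL_URL, "gesture_recognizer.task")
print("Model saved as gesture_recognizer.task")
```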
A crucial first step in gesture recognition is detecting whether a hand exists in the image at all and, if so, identifying its coordinates. Hand landmarks are specific points on the hand used for tracking hand gestures.
We can then extract the hand landmarks, such as fingertips and palm center, to analyze and interpret various hand gestures accurately.
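MediaPipe's hand model exposes 21 named landmarks per hand, indexed 0 through 20, covering the wrist and the joints of each finger. As a quick sketch of how they are named, using the `HandLandmark` enum from MediaPipe's Hands solution:

```python
import mediapipe as mp

# Each landmark has a stable index; the wrist and fingertips are the most
# commonly used points for gesture analysis.
HandLandmark = mp.solutions.hands.HandLandmark

print(HandLandmark.WRIST.value)             # 0
print(HandLandmark.INDEX_FINGER_TIP.value)  # 8
print(len(HandLandmark))                    # 21 landmarks in total
```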
```python
import cv2
import mediapipe as mp
from mediapipe.framework.formats import landmark_pb2
from mediapipe.tasks import python
from mediapipe.tasks.python import vision
```
The first step is to import the necessary libraries for our code:

- `cv2` is OpenCV's library that is mainly useful for image processing tasks.
- `mediapipe` offers the particular gesture recognition model we require.
```python
img_file = "path/image.png"
img_to_process = cv2.imread(img_file)
```
`img_file` refers to the path of the image we will predict the gestures for. We read the image and store it in the `img_to_process` variable using OpenCV's `imread` method.
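One caveat worth noting: `cv2.imread` does not raise an error for a missing or unreadable file; it silently returns `None`. A small guard like the following (the error message is our own choice) fails fast instead of producing a confusing error later on:

```python
img_to_process = cv2.imread(img_file)

# imread returns None instead of raising when the path is wrong.
if img_to_process is None:
    raise FileNotFoundError(f"Could not read image at {img_file}")
```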
```python
hands = mp.solutions.hands.Hands(min_detection_confidence = 0.5, min_tracking_confidence = 0.5)
rgb_format_img = cv2.cvtColor(img_to_process, cv2.COLOR_BGR2RGB)
results = hands.process(rgb_format_img)
```
MediaPipe offers a solution that recognizes hands within an image and generates the respective landmarks, i.e., coordinates of various points within the hand. We save an instance of this solution in `hands` and specify a confidence level of at least 50% for the recognition. Since OpenCV reads images in BGR format while MediaPipe processes them in RGB, we first make the necessary conversion using the `cv2.cvtColor` method. The `results` variable stores the final landmarks when the `hands` solution is applied to `rgb_format_img`.
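To get a feel for what `process` returns: `results.multi_hand_landmarks` is `None` when no hands are found, and otherwise holds one landmark set per detected hand. A quick sketch, reusing the variables above:

```python
# Report how many hands the solution detected in the image.
if results.multi_hand_landmarks:
    print(f"Detected {len(results.multi_hand_landmarks)} hand(s)")
else:
    print("No hands detected")
```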
```python
hand_landmarks_list = []

if results.multi_hand_landmarks:
    for hand_landmarks in results.multi_hand_landmarks:
        hand_landmarks_protocol = landmark_pb2.NormalizedLandmarkList()
        hand_landmarks_protocol.landmark.extend([
            landmark_pb2.NormalizedLandmark(x = landmark.x, y = landmark.y, z = landmark.z)
            for landmark in hand_landmarks.landmark
        ])
        hand_landmarks_list.append(hand_landmarks_protocol)
```
The detected hand landmarks are represented as coordinates of various points within the hand. We extract and store these landmarks in `hand_landmarks_list` as a list of `NormalizedLandmarkList` objects from the `landmark_pb2` module.
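Each landmark's `x` and `y` values are normalized to the range [0, 1] relative to the image width and height, so converting one to pixel coordinates is a simple scaling. An illustrative sketch using the wrist (landmark index 0) of the first detected hand:

```python
# Scale a normalized landmark back into pixel space.
if hand_landmarks_list:
    height, width = img_to_process.shape[:2]
    wrist = hand_landmarks_list[0].landmark[0]  # index 0 is the wrist
    wrist_px = (int(wrist.x * width), int(wrist.y * height))
    print(f"Wrist located at pixel {wrist_px}")
```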
```python
mp_drawing_styles = mp.solutions.drawing_styles
mp_drawing = mp.solutions.drawing_utils
mp_hands = mp.solutions.hands
```
We define objects from the MediaPipe `mp` module for drawing and styling the landmarks on the image.
```python
if hand_landmarks_list:
    copied_image = img_to_process.copy()

    for landmark in hand_landmarks_list:
        mp_drawing.draw_landmarks(
            copied_image,
            landmark,
            mp_hands.HAND_CONNECTIONS,
            mp_drawing_styles.get_default_hand_landmarks_style(),
            mp_drawing_styles.get_default_hand_connections_style()
        )

    base_options = python.BaseOptions(model_asset_path = 'gesture_recognizer.task')
    options = vision.GestureRecognizerOptions(base_options = base_options)
    recognizer = vision.GestureRecognizer.create_from_options(options)

    image = mp.Image.create_from_file(img_file)
    recognition_result = recognizer.recognize(image)

    top_gesture = recognition_result.gestures[0][0]
    gesture_prediction = f"{top_gesture.category_name} ({top_gesture.score:.2f})"

    cv2.putText(copied_image, gesture_prediction, (10, copied_image.shape[0] - 20),
                cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 255), 2)

    cv2.imshow("Guess the gesture!", copied_image)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
else:
    print("No hands were detected!")
```
If the `results` variable contains hand landmarks, we proceed with visualizing the landmarks on the input image.
We create a copy of the input image, `copied_image`, to draw the landmarks on. Next, we use the `mp_drawing.draw_landmarks` function to draw the hand landmarks using hand connections and landmark styles.
We specify the path to our model and the required options using `python.BaseOptions` and `vision.GestureRecognizerOptions`. Our model is referenced through `"gesture_recognizer.task"`.
We initialize a gesture recognizer using the `vision.GestureRecognizer` class, load the image with `mp.Image.create_from_file`, and run `recognize` on it to classify the gesture.
The top recognized gesture is stored in `gesture_prediction`, combining the gesture's category name with its confidence score. We then overlay this text on the image using `cv2.putText`.
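The score is the model's confidence in its prediction, on a scale from 0 to 1. Since `recognition_result.gestures` holds a ranked list of candidate gestures for each detected hand, we can also inspect the alternatives the model considered. A small sketch, assuming the `recognition_result` from the code above:

```python
# Print every candidate gesture per hand, highest-scoring first.
for hand_index, candidates in enumerate(recognition_result.gestures):
    for candidate in candidates:
        print(f"Hand {hand_index}: {candidate.category_name} ({candidate.score:.2f})")
```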
Finally, we display `copied_image` with the recognized gesture using `cv2.imshow`. The window remains open until a key is pressed.
If no hands are detected in the image, we print "No hands were detected!".
Yay, we've completed our code walkthrough and can now see the code in action. You can edit the code window below and click "Run" to see the results.
```python
import cv2
import mediapipe as mp
from mediapipe.framework.formats import landmark_pb2
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

img_file = "image2.png"
img_to_process = cv2.imread(img_file)

hands = mp.solutions.hands.Hands(min_detection_confidence = 0.5, min_tracking_confidence = 0.5)
rgb_format_img = cv2.cvtColor(img_to_process, cv2.COLOR_BGR2RGB)
results = hands.process(rgb_format_img)

hand_landmarks_list = []

if results.multi_hand_landmarks:
    for hand_landmarks in results.multi_hand_landmarks:
        hand_landmarks_protocol = landmark_pb2.NormalizedLandmarkList()
        hand_landmarks_protocol.landmark.extend([
            landmark_pb2.NormalizedLandmark(x = landmark.x, y = landmark.y, z = landmark.z)
            for landmark in hand_landmarks.landmark
        ])
        hand_landmarks_list.append(hand_landmarks_protocol)

mp_drawing_styles = mp.solutions.drawing_styles
mp_drawing = mp.solutions.drawing_utils
mp_hands = mp.solutions.hands

if hand_landmarks_list:
    copied_image = img_to_process.copy()

    for landmark in hand_landmarks_list:
        mp_drawing.draw_landmarks(
            copied_image,
            landmark,
            mp_hands.HAND_CONNECTIONS,
            mp_drawing_styles.get_default_hand_landmarks_style(),
            mp_drawing_styles.get_default_hand_connections_style()
        )

    base_options = python.BaseOptions(model_asset_path = 'gesture_recognizer.task')
    options = vision.GestureRecognizerOptions(base_options = base_options)
    recognizer = vision.GestureRecognizer.create_from_options(options)

    image = mp.Image.create_from_file(img_file)
    recognition_result = recognizer.recognize(image)

    top_gesture = recognition_result.gestures[0][0]
    gesture_prediction = f"{top_gesture.category_name} ({top_gesture.score:.2f})"

    cv2.putText(copied_image, gesture_prediction, (10, copied_image.shape[0] - 20),
                cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 255), 2)

    cv2.imshow("Guess the gesture!", copied_image)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
else:
    print("No hands were detected!")
```
A wonderful aspect of such technologies is that they power many revolutionary domains in real life. Let's see how gesture recognition is important around us!
| Use cases | Explanation |
| --- | --- |
| Human-computer interaction | Enables users to interact with computers, mobiles, or devices using hand gestures. Used for gesture-based navigation and performing tasks. |
| Gaming | Enhances gaming experiences by allowing players to control characters and actions. Popular in motion-controlled games. |
| Virtual reality | Enables users to interact with virtual environments using hand gestures. Provides a natural and intuitive way to pick up objects, manipulate virtual elements, and navigate. |
| Sign language interpretation | Converts sign language gestures into text or speech, aiding communication. |
| Augmented reality | Allows users to interact with digital content overlaid on the real world using hand gestures. |
| Assistive technology | Helps individuals with physical disabilities to control devices. |