Multimodal deep learning is a branch of machine learning in which models are trained to process several data types, known as "modalities", and to discover correlations among them. By integrating multiple modalities, a deep learning model can build a more comprehensive understanding of its environment, since certain cues are present only in specific modalities. The essential modalities are as follows:
Text: It represents written or spoken language in various forms, such as sentences, paragraphs, or documents.
Images: It includes visual data, such as pictures, illustrations, and video frames.
Audio: It includes data based on sounds, such as speech, music, or environmental sounds, which is an essential modality for understanding the audible world.
Video: It consists of sequences of images or frames, typically accompanied by sound.
Sensor data: It includes readings from accelerometers, gyroscopes, temperature sensors, etc.
The most popular combinations are built from the three most widely used modalities:
Image + Text
Image + Text + Audio
Text + Audio
Image + Audio
One example of multimodal deep learning is in autonomous driving, where data from various sensors such as cameras, LiDAR, and radar is combined using deep neural networks. This fusion enables the autonomous vehicle to better perceive and understand its environment, allowing it to make safer and more informed decisions while navigating the road.
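To make the idea of sensor fusion concrete, here is a minimal PyTorch sketch. It is an illustration rather than a real driving stack: the camera, LiDAR, and radar inputs are assumed to be pre-extracted feature vectors, each modality gets its own small encoder, and the concatenated features feed a shared head that predicts a steering and throttle command.

```python
import torch
import torch.nn as nn

class SensorFusionNet(nn.Module):
    """Toy late-fusion network: one encoder per sensor, concatenated features."""
    def __init__(self, cam_dim=512, lidar_dim=256, radar_dim=64, hidden=128):
        super().__init__()
        # Per-modality encoders (stand-ins for real CNN / point-cloud backbones).
        self.cam_enc = nn.Sequential(nn.Linear(cam_dim, hidden), nn.ReLU())
        self.lidar_enc = nn.Sequential(nn.Linear(lidar_dim, hidden), nn.ReLU())
        self.radar_enc = nn.Sequential(nn.Linear(radar_dim, hidden), nn.ReLU())
        # Shared head maps the fused feature to a steering + throttle command.
        self.head = nn.Sequential(nn.Linear(3 * hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 2))

    def forward(self, cam, lidar, radar):
        # Fuse by concatenating the three encoded feature vectors.
        fused = torch.cat([self.cam_enc(cam),
                           self.lidar_enc(lidar),
                           self.radar_enc(radar)], dim=-1)
        return self.head(fused)

# Example forward pass with random stand-in features for a batch of 4 frames.
net = SensorFusionNet()
controls = net(torch.randn(4, 512), torch.randn(4, 256), torch.randn(4, 64))
print(controls.shape)  # torch.Size([4, 2])
```

In practice each per-sensor encoder would be a full backbone (for example, a CNN over camera frames and a point-cloud network over LiDAR sweeps), but the fusion pattern stays the same.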
Multimodal models arise out of the need to better understand and process information that comes from multiple sources or modalities, such as text, images, audio, video, etc. While unimodal or monomodal models can be effective for processing data from a single modality, they have limitations when it comes to handling diverse and complex data that often involve multiple modalities. Here are some key reasons why multimodal models are necessary despite the existence of unimodal models:
Combining information from different modalities can lead to a more comprehensive and richer representation of the underlying data.
Multimodal models can facilitate cross-modal inference, where information from one modality can help in understanding another.
Multimodal models can be more robust to noise or missing data in one modality.
Multimodal deep learning addresses several challenges related to processing and understanding data that comes from multiple modalities. The core challenges it helps to solve are:
Data fusion: Integrating information from multiple sources effectively.
Translation: Converting information between different modalities (e.g., text to image).
Co-learning: Jointly training the model with data from multiple modalities to capture interactions and dependencies effectively.
Cross-modal representation learning: Learning shared representations for diverse modalities (a minimal contrastive-alignment sketch follows this list).
Alignment and matching: Finding correspondences between elements from different modalities.
Complementary information exploitation: Leveraging unique and complementary information from each modality.
Handling incomplete or missing data: Dealing with incomplete or unavailable data from different modalities.
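Several of these challenges, in particular cross-modal representation learning and alignment, are often addressed with contrastive objectives that pull the embeddings of paired examples from different modalities together. The snippet below is a simplified, CLIP-style sketch under the assumption that the i-th image and the i-th caption in a batch form a positive pair; the random embeddings are stand-ins for real encoder outputs.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings."""
    # L2-normalise so the dot product becomes a cosine similarity.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    # Pairwise similarity matrix: logits[i, j] compares image i with text j.
    logits = img_emb @ txt_emb.t() / temperature
    # Matching pairs sit on the diagonal.
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy batch of 8 already-encoded image/text pairs with 128-dim embeddings.
loss = contrastive_alignment_loss(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```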
In the working mechanism of multimodal deep learning, the following steps are involved:
Unimodal encoding: The process begins with several unimodal neural networks, such as a visual network and an audio network. Each unimodal network processes its respective input data separately: visual data is fed into one network, audio data into another, and each encoder extracts the relevant features and information from its modality.
Multimodal data fusion: Once unimodal encoding is complete, the information obtained from each modality must be fused. Several fusion techniques are available, ranging from simple concatenation to more sophisticated attention mechanisms. The fusion step is crucial because it determines how well the different modalities are integrated and how effectively the combined information is represented.
Decision network training: After fusion, the encoded and fused information is passed to a final "decision" network designed for the end task the multimodal model aims to accomplish. This network takes the integrated representation as input and is trained to make predictions or decisions based on the combined knowledge from all modalities. A minimal code sketch of the full pipeline follows the diagram below.
Here is a diagram demonstrating the architectural workflow of multimodal learning:
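To make the three stages above concrete, here is a minimal PyTorch sketch of the same workflow. The feature dimensions, the attention-based fusion, and the classification head are illustrative assumptions rather than a specific published architecture, and the image and audio inputs are assumed to be pre-extracted feature vectors.

```python
import torch
import torch.nn as nn

class MultimodalClassifier(nn.Module):
    """Unimodal encoding -> attention-weighted fusion -> decision network."""
    def __init__(self, img_dim=2048, aud_dim=128, hidden=256, num_classes=10):
        super().__init__()
        # 1) Unimodal encoders: each modality is embedded into a shared size.
        self.img_enc = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU())
        self.aud_enc = nn.Sequential(nn.Linear(aud_dim, hidden), nn.ReLU())
        # 2) Fusion: a tiny attention module scores each modality token and
        #    takes a weighted sum (one step up from plain concatenation).
        self.attn_score = nn.Linear(hidden, 1)
        # 3) Decision network: trained end-to-end for the final task.
        self.decision = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                      nn.Linear(hidden, num_classes))

    def forward(self, img_feat, aud_feat):
        tokens = torch.stack([self.img_enc(img_feat),
                              self.aud_enc(aud_feat)], dim=1)    # (B, 2, hidden)
        weights = torch.softmax(self.attn_score(tokens), dim=1)  # (B, 2, 1)
        fused = (weights * tokens).sum(dim=1)                    # (B, hidden)
        return self.decision(fused)

# Example forward pass with random stand-in features for a batch of 4 clips.
model = MultimodalClassifier()
logits = model(torch.randn(4, 2048), torch.randn(4, 128))
print(logits.shape)  # torch.Size([4, 10])
```

Swapping the attention-weighted sum for a plain torch.cat of the two encoded tokens would give the simpler concatenation-based fusion mentioned above.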
Researchers and organizations have developed numerous multimodal datasets. Below is a selection of the most widely used datasets in the domain:
COCO-Captions Dataset: Released by Microsoft, it contains 330K images, each paired with short text descriptions, for image captioning research (a loading sketch using torchvision follows this list).
Kinetics 400/600/700: An audiovisual dataset of YouTube videos, suitable for human action recognition, pose estimation, and scene understanding.
RGB-D Object Dataset: Combines RGB and depth sensor modalities, with videos of 300 household objects and 22 scenes, used for 3D object detection and depth estimation tasks.
VQA: A Visual Question Answering dataset with 265K images and at least three questions per image, requiring vision, language, and common sense knowledge for answers.
CMU-MOSEI: A multimodal dataset for emotion recognition and sentiment analysis, including 23,500 sentences spoken by 1,000 YouTube speakers, combining video, audio, and text modalities.
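As a brief example of working with one of these datasets, the snippet below loads image-caption pairs through torchvision's CocoCaptions wrapper. The paths are placeholders, and the wrapper additionally requires the pycocotools package plus a locally downloaded copy of the COCO images and annotation file.

```python
import torchvision.datasets as dset
import torchvision.transforms as T

# Paths are placeholders; point them at a local copy of COCO (e.g. val2017).
coco = dset.CocoCaptions(
    root="path/to/coco/val2017",                               # image directory
    annFile="path/to/coco/annotations/captions_val2017.json",  # caption file
    transform=T.Compose([T.Resize((224, 224)), T.ToTensor()]),
)

print("number of images:", len(coco))
image, captions = coco[0]   # one image tensor and its list of caption strings
print(image.shape)          # e.g. torch.Size([3, 224, 224])
print(captions[0])          # first of the human-written captions for this image
```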
Here are some of the exciting areas where multimodal learning has made significant contributions:
It is useful for generating descriptive captions for images (image captioning).
It is useful for Visual Question Answering (VQA) in interactive AI systems, allowing users to ask questions about images and receive relevant answers from the model.
It is useful for medical image analysis in healthcare, enabling more accurate diagnosis and treatment planning by combining medical images such as MRIs, CT scans, and X-rays with patient records.
It is useful for emotion recognition, analyzing facial expressions, vocal intonation, and textual content, with applications in fields such as human-computer interaction, marketing, and mental health.
Multimodal deep learning represents a promising advancement in AI models, harnessing the potential of multiple modalities to convey richer information and improve predictive performance. Although challenges in training such networks exist, the widespread applications and opportunities it unlocks in various fields make it a highly active and vital research area.