Zero-shot learning (ZSL) is a machine learning method that allows a model to recognize and categorize objects or concepts it has never seen before, without any prior training on instances of those specific objects or concepts. This capability is significant for real-world applications, where it is often impractical to collect labeled training data for every class a model might encounter.
In traditional supervised learning, a model is trained using a dataset with labels for all the classes it needs to classify. However, zero-shot learning extends this capability by allowing a model to generalize to previously unseen classes using extra information or knowledge.
Zero-shot learning mirrors the innate human ability to perceive resemblances between different categories of data, which lets a model make accurate predictions even without prior training on the specific classes. Humans can recognize that both cows and deer possess horns, walk on four legs, and share other common visual traits; this lets them connect new or unfamiliar categories to visual concepts they have already encountered.
Suppose a model has been trained to recognize many dog breeds but has never seen a specific breed, such as the Komondor. ZSL allows the model to identify and categorize the Komondor based on its knowledge of other dog breeds. Like humans, ZSL relies on existing knowledge to accomplish its goals.
Here's a basic flowchart that shows the working mechanism of zero-shot learning:
In ZSL, the data is categorized into three kinds:
Seen classes: These are the data classes used to train the deep learning model.
Unseen classes: These are the classes that the model must be capable of classifying without specific training. The training procedure did not incorporate any data from these classes.
Auxiliary information: Because the unseen classes have no labeled instances, additional information is needed to solve the ZSL task. This auxiliary information must describe all the unseen classes, for example through textual descriptions, semantic attributes, or word embeddings.
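As a minimal sketch, the three kinds of data might be organized as follows; the class names and attribute vectors are purely illustrative, not taken from any real dataset:

```python
# Seen classes: labeled training data is available for these.
# Each illustrative attribute vector is [has_horns, four_legs, domesticated].
seen_attributes = {
    "cow": [1.0, 1.0, 1.0],
    "dog": [0.0, 1.0, 1.0],
}

# Unseen classes: no training examples, only auxiliary information.
unseen_attributes = {
    "deer": [1.0, 1.0, 0.0],
}

# The auxiliary information covers every class, seen and unseen,
# so the model can relate new classes to the ones it was trained on.
all_attributes = {**seen_attributes, **unseen_attributes}
print(sorted(all_attributes))  # ['cow', 'deer', 'dog']
```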
This step aims to collect or construct a dataset that includes labeled examples from a set of seen classes, together with auxiliary information about all the classes that will support zero-shot learning.
During this phase, a model acquires knowledge by studying a set of data samples that have been appropriately labeled. This phase generally consists of two steps:
Feature extraction: The input data, such as images or text, is transformed into meaningful feature representations.
Model training: The model learns to map the extracted features to the corresponding class labels.
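As a hedged sketch of this phase, the snippet below learns a linear map from hand-made toy feature vectors to class attribute vectors using plain gradient descent; training such a feature-to-attribute projection on the seen classes is one common ZSL strategy. All numbers here are illustrative:

```python
# Toy "extracted features" (2-D) for two seen classes.
features = [[0.9, 0.1], [1.0, 0.0],   # class "cow"
            [0.1, 0.9], [0.0, 1.0]]   # class "dog"
# Illustrative attribute targets: [has_horns, domesticated].
targets = [[1.0, 1.0], [1.0, 1.0],
           [0.0, 1.0], [0.0, 1.0]]

# Learn a 2x2 linear map W from feature space to attribute space
# by stochastic gradient descent on the squared error.
W = [[0.0, 0.0], [0.0, 0.0]]
lr = 0.1
for _ in range(500):
    for x, t in zip(features, targets):
        pred = [sum(W[i][j] * x[j] for j in range(2)) for i in range(2)]
        for i in range(2):
            err = pred[i] - t[i]
            for j in range(2):
                W[i][j] -= lr * err * x[j]

# A new "cow-like" feature vector should project close to [1.0, 1.0].
new_x = [0.95, 0.05]
projected = [sum(W[i][j] * new_x[j] for j in range(2)) for i in range(2)]
```

At inference time, vectors projected this way can be compared against the attribute vectors of unseen classes, which is where the auxiliary information comes in.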
During this phase, the model applies the acquired knowledge and additional information to categorize a new set of classes. This phase also generally consists of two steps:
Semantic representation: The auxiliary information is used to build semantic representations, such as attribute or embedding vectors, of the unseen classes.
Class prediction: During inference, the model takes the features extracted from unseen examples and compares them with the semantic representations of the unseen classes. The model predicts the class label by finding the closest match between the features and the semantic representations.
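The closest-match step can be sketched as a cosine-similarity search over the semantic vectors of the unseen classes. The class names, attribute dimensions, and the already-projected feature vector below are all hypothetical:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Semantic representations of the unseen classes
# (illustrative attributes: [has_horns, four_legs, domesticated]).
unseen_classes = {
    "deer":     [1.0, 1.0, 0.0],
    "komondor": [0.0, 1.0, 1.0],
}

def predict(projected_features):
    """Return the unseen class whose semantic vector is the closest match."""
    return max(unseen_classes,
               key=lambda c: cosine(projected_features, unseen_classes[c]))

# Features of a horned, four-legged, non-domesticated animal:
print(predict([0.9, 1.0, 0.1]))  # prints "deer"
```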
Note: Some variations and techniques exist within zero-shot learning, and the exact implementation may vary depending on the specific approach or algorithm being used. The flowchart above provides a general overview of the process.
Here are some commonly used evaluation metrics in zero-shot learning:
Top-1 accuracy: This metric measures the percentage of test instances correctly classified into their respective unseen classes.
Top-k accuracy: Besides the top-1 accuracy, this metric measures the percentage of test instances where the correct label appears in the top-k predicted labels.
Harmonic mean (H): In generalized zero-shot learning, where test instances can come from both seen and unseen classes, this metric combines the accuracy on seen classes and the accuracy on unseen classes into a single score, penalizing models that perform well on one group but poorly on the other.
Mean average precision (mAP): This metric is commonly used in information retrieval tasks and measures the average precision for each class. It considers the precision and recall of the model's predictions across different thresholds and calculates the average across all classes.
Area under the curve (AUC): This metric measures the model's ability to rank the unseen classes correctly. It plots the true positive rate against the false positive rate and calculates the area under this curve.
Note: The choice of evaluation metric depends on the specific zero-shot learning task and the study's objectives.
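The accuracy-style metrics above are straightforward to compute. The sketch below implements top-k accuracy and the harmonic mean of seen- and unseen-class accuracies; the predictions and labels are made-up illustrative data:

```python
def top_k_accuracy(ranked_predictions, true_labels, k=1):
    """Fraction of instances whose true label appears in the top-k predictions."""
    hits = sum(1 for preds, label in zip(ranked_predictions, true_labels)
               if label in preds[:k])
    return hits / len(true_labels)

def harmonic_mean(acc_seen, acc_unseen):
    """Harmonic mean of seen- and unseen-class accuracies (generalized ZSL)."""
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

# Hypothetical ranked predictions for three test instances.
ranked = [["deer", "cow", "dog"],
          ["dog", "deer", "cow"],
          ["cow", "deer", "dog"]]
truth = ["deer", "deer", "dog"]

print(top_k_accuracy(ranked, truth, k=1))  # 1 of 3 correct at rank 1
print(top_k_accuracy(ranked, truth, k=2))  # 2 of 3 correct within the top 2
print(harmonic_mean(0.8, 0.4))             # ~0.533
```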
Let's compare how zero-shot, few-shot and one-shot learning differ. Here are several key differentiators:
| Zero-shot learning (ZSL) | Few-shot learning (FSL) | One-shot learning (OSL) |
| --- | --- | --- |
| Requires no labeled examples for the unseen classes. | Uses a small number of labeled examples per class. | Uses a single labeled example per class. |
| Focuses on generalizing to unseen classes based on auxiliary information or semantic relationships. | Aims to achieve better generalization by leveraging information from multiple examples. | Focuses on capturing similarities between instances to make predictions. |
| Relies on transferring knowledge from known classes to unseen classes. | Aims to learn from limited labeled examples to generalize well to new instances or tasks. | Emphasizes capturing instance similarities. |
| Beneficial when instances must be classified into new or unseen classes that were not present during training. | Applies in scenarios with only a few labeled examples per class. | Particularly relevant when labeled examples are extremely scarce, typically only one per class. |
| Finds applications where new classes or concepts constantly emerge, such as fine-grained categorization, object recognition, natural language processing tasks, and cross-domain adaptation. | Finds applications in tasks such as image recognition, object detection, text classification, and speech recognition, where data scarcity or annotation costs limit the availability of labeled examples. | Finds applications in fields such as face recognition, signature verification, character recognition, and anomaly detection, where acquiring large amounts of labeled data is impractical or expensive. |
The utilization of zero-shot learning offers numerous advantages, outlined below:
Generalization to unseen classes
Reduced annotation efforts
Flexibility to handle evolving datasets
Enhanced scalability by avoiding retraining from scratch
Ability to leverage semantic relationships between classes
Improved efficiency in adapting to new tasks or domains
Facilitation of transfer learning and knowledge transfer between related tasks or domains
Reduction of data bias by avoiding overfitting to specific classes during training
Support for multilabel classification tasks by predicting multiple class labels simultaneously
There are several drawbacks associated with zero-shot learning, outlined as follows:
Limited generalization
Lack of fine-grained discrimination
Dependency on accurate attribute information
Difficulty in scaling to large class spaces
Vulnerability to semantic noise
Bias and cultural influences
Limited ability to handle interclass relationships
The column on the left lists the different learning methods, and the column on the right lists an example scenario for each method. Try matching each scenario to the learning method it describes.
Zero-shot learning (ZSL)
Given a single sample of a person’s signature, the system must determine whether a new signature belongs to the same person.
Few-shot learning (FSL)
A machine translation system must produce accurate translations for a language pair with no parallel training data.
One-shot learning (OSL)
After training a model with a few handwritten examples, a system must accurately recognize and transcribe new samples of an individual’s handwriting.