Semi-Supervised Learning Techniques

Semi-supervised learning is a class of machine learning that uses both labelled and unlabelled data to solve learning problems. This lesson covers the concept in detail.

Semi-supervised learning

Supervised learning uses labelled data, and unsupervised learning works without labels. Semi-supervised learning lies between the two: it makes use of both labelled and unlabelled data. We looked into the technique of Pseudo-Labeling in the initial lesson of this chapter.

Labeling data is a costly process. The biggest benefit of semi-supervised learning is that it requires only a small amount of labelled data. In most cases, labelled data is scarce, and we can leverage semi-supervised learning to increase the amount of labelled data.

Assumptions about the data

Semi-supervised learning makes the following assumptions about the dataset at hand:

  • Continuity Assumption: The Continuity Assumption states that points that are close to each other are more likely to share a label.

  • Cluster Assumption: The Cluster Assumption states that the data forms discrete clusters, and points lying in the same cluster are more likely to share a label.

  • Manifold Assumption: It states that data lies on a manifold of a much lower dimension than the input space.
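To make the continuity and cluster assumptions concrete, here is a minimal sketch in plain Python. It is an illustration only, not a method from this lesson: the function name `propagate_labels` and the greedy strategy (always label the unlabelled point that is closest to an already labelled one) are assumptions chosen for simplicity.

```python
import math

def propagate_labels(points, known):
    """Greedily spread labels to nearby points.

    points: list of coordinate tuples.
    known:  dict mapping point index -> label (needs at least one entry).
    """
    labels = dict(known)
    unlabelled = set(range(len(points))) - set(labels)
    while unlabelled:
        # Find the closest (unlabelled, labelled) pair of points.
        u, l = min(
            ((u, l) for u in unlabelled for l in labels),
            key=lambda pair: math.dist(points[pair[0]], points[pair[1]]),
        )
        # Continuity assumption: a close neighbour likely shares the label.
        labels[u] = labels[l]
        unlabelled.remove(u)
    return [labels[i] for i in range(len(points))]

# Two clusters on a line; only one point per cluster is labelled.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (6.5, 0), (7.5, 0)]
print(propagate_labels(points, {0: "a", 5: "b"}))
# → ['a', 'a', 'a', 'a', 'a', 'b', 'b']
```

Note that the point (4, 0) is closer to the labelled "b" point at (6.5, 0) than to the labelled "a" point at (0, 0), yet it ends up labelled "a" because the label reaches it through its own cluster in small steps. This is the cluster assumption at work.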

Applications of semi-supervised learning

  • Text Classification is a natural application of semi-supervised learning because large amounts of labelled text are rarely available to train a classifier. The same goes for Speech Analysis.

  • Semi-supervised Learning is used extensively in the field of Bioinformatics, where we process large DNA strands.

Pseudo Labeling

Pseudo-Labeling is a popular semi-supervised learning technique, and participants in Kaggle competitions use it extensively.

Pseudo Labeling involves the following steps.

  1. First, we train different models on the training dataset and choose the one that gives us the best results.

  2. Next, we use the model trained in Step 1 to predict labels for the test dataset. We don’t know whether these predictions (pseudo-labels) are correct, but if the model performed well in Step 1 and the test data resembles the training data, we can expect most of them to be reasonably accurate.

  3. Finally, we combine the training dataset with the pseudo-labelled test dataset and train the model again, as we did in Step 1.

The above steps are also referred to as self-training in the literature.
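The three steps above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: the nearest-centroid classifier, the helper names (`fit_centroids`, `predict`, `pseudo_label`), and the toy data are all hypothetical, and Step 1 trains a single model instead of comparing several.

```python
from statistics import mean

def fit_centroids(X, y):
    """Fit a nearest-centroid classifier: one mean point per class."""
    centroids = {}
    for label in set(y):
        members = [x for x, lab in zip(X, y) if lab == label]
        # Average each coordinate across the members of this class.
        centroids[label] = tuple(mean(coord) for coord in zip(*members))
    return centroids

def predict(centroids, X):
    """Assign each point the label of its closest centroid."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return [min(centroids, key=lambda lab: sq_dist(x, centroids[lab])) for x in X]

def pseudo_label(X_train, y_train, X_test):
    # Step 1: train on the labelled training data.
    model = fit_centroids(X_train, y_train)
    # Step 2: predict pseudo-labels for the unlabelled test data.
    pseudo = predict(model, X_test)
    # Step 3: retrain on the training data plus the pseudo-labelled data.
    final_model = fit_centroids(X_train + X_test, y_train + pseudo)
    return final_model, pseudo

# Toy data (made up for illustration): two well-separated classes.
X_train = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
y_train = [0, 0, 1, 1]
X_test = [(0.3, -0.2), (4.8, 5.2), (-0.1, 0.4)]

final_model, pseudo = pseudo_label(X_train, y_train, X_test)
print(pseudo)  # → [0, 1, 0]
```

In practice, a common refinement is to keep only the pseudo-labels the model is confident about, since wrong pseudo-labels are fed back into training and can reinforce the model's own mistakes.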
