What is Machine Learning?
Get an introduction to machine learning, its process, and different types.
Machine learning is a subset of artificial intelligence that uses algorithms and statistical models to enable computers to learn from data and make predictions or decisions without being explicitly programmed for the task.
Machine learning process
The process of machine learning typically involves the following steps:
1. Collecting and preparing the data: This step consists of obtaining a labeled or unlabeled dataset and then cleaning and preprocessing it so that it is suitable for the machine learning algorithm.
2. Choosing a model and algorithm: This step involves selecting a model and algorithm appropriate for the task, based on the problem type and the nature of the data.
3. Training the model: This step involves using the preprocessed data to train the model, adjusting the model’s parameters to optimize performance.
4. Evaluating the model: This step involves using a separate test dataset to evaluate the performance of the trained model and fine-tuning the parameters if necessary. If the results do not meet the requirements, the process can return to step 2.
5. Deploying the model: This step involves making the model available for use in real-world applications, such as websites or mobile apps. Similarly, if a problem arises during deployment, the process can return to step 2.
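The five steps above can be sketched end to end in plain Python. This is only a minimal illustration under stated assumptions: the toy dataset, the single-threshold "model", and the helper names (predict, accuracy, passed_exam) are all made up for this sketch and do not come from any library.

```python
# Step 1: collect and prepare the data.
# Hypothetical toy dataset of (hours_studied, passed_exam) pairs.
data = [(h, 1 if h >= 5 else 0) for h in range(1, 11)]
test_set = data[2::5]  # hold out every fifth sample for evaluation
train_set = [row for row in data if row not in test_set]

# Step 2: choose a model. Here: "predict 1 when the feature reaches a threshold".
def predict(x, threshold):
    return 1 if x >= threshold else 0

def accuracy(rows, threshold):
    correct = sum(predict(x, threshold) == label for x, label in rows)
    return correct / len(rows)

# Step 3: train the model by picking the threshold that best fits the training set.
candidates = [x for x, _ in train_set]
threshold = max(candidates, key=lambda t: accuracy(train_set, t))

# Step 4: evaluate on the held-out test set.
test_accuracy = accuracy(test_set, threshold)
print("threshold:", threshold, "test accuracy:", test_accuracy)

# Step 5: "deploy" by exposing the trained model behind a simple function.
def passed_exam(hours_studied):
    return predict(hours_studied, threshold)
```

A real project would replace each step with something more substantial (a data pipeline, a library model, a proper evaluation protocol), but the order of the steps stays the same.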
Types of machine learning
Machine learning is commonly divided into three main categories: supervised, unsupervised, and reinforcement learning.
Supervised learning
Supervised learning is the most common type of machine learning. It involves training a model on a labeled dataset where the correct output is already known. The goal is for the model to generalize to new, unseen data.
For example, a supervised learning algorithm can be trained on a dataset of pictures of cats and dogs, where the correct label (cat or dog) is already known for each picture. Once the model is trained, it can classify new images as cats or dogs. Common algorithms used for supervised learning include linear and logistic regression, decision trees, and support vector machines (SVMs).
Example: Predicting the likelihood of bugs
One example of supervised learning in software engineering is using a machine learning model to predict the likelihood of a bug in a piece of software. This is a typical software development and maintenance task, and it can be implemented using a logistic regression algorithm.
Here’s an example of how this can be done in Python using the scikit-learn library:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
# Load the bug data (step 1 of the process)
data = pd.DataFrame([[1000,5,2,1,0],[2000,7,5,2,1],[3000,4,10,4,0],[1000,8,1,1,0],[4000,6,5,5,1],[2000,9,3,3,1],[5000,8,8,2,0],[2000,5,1,1,0],[3000,7,2,2,1],[1000,4,5,1,0],[2000,6,4,3,1],[3000,5,3,2,1],[4000,8,6,4,0],[1000,7,1,1,1],[5000,5,8,2,0],[2000,9,2,3,1],[3000,7,3,3,1],[1000,6,1,1,0],[4000,5,5,4,0],[2000,8,2,2,1],[3000,7,4,2,1],[1000,6,1,1,0],[2000,5,3,1,1],[3000,8,2,2,1],[4000,7,4,3,1],[1000,6,1,1,0],[2000,5,3,1,1],[3000,8,2,2,1],[4000,7,4,3,1]], columns=['lines_of_code','complexity','age_of_code','number_of_authors','has_bug'])
# Split the data into features and labels
X = data[["lines_of_code", "complexity", "age_of_code", "number_of_authors"]]
y = data["has_bug"]

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create and train the model
model = LogisticRegression()  # Step 2 of the process
model.fit(X_train, y_train)  # Step 3 of the process

# Use the model to make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model (step 4 of the process)
from sklearn.metrics import accuracy_score, precision_score, recall_score
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Precision: ", precision_score(y_test, y_pred))
print("Recall: ", recall_score(y_test, y_pred))
In this example, the bug data is loaded into a pandas DataFrame, and the features and labels are separated in lines 8–9. The features are lines_of_code, complexity, age_of_code, and number_of_authors. The binary has_bug label indicates whether the code has a bug.
In line 12, the data is split into training and test sets in a ratio of 80:20 using the train_test_split function from scikit-learn. This ensures that the model is tested on data it has never seen before, which is a reliable way to estimate its generalization performance.
Next, the logistic regression model is created and trained on the training data using the fit method in lines 15–16. Using the predict method, the model makes predictions on the test set in line 19.
Finally, the model’s performance is evaluated using the accuracy, precision, and recall metrics, which are provided by the sklearn.metrics module in lines 22–25:
accuracy: The fraction of predictions the model got right.
precision: The quality of a positive prediction made by the model, calculated as the number of true positives divided by the total number of positive predictions.
recall: The percentage of samples in the class of interest (the positive class) that the model correctly identifies, out of the total samples for that class.
These metrics show how well the model can predict the presence of bugs in the software.
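To make the three metric definitions concrete, the following sketch computes them by hand from the counts of true and false positives and negatives. The two label lists are made-up illustration data, not output from the model above, and the variable names are chosen only for this example.

```python
# Hypothetical true labels and model predictions (1 = has a bug, 0 = no bug).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # bugs correctly flagged
tn = sum(t == 0 and p == 0 for t, p in pairs)  # clean code correctly passed
fp = sum(t == 0 and p == 1 for t, p in pairs)  # clean code wrongly flagged
fn = sum(t == 1 and p == 0 for t, p in pairs)  # bugs that were missed

# Accuracy: share of all predictions that were right.
accuracy = (tp + tn) / len(pairs)
# Precision: share of positive predictions that were right.
precision = tp / (tp + fp) if (tp + fp) else 0.0
# Recall: share of actual positives that were found.
recall = tp / (tp + fn) if (tp + fn) else 0.0

print("Accuracy: ", accuracy)
print("Precision: ", precision)
print("Recall: ", recall)
```

The three numbers differ here because the hypothetical model raises two false alarms but misses only one real bug, so precision is lower than recall.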
This is a simple, basic use case of supervised learning in software engineering. In the real world, the data would be more complex and the model might need more tuning. It is also just one example of supervised learning; different problems might require other models and techniques.
Unsupervised learning
Unsupervised learning, by contrast, involves training a model on an unlabeled dataset where the correct output is unknown. The goal is to discover hidden patterns or structures in the data. For example, an unsupervised learning algorithm can be trained on a dataset of customer data to find segments of similar customers. Common algorithms used for unsupervised learning include clustering methods such as k-means.
Example: Identifying patterns in code
An example of unsupervised learning in software engineering is using a machine learning model to identify patterns in source code and group similar functions. This is a common task in software development and maintenance, and it can be implemented using a clustering algorithm such as k-means.
Here’s an example of how this can be done in Python using the scikit-learn library:
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Load the source code data
data = pd.DataFrame([['Func_1',100,5,2],['Func_2',200,8,4],['Func_3',300,6,5],['Func_4',150,9,3],['Func_5',250,7,6],['Func_6',350,5,8],['Func_7',200,9,1],['Func_8',300,8,2],['Func_9',250,7,3],['Func_10',400,6,5],['Func_11',150,8,1],['Func_12',200,7,2],['Func_13',250,9,3],['Func_14',300,8,4],['Func_15',200,6,2],['Func_16',150,9,1],['Func_17',250,7,3],['Func_18',350,8,5],['Func_19',200,6,2],['Func_20',300,7,4]], columns=['Function_Name','lines_of_code','complexity','age_of_code'])

# Extract the features
X = data[["lines_of_code", "complexity", "age_of_code"]]

# Create and train the K-Means model
model = KMeans(n_clusters=3)
model.fit(X)

# Use the model to predict the cluster for each function
y_pred = model.predict(X)

# Plot the functions in 3D
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X["lines_of_code"], X["complexity"], X["age_of_code"], c=y_pred)
ax.set_xlabel("Lines of Code")
ax.set_ylabel("Complexity")
ax.set_zlabel("Age of Code")
plt.savefig('output/3dplot.png')  # save before show(); assumes the output/ directory exists
plt.show()
In this example, the data is loaded and the features are extracted as before.
The KMeans model is trained in lines 13–14 and used to predict the cluster for each function in line 17. The resulting predictions are used as the color of each point in the 3D plot in lines 20–27.
The 3D plot shows the functions in 3D space, with the x-axis representing lines of code, the y-axis representing complexity, and the z-axis representing code age.
Each point in the plot is a function, and the point’s color represents the cluster to which the function belongs.
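To show what a k-means model does when it is fitted, here is a plain-Python sketch of the general idea: repeatedly assign each point to its nearest centroid, then move each centroid to the mean of its points. It is an illustration only, not scikit-learn’s implementation, which differs in how it initializes centroids and in other details. The kmeans function, the six sample points (lines of code, complexity), and the choice of starting centroids are all assumptions made for this example.

```python
def kmeans(points, centroids, max_iter=100):
    """Plain k-means: alternate assignment and centroid update until stable."""
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # nothing moved, so the clustering has converged
        centroids = new_centroids
    return labels, centroids

# Hypothetical functions described by (lines_of_code, complexity).
points = [(100, 5), (120, 6), (110, 4), (300, 9), (320, 8), (310, 10)]
labels, centroids = kmeans(points, centroids=[points[0], points[-1]])
print(labels)
print(centroids)
```

Because distances are computed on the raw values, the feature with the largest numeric range (here, lines of code) dominates the grouping, which is one reason features are often rescaled before clustering.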
Reinforcement learning
Reinforcement learning is a type of machine learning that focuses on learning from the consequences of actions. The agent (the learning model) receives rewards or penalties based on its actions, and the goal is to learn a policy that maximizes the total reward over time. An example of reinforcement learning would be a self-driving car learning how to navigate a city. The vehicle receives rewards for reaching its destination safely and efficiently and penalties for running red lights or causing accidents.
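The reward-driven loop described above can be sketched with tabular Q-learning, one common reinforcement learning algorithm, on a deliberately tiny problem. Everything here is an assumption for illustration: the environment is a hypothetical five-cell corridor in which the agent starts in cell 0 and earns a reward of 1.0 for reaching cell 4, there are no penalties, and the constants and function names are invented for this sketch.

```python
import random

N_STATES, GOAL = 5, 4  # corridor cells 0..4; reaching cell 4 ends an episode
LEFT, RIGHT = 0, 1
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2  # learning rate, discount, exploration rate

def step(state, action):
    """Hypothetical environment: move one cell; reward 1.0 only on reaching the goal."""
    next_state = min(state + 1, GOAL) if action == RIGHT else max(state - 1, 0)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

def choose_action(q, state, rng):
    """Epsilon-greedy: explore occasionally, otherwise act greedily (random tie-break)."""
    if rng.random() < EPSILON:
        return rng.choice((LEFT, RIGHT))
    return max((LEFT, RIGHT), key=lambda a: (q[state][a], rng.random()))

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action] value estimates
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            action = choose_action(q, state, rng)
            next_state, reward = step(state, action)
            # Q-learning update: move the estimate toward
            # reward + discounted value of the best next action.
            target = reward + GAMMA * max(q[next_state])
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q

q_table = train()
# The learned policy: the highest-valued action in each non-goal cell.
policy = [max((LEFT, RIGHT), key=lambda a: q_table[s][a]) for s in range(GOAL)]
print(policy)
```

After training, the greedy policy chooses RIGHT in every cell, because moving toward the goal yields the reward sooner and the discount factor makes earlier rewards worth more.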
It’s important to note that machine learning is a complex and ever-evolving field, and new techniques and algorithms are constantly being developed. Additionally, the quality of the results depends heavily on the quality of the data, the choice of algorithm, and the skill of the person implementing it. Therefore, it’s crucial to have a strong understanding of the underlying concepts and principles and to be familiar with commonly used tools and libraries, such as TensorFlow, scikit-learn, and Keras. Understanding the limitations and ethical considerations of using machine learning is equally important.
In summary, machine learning is a powerful tool that enables computers to learn from data without being explicitly programmed. It allows us to make predictions and automate decision-making in various industries. However, it’s essential to understand its limitations, the quality of the data it depends on, and the ethical considerations that come with it.