What kind of patterns can be mined in data mining?

Overview

Data mining can be applied to many types of data. However, it yields helpful information only when the data contains patterns.

Based on what they are used for, patterns can be classified into two categories: descriptive and predictive.

Descriptive patterns

Descriptive patterns summarize the general characteristics of the data and turn them into relevant and helpful information.

Descriptive patterns can be divided into the following types:

  • Class/concept description: Data entries can be associated with classes or concepts. For instance, in a library, the classes of borrowed items include books and research journals, and the concepts of customers include registered and unregistered members. Summarized descriptions of such classes or concepts are called class/concept descriptions.

  • Frequent patterns: These are patterns that occur frequently in the dataset. There are several kinds, such as frequent itemsets, frequent subsequences, and frequent substructures.

  • Associations: These describe relationships between items in the data, expressed as association rules. For instance, a shopkeeper may find that 70% of the time, when a football is sold, a kit is bought alongside it. This gives the association rule football → kit with a confidence of 70%.

  • Correlations: Correlation analysis measures the statistical relationship between two variables to determine whether they are positively correlated, negatively correlated, or uncorrelated.

  • Clusters: Clustering groups similar data points together. Points within a cluster are similar to one another but differ markedly from points in other clusters.
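
To make the association example concrete, here is a minimal sketch of how the support and confidence of a rule such as football → kit could be computed. The transactions are made up purely for illustration, and the helper names `support` and `confidence` are hypothetical, not taken from any library.

```python
# Hypothetical transactions, invented for illustration only
transactions = [
    {"football", "kit"},
    {"football", "kit", "pump"},
    {"football"},
    {"kit"},
    {"football", "kit"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


rule_support = support({"football", "kit"}, transactions)
rule_confidence = confidence({"football"}, {"kit"}, transactions)
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
# prints: support = 0.60, confidence = 0.75
```

Here, a confidence of 0.75 means that in this toy data a kit appears in 75% of the transactions that contain a football.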

Let's delve into the practical implementation of clustering through code. Clustering is a fundamental technique for discovering patterns within data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.cluster import DBSCAN

X, _ = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=4
)

df = pd.DataFrame(X)
print(df.shape)

# Define the model
dbscan_model = DBSCAN(eps=0.35, min_samples=16)

# Train the model
dbscan_model.fit(df)

# Visualize the clusters
plt.figure(figsize=(10, 10))
plt.scatter(df[0], df[1], c=dbscan_model.labels_, s=15)
plt.title('DBSCAN Clustering', fontsize=20)
plt.xlabel('Feature 1', fontsize=14)
plt.ylabel('Feature 2', fontsize=14)
plt.show()

Note: To read more about the DBSCAN algorithm, check out this answer.

  • Lines 1–5: We import the necessary libraries.

  • Lines 7–14: We generate a synthetic dataset with 1,000 samples and 2 features.

  • Lines 16–17: We convert the dataset output X into a data frame and print the shape of the data frame.

  • Line 20: We initialize the DBSCAN model with eps=0.35 and min_samples=16. Both parameters need to be tuned to obtain meaningful clusters and to detect noise well.

  • Line 23: We fit the model to the dataset and generate clusters.

  • Lines 26–31: We visualize the clusters using a scatter plot.
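
Since eps and min_samples need tuning, it can help to check how many clusters and noise points a given setting produces. The sketch below regenerates the same synthetic dataset and compares a few eps values. The candidate values and the helper name `summarize_dbscan` are assumptions chosen for illustration. DBSCAN marks noise points with the label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_classification

# Same synthetic dataset as in the clustering example
X, _ = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=4,
)


def summarize_dbscan(X, eps, min_samples):
    """Return (number of clusters, number of noise points) for one setting."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    n_clusters = len(np.unique(labels[labels != -1]))  # -1 is the noise label
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise


# Candidate eps values are illustrative, not recommendations
results = {}
for eps in (0.2, 0.35, 0.5):
    results[eps] = summarize_dbscan(X, eps, min_samples=16)
    print(f"eps={eps}: clusters={results[eps][0]}, noise points={results[eps][1]}")
```

A setting that merges everything into one cluster, or that labels most points as noise, usually suggests that eps or min_samples should be adjusted.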

Predictive patterns

Predictive patterns use historical data and its known outcomes to predict future or unknown values. They can also help estimate missing values in the data.

Predictive patterns can be categorized into the following types:

  • Classification: It predicts the labels of unknown data points with the help of labeled data points. For instance, for a dataset of patients' X-rays, the possible labels would be cancer patient and not cancer patient. The descriptions of these classes can be obtained by data characterization or by data discrimination.

  • Regression: Unlike classification, which predicts discrete labels, regression predicts numeric values. It can be used to estimate missing numeric values in a dataset and to predict future ones. For instance, we can forecast next year's sales from the past twenty years' sales by modeling the relationship in the data.

  • Outlier analysis: Not all data points in a dataset follow the same behavior. Data points that deviate from the usual behavior are called outliers, and the analysis of these points is called outlier analysis. Outliers are often discarded as noise, but in applications such as fraud detection, they are the points of most interest.

  • Evolution analysis: As the name suggests, it describes the regularities and trends of data points whose behavior changes over time.
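
Outlier analysis can be sketched with a simple statistical rule. The example below flags values that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. This is one common convention, not the only definition of an outlier. The sales figures and the function name `iqr_outliers` are invented for illustration.

```python
import numpy as np


def iqr_outliers(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)


# Hypothetical daily sales with one unusual day
daily_sales = [102, 98, 105, 110, 97, 101, 99, 400, 103, 100]
mask = iqr_outliers(daily_sales)
outliers = [v for v, is_outlier in zip(daily_sales, mask) if is_outlier]
print("Outliers:", outliers)
# prints: Outliers: [400]
```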

Among the predictive patterns, let's see the practical implementation of regression through code. Regression is an essential predictive technique used to understand the relationship between variables and make predictions based on observed data.

import os
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# load_boston was removed in scikit-learn 1.2, so the bundled
# diabetes dataset is used here instead
X, y = load_diabetes(return_X_y=True)

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a linear regression model on the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the test set and evaluate with the mean squared error
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# Plot true vs. predicted values
plt.scatter(y_test, y_pred)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], linestyle='--', color='red', linewidth=2)  # Perfect-prediction line
plt.xlabel("True Values")
plt.ylabel("Predicted Values")
plt.title("True vs. Predicted Values")
os.makedirs('./output', exist_ok=True)  # Make sure the output directory exists
plt.savefig('./output/plot.png')
plt.show()

The code loads the diabetes dataset bundled with scikit-learn and splits it into training and testing sets. It then initializes a linear regression model and fits it to the training data. Predictions are made on the testing set, and the mean squared error (MSE) is calculated to evaluate the model's performance. Finally, the code generates a scatter plot comparing true values against predicted values, with a dashed line indicating perfect predictions.
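
For reference, the MSE is the average of the squared differences between the true and predicted values. The sketch below computes it by hand with NumPy and compares the result with scikit-learn's `mean_squared_error`. The numbers are made up for illustration and are not output from the model above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

# Average of the squared errors
manual_mse = np.mean((y_true - y_hat) ** 2)
library_mse = mean_squared_error(y_true, y_hat)

print("Manual MSE:", manual_mse)
print("Library MSE:", library_mse)
# both print 0.875
```

A lower MSE means the predictions are closer to the true values. Because the errors are squared, the MSE is expressed in squared units of the target.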

Copyright ©2024 Educative, Inc. All rights reserved