Data Preparation and Scaling
Learn to prepare data and scale features for neural network modeling.
Imports and loading data
Modeling starts with the ritualistic library imports. The code below shows all the imports and also a few declarations of constants, such as random generator seeds, the data split percent, and the size of figures to be plotted later.
import tensorflow as tf
from tensorflow.keras import optimizers
from tensorflow.keras.models import Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import AlphaDropout

import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from imblearn.over_sampling import SMOTE
from collections import Counter

import matplotlib.pyplot as plt
import seaborn as sns

# user-defined libraries
import datapreprocessing
import performancemetrics
import simpleplots

from numpy.random import seed
seed(1)

from pylab import rcParams
rcParams['figure.figsize'] = 8, 6

SEED = 123  # used to help randomly select the data points
DATA_SPLIT_PCT = 0.2

print("Data split percent: ", DATA_SPLIT_PCT)
print("Random generator seed: ", SEED)
print("Size of figures to be plotted later: ", rcParams['figure.figsize'])
The code above begins by importing the necessary libraries from TensorFlow and Keras for building neural network models, including Sequential for model creation, Dense for fully connected layers, and Dropout and AlphaDropout for regularization to prevent overfitting. pandas and NumPy are imported for data manipulation and numerical operations, respectively.
StandardScaler from scikit-learn is used for feature scaling, making the data suitable for neural network training by standardizing the feature set. The train_test_split function divides the dataset into training and testing sets. SMOTE from imblearn is imported for handling imbalanced datasets by oversampling the minority class, and Counter from collections is used to count the occurrences of each class in the dataset.
Matplotlib and seaborn are used for plotting, allowing for the visualization of data and model performance metrics. The rcParams setting is adjusted to set a default figure size for all plots.
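Before moving on, the sketch below shows how the scaling, splitting, and oversampling utilities just imported typically fit together. This is only an illustration on synthetic data: the arrays X and y are made-up stand-ins, not the sheet-break data, and the numbers (100 samples, 5 features, 15% positives) are arbitrary.
import numpy as np
from collections import Counter
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Toy data standing in for the real features: 100 samples, 5 features,
# with a deliberately rare positive class (15%); illustrative only
rng = np.random.RandomState(123)
X = rng.normal(size=(100, 5))
y = np.zeros(100, dtype=int)
y[:15] = 1

# Hold out 20% of the data, mirroring DATA_SPLIT_PCT above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Oversample the minority class in the training split only
X_res, y_res = SMOTE(random_state=123).fit_resample(X_train_scaled, y_train)
print(Counter(y_train), Counter(y_res))  # minority class balanced after SMOTE
Fitting the scaler and SMOTE on the training split alone avoids leaking information from the test set, which is why the test features are only transformed, never fit.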
A few user-defined libraries, datapreprocessing, performancemetrics, and simpleplots, are also loaded. They contain custom functions for preprocessing data, evaluating models, and visualizing results, respectively.
Next, two constants are defined: SEED is set to 123, which will be used as the seed for random number generation, and DATA_SPLIT_PCT is set to 0.2, indicating that 20% of the data is held out in the train-test split.
Finally, the script prints informative messages about the data split percentage, the random generator seed, and the size of the figures to be plotted later. This confirms the settings being used in the script.
Next, the data is loaded and processed.
# Read the data
df = pd.read_csv("processminer-sheet-break-rare-event-dataset.csv")

# Convert categorical columns to one-hot dummy columns
hotencoding1 = pd.get_dummies(df['Grade&Bwt'])
hotencoding1 = hotencoding1.add_prefix('grade_')
hotencoding2 = pd.get_dummies(df['EventPress'])
hotencoding2 = hotencoding2.add_prefix('eventpress_')

df = df.drop(['Grade&Bwt', 'EventPress'], axis=1)
df = pd.concat([df, hotencoding1, hotencoding2], axis=1)

# Rename the response column for ease of understanding
df = df.rename(columns={'SheetBreak': 'y'})

print(df)
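To see what the one-hot encoding above produces, here is a small self-contained example on made-up values; the category labels are illustrative and not from the actual dataset.
import pandas as pd

# Made-up categories for illustration only
demo = pd.DataFrame({'Grade&Bwt': ['A40', 'B52', 'A40']})
dummies = pd.get_dummies(demo['Grade&Bwt']).add_prefix('grade_')
print(list(dummies.columns))    # ['grade_A40', 'grade_B52']
print(dummies.iloc[0].tolist())  # first row encodes 'A40'
Each distinct category becomes its own indicator column, and add_prefix keeps the new columns traceable to the original feature after they are concatenated back into the frame.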
Data preprocessing
The objective, as mentioned earlier, is to predict a rare event in advance so that it, or its consequences, can be prevented. From a modeling standpoint, this translates to teaching the model to identify the transitional phase that leads to a rare event.
For example, in the sheet-break problem, a transitional phase could be the speed of one roller drifting upward relative to the other rollers. Such an asynchronous change stretches the paper sheet. If it continues, the sheet's tension increases and ultimately causes a break.
The sheet break typically happens a few minutes after the drift starts. Therefore, if the model is taught to identify the start of the drift, it can predict the break in advance. One simple and effective approach to achieve this is curve shifting.
Curve shifting
Curve shifting here should not be confused with a curve shift in economics or a covariate shift in machine learning. In economics, a curve shift is the phenomenon of the demand curve changing without any price change. A covariate shift, or data shift, in machine learning implies a change in the data distribution due to a shift in the underlying process. Here, curve shifting means aligning the predictors with the response to meet a certain modeling objective.
For early prediction, curve shifting moves the labels earlier in time, so that the samples immediately preceding a rare event get labeled as one. These prior samples are assumed to be the transitional phase that ultimately leads to the rare event. Training a model on these positively labeled transitional samples teaches it to identify the "harbinger" of a rare event in time. This, in effect, is early prediction.
Note: For early prediction, teach the model to identify the transitional phase.
Given the time series samples (x_t, y_t), t = 1, …, n, and a shift of k, curve shifting will
- Label the k samples prior to a positive sample as one, i.e., set y_{t-k}, …, y_{t-1} = 1 wherever y_t = 1.
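A minimal sketch of this label shift, assuming the labels are a 0/1 NumPy array and a chosen shift of k steps, is shown below; the function name shift_labels is hypothetical and not part of the user-defined libraries imported earlier.
import numpy as np

def shift_labels(y, k):
    """Hypothetical sketch: set the k labels preceding each positive sample to 1."""
    y = np.asarray(y).copy()
    for t in np.flatnonzero(y == 1):
        y[max(0, t - k):t] = 1  # mark the transitional phase as positive
    return y

# Toy example: a rare event at t = 5; with k = 2, t = 3 and t = 4 become positive
y = [0, 0, 0, 0, 0, 1, 0, 0]
print(shift_labels(y, k=2))  # [0 0 0 1 1 1 0 0]
A full curve-shifting routine typically also removes the original positive samples after the shift, so the model learns the transition rather than the break itself.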