Data Preparation and Scaling

Learn to prepare data and scale features for neural network modeling.

Imports and loading data

Modeling starts with the customary library imports. The code below shows all the imports along with a few constant declarations, such as the random generator seed, the data split percentage, and the size of figures to be plotted later.

import tensorflow as tf
from tensorflow.keras import optimizers
from tensorflow.keras.models import Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import AlphaDropout
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns
# user-defined libraries
import datapreprocessing
import performancemetrics
import simpleplots
from numpy.random import seed
seed(1)  # seed NumPy's global random number generator for reproducibility
from pylab import rcParams
rcParams['figure.figsize'] = 8, 6
SEED = 123  # used to help randomly select the data points
DATA_SPLIT_PCT = 0.2
print("Data split percent: ", DATA_SPLIT_PCT)
print("Random generator seed: ", SEED)
print("Size of figures to be plotted later: ", rcParams['figure.figsize'])

The code above begins with importing necessary libraries from TensorFlow and Keras for building neural network models, including Sequential for model creation, Dense for fully connected layers, and Dropout and AlphaDropout for regularization to prevent overfitting. pandas and NumPy are imported for data manipulation and mathematical operations, respectively.
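To see how these pieces fit together, the following is a minimal sketch of a small fully connected network with Dropout regularization; the layer sizes and the ten-feature input dimension are illustrative placeholders, not the lesson's actual model.

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout

model = Sequential([
    Input(shape=(10,)),             # placeholder: 10 input features
    Dense(32, activation='relu'),   # fully connected hidden layer
    Dropout(0.2),                   # randomly zeroes 20% of units during training
    Dense(1, activation='sigmoid')  # binary output for event vs. no event
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.summary()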

StandardScaler from scikit-learn is used for feature scaling, standardizing each feature to zero mean and unit variance so the data is well-conditioned for neural network training. The train_test_split function is used to divide the dataset into training and testing sets.
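As a minimal sketch of how the two work together, assuming placeholder arrays X and y, the scaler is fit on the training split only so that test-set statistics do not leak into training:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 5)             # placeholder feature matrix
y = np.random.randint(0, 2, size=100)  # placeholder binary labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)

scaler = StandardScaler().fit(X_train)    # learn mean and std from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the training statistics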

SMOTE from imblearn is imported for handling imbalanced datasets by oversampling the minority class. Counter from collections is used to count the occurrences of each class in the dataset.
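The sketch below demonstrates both on a synthetic imbalanced dataset (the 95/5 class weights are illustrative): SMOTE synthesizes new minority-class samples until the classes are balanced, and Counter verifies the class counts before and after.

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
                           random_state=123)
print("Before:", Counter(y))     # heavily skewed toward class 0

X_res, y_res = SMOTE(random_state=123).fit_resample(X, y)
print("After:", Counter(y_res))  # both classes now equally represented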

Matplotlib and seaborn are used for plotting, allowing for the visualization of data and model performance metrics. The rcParams setting is adjusted to set a default figure size for all plots.

Three user-defined libraries, datapreprocessing, performancemetrics, and simpleplots, are also loaded. They provide custom functions for preprocessing data, evaluating models, and visualizing results, respectively.

Two constants are then defined: SEED is set to 123 and is used to seed the random number generators, and DATA_SPLIT_PCT is set to 0.2, reserving 20% of the data for the test set.

Finally, the script prints informative messages about the data split percentage, the random generator seed, and the default figure size, confirming the settings in use.

Next, the data is loaded and processed.

# Read the data
df = pd.read_csv("processminer-sheet-break-rare-event-dataset.csv")

# Convert categorical columns to one-hot dummy columns
hotencoding1 = pd.get_dummies(df['Grade&Bwt'])
hotencoding1 = hotencoding1.add_prefix('grade_')
hotencoding2 = pd.get_dummies(df['EventPress'])
hotencoding2 = hotencoding2.add_prefix('eventpress_')
df = df.drop(['Grade&Bwt', 'EventPress'], axis=1)
df = pd.concat([df, hotencoding1, hotencoding2], axis=1)

# Rename the response column for ease of understanding
df = df.rename(columns={'SheetBreak': 'y'})
print(df)
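To make the one-hot encoding step concrete, here is a toy illustration with hypothetical category values: pd.get_dummies creates one indicator column per category, and add_prefix tags each new column with its source.

import pandas as pd

toy = pd.DataFrame({'EventPress': ['low', 'high', 'low']})
dummies = pd.get_dummies(toy['EventPress']).add_prefix('eventpress_')
print(dummies)
# Two indicator columns, eventpress_high and eventpress_low,
# with exactly one of them set in each row.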

Data preprocessing

The objective, as mentioned earlier, is to predict a rare event in advance so that it, or its consequences, can be prevented. From a modeling standpoint, this translates to teaching the model to identify the transitional phase that leads up to a rare event.

For example, in the sheet-break problem, a transitional phase could be the speed of one roller drifting upward relative to the other rollers. Such an asynchronous change stretches the paper sheet. If it continues, the sheet’s tension increases and ultimately causes a break.


The sheet break typically happens a few minutes after the drift starts. Therefore, if the model is taught to identify the start of the drift, it can predict the break in advance. One simple and effective approach to achieve this is curve shifting.

Curve shifting

Curve shifting here should not be confused with curve shift in economics or covariate shift in machine learning. In economics, a curve shift is a phenomenon of the demand curve changing without any price change. A covariate shift or data shift in machine learning implies a change in data distribution due to a shift in the process. Here, it means aligning the predictors with the response to meet a certain modeling objective.

For early prediction, curve shifting moves the labels earlier in time, so that the samples preceding a rare event are labeled as one. These prior samples are assumed to be the transitional phase that ultimately leads to the rare event. Providing a model with these positively labeled transitional samples teaches it to identify the “harbinger” of a rare event in time. This, in effect, is early prediction.

Note: For early prediction, teach the model to identify the transitional phase.
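A minimal sketch of curve shifting is given below. It assumes the rows of df are ordered in time with the binary response in column y; the helper name curve_shift and the shift horizon k are illustrative, not the lesson's exact implementation.

import pandas as pd

def curve_shift(df, k=2):
    # Mark each of the k rows preceding a positive sample as positive
    # by OR-ing the label with its next-step shifts.
    y = df['y'].astype(bool)
    shifted = y.copy()
    for i in range(1, k + 1):
        shifted |= y.shift(-i, fill_value=False)
    out = df.copy()
    out['y'] = shifted.astype(int)
    return out

After this relabeling, a classifier trained on the data sees the transitional rows as positive examples, which is exactly the early-warning signal described above.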

Given the time series samples $(y_t, x_t),\ t = 1, 2, \ldots$, curve shifting will

  • Label the $k$ samples prior to a positive sample as one, i.e., $y_{t-1}, \ldots, y_{t-k} \leftarrow 1$
...