Data Preparation and Scaling
Learn to prepare data and scale features for neural network modeling.
Imports and loading data
Modeling starts with the ritualistic library imports. The code below shows all the imports and also a few declarations of constants, such as random generator seeds, the data split percent, and the size of figures to be plotted later.
import tensorflow as tf
from tensorflow.keras import optimizers
from tensorflow.keras.models import Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import AlphaDropout

import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from imblearn.over_sampling import SMOTE
from collections import Counter

import matplotlib.pyplot as plt
import seaborn as sns

# user-defined libraries
import datapreprocessing
import performancemetrics
import simpleplots

from numpy.random import seed
seed(1)

from pylab import rcParams
rcParams['figure.figsize'] = 8, 6

SEED = 123  # used to help randomly select the data points
DATA_SPLIT_PCT = 0.2

print("Data split percent: ", DATA_SPLIT_PCT)
print("Random generator seed: ", SEED)
print("Size of figures to be plotted later: ", rcParams['figure.figsize'])
The code above begins by importing the necessary libraries from TensorFlow and Keras for building neural network models, including Sequential for model creation, Dense for fully connected layers, and Dropout and AlphaDropout for regularization to prevent overfitting. pandas and NumPy are imported for data manipulation and numerical operations, respectively.
StandardScaler from scikit-learn is used for feature scaling, making the data suitable for neural network training by standardizing the feature set. The train_test_split function divides the dataset into training and testing sets. SMOTE from imblearn is imported for handling imbalanced datasets by oversampling the minority class, and Counter from collections is used to count the occurrences of each class in the dataset.
Matplotlib and seaborn are used for plotting, allowing for the visualization of data and model performance metrics. The rcParams setting is adjusted to set a default figure size for all plots.
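Before moving on, the sketch below shows how the scaling, splitting, and oversampling utilities just imported typically fit together. This is only an illustration on synthetic data: the arrays X and y are made-up stand-ins, not the sheet-break data, and the numbers (100 samples, 5 features, 15% positives) are arbitrary.
import numpy as np
from collections import Counter
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Toy data standing in for the real features: 100 samples, 5 features,
# with a deliberately rare positive class (15%); illustrative only
rng = np.random.RandomState(123)
X = rng.normal(size=(100, 5))
y = np.zeros(100, dtype=int)
y[:15] = 1

# Hold out 20% of the data, mirroring DATA_SPLIT_PCT above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Oversample the minority class in the training split only
X_res, y_res = SMOTE(random_state=123).fit_resample(X_train_scaled, y_train)
print(Counter(y_train), Counter(y_res))  # minority class balanced after SMOTE
Fitting the scaler and SMOTE on the training split alone avoids leaking information from the test set, which is why the test features are only transformed, never fit.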
A few user-defined libraries, datapreprocessing, performancemetrics, and simpleplots, are also loaded. They contain custom functions for preprocessing data, evaluating models, and visualizing results, respectively.
Next, two constants are defined: SEED is set to 123, which will be used as the seed for random number generation, and DATA_SPLIT_PCT is set to 0.2, indicating that 20% of the data is held out in the train-test split.
Finally, the script prints informative messages about the data split percentage, the random generator seed, and the size of the figures to be plotted later. This confirms the settings being used in the script.
Next, the data is loaded and processed.
# Read the data
df = pd.read_csv("processminer-sheet-break-rare-event-dataset.csv")

# Convert categorical columns to one-hot dummy columns
hotencoding1 = pd.get_dummies(df['Grade&Bwt'])
hotencoding1 = hotencoding1.add_prefix('grade_')
hotencoding2 = pd.get_dummies(df['EventPress'])
hotencoding2 = hotencoding2.add_prefix('eventpress_')

df = df.drop(['Grade&Bwt', 'EventPress'], axis=1)
df = pd.concat([df, hotencoding1, hotencoding2], axis=1)

# Rename the response column for ease of understanding
df = df.rename(columns={'SheetBreak': 'y'})

print(df)
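To see what the one-hot encoding above produces, here is a small self-contained example on made-up values; the category labels are illustrative and not from the actual dataset.
import pandas as pd

# Made-up categories for illustration only
demo = pd.DataFrame({'Grade&Bwt': ['A40', 'B52', 'A40']})
dummies = pd.get_dummies(demo['Grade&Bwt']).add_prefix('grade_')
print(list(dummies.columns))    # ['grade_A40', 'grade_B52']
print(dummies.iloc[0].tolist())  # first row encodes 'A40'
Each distinct category becomes its own indicator column, and add_prefix keeps the new columns traceable to the original feature after they are concatenated back into the frame.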
Data preprocessing
The objective, as mentioned earlier, is to predict a rare event in advance so that it, or its consequences, can be prevented. From a modeling standpoint, this translates to teaching the model to identify the transitional phase that leads to a rare event.
For example, in the sheet-break problem, a transitional phase could be the speed of one roller drifting upward relative to the other rollers. Such an asynchronous change stretches the paper sheet. If it continues, the sheet's tension increases and ultimately causes a break.
The sheet break typically happens a few minutes after the drift starts. Therefore, if the model is taught to identify the start of the drift, it can predict the break in advance. One simple and effective approach to achieve this is curve shifting.
Curve shifting
Curve shifting here should not be confused with a curve shift in economics or a covariate shift in machine learning. In economics, a curve shift is the phenomenon of the demand curve changing without any price change. A covariate shift, or data shift, in machine learning implies a change in the data distribution due to a shift in the underlying process. Here, curve shifting means aligning the predictors with the response to meet a certain modeling objective.
For early prediction, curve shifting moves the labels earlier in time, so that the samples immediately preceding a rare event get labeled as one. These prior samples are assumed to be the transitional phase that ultimately leads to the rare event. Training a model on these positively labeled transitional samples teaches it to identify the "harbinger" of a rare event in time. This, in effect, is early prediction.
Note: For early prediction, teach the model to identify the transitional phase.
Given the time series samples (x_t, y_t), t = 1, …, n, and a shift of k, curve shifting will
- Label the k samples prior to a positive sample as one, i.e., set y_{t-k}, …, y_{t-1} = 1 wherever y_t = 1.
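A minimal sketch of this label shift, assuming the labels are a 0/1 NumPy array and a chosen shift of k steps, is shown below; the function name shift_labels is hypothetical and not part of the user-defined libraries imported earlier.
import numpy as np

def shift_labels(y, k):
    """Hypothetical sketch: set the k labels preceding each positive sample to 1."""
    y = np.asarray(y).copy()
    for t in np.flatnonzero(y == 1):
        y[max(0, t - k):t] = 1  # mark the transitional phase as positive
    return y

# Toy example: a rare event at t = 5; with k = 2, t = 3 and t = 4 become positive
y = [0, 0, 0, 0, 0, 1, 0, 0]
print(shift_labels(y, k=2))  # [0 0 0 1 1 1 0 0]
A full curve-shifting routine typically also removes the original positive samples after the shift, so the model learns the transition rather than the break itself.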