Data Preparation and Scaling
Learn to prepare data and scale features for neural network modeling.
Imports and loading data
Modeling starts with the ritualistic library imports. The code below shows all the imports along with a few constant declarations, such as the random generator seed, the train-test data split percent, and the size of figures to be plotted later.
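A minimal sketch of this setup block, assuming TensorFlow 2.x, might look as follows; the 8×6 default figure size and the explicit seeding calls are illustrative choices rather than requirements.

```python
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, AlphaDropout

# User-defined helper modules for preprocessing, evaluation, and plotting.
import datapreprocessing
import performancemetrics
import simpleplots

SEED = 123            # seed for random number generation
DATA_SPLIT_PCT = 0.2  # 20% of the data reserved for testing

# Assumed default figure size applied to all plots.
plt.rcParams['figure.figsize'] = (8, 6)

# Seed the random number generators for reproducibility.
np.random.seed(SEED)
tf.random.set_seed(SEED)

print('Train-test split percent:', DATA_SPLIT_PCT)
print('Random generator seed:', SEED)
print('Figure size:', plt.rcParams['figure.figsize'])
```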
The code above begins by importing the necessary libraries from TensorFlow and Keras for building neural network models: Sequential for model creation, Dense for fully connected layers, and Dropout and AlphaDropout for regularization to prevent overfitting. pandas and NumPy are imported for data manipulation and numerical operations, respectively.
StandardScaler from scikit-learn is used for feature scaling; it standardizes each feature to zero mean and unit variance, making the data suitable for neural network training. The train_test_split function divides the dataset into training and testing sets.
SMOTE from imblearn is imported for handling imbalanced datasets by oversampling the minority class. Counter from collections is used to count the occurrences of each class in the dataset.
Matplotlib and seaborn are used for plotting, allowing for the visualization of data and model performance metrics. The rcParams setting is adjusted to set a default figure size for all plots.
A few user-defined libraries are also loaded: datapreprocessing, performancemetrics, and simpleplots. They contain custom functions for preprocessing data, evaluating models, and visualizing results, respectively.
Next, two constants are defined: SEED, set to 123 and used as the seed for random number generation, and DATA_SPLIT_PCT, set to 0.2 to hold out 20% of the data for testing.
Finally, informative messages are printed about the data split percentage, the random generator seed, and the default figure size, making it easy to verify the settings the script is using.
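As a quick illustration of how these utilities fit together, the following sketch applies the split, scaling, and oversampling steps to synthetic data; the feature matrix X, the labels y, and the 2% class imbalance are hypothetical stand-ins for the real dataset.

```python
import numpy as np
from collections import Counter
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Hypothetical imbalanced dataset: 1,000 samples, 2% positive class.
rng = np.random.default_rng(123)
X = rng.normal(size=(1000, 5))
y = np.zeros(1000, dtype=int)
y[rng.choice(1000, size=20, replace=False)] = 1

# Hold out 20% of the data, preserving the class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y)

# Fit the scaler on the training split only, then transform both splits.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Oversample the minority class in the training split with SMOTE.
X_res, y_res = SMOTE(random_state=123).fit_resample(X_train_scaled, y_train)
print('Before SMOTE:', Counter(y_train))  # e.g., Counter({0: 784, 1: 16})
print('After SMOTE: ', Counter(y_res))    # balanced classes
```

Fitting the scaler on the training split alone prevents information from the test set leaking into the model.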
Next, the data is loaded and processed.
Data preprocessing
The objective, as mentioned earlier, is to predict a rare event in advance so that it, or its consequences, can be prevented. From a modeling standpoint, this translates to teaching the model to identify the transitional phase that leads to a rare event.
For example, in the sheet-break problem, a transitional phase could be the speed of one roller drifting upward relative to the other rollers. Such an asynchronous change stretches the paper sheet. If it continues, the sheet’s tension increases and ultimately causes a break.
The sheet break typically happens a few minutes after the drift starts. Therefore, if the model is taught to identify the start of the drift, it can predict the break in advance. One simple and effective approach to achieve this is curve shifting.
Curve shifting
Curve shifting here should not be confused with curve shift in economics or covariate shift in machine learning. In economics, a curve shift is the phenomenon of a demand curve changing without any change in price. A covariate shift or data shift in machine learning implies a change in the data distribution due to a shift in the underlying process. Here, curve shifting means aligning the predictors with the response to meet a certain modeling objective.
For early prediction, curve shifting moves the positive labels earlier in time, so the samples immediately preceding a rare event are labeled as one. These prior samples are assumed to capture the transitional phase that ultimately leads to the rare event. Providing a model with these positively labeled transitional samples teaches it to identify the “harbinger” of a rare event in time. This, in effect, is early prediction.
Note: For early prediction, teach the model to identify the transitional phase.
Given the time series samples $(x_t, y_t)$, $t = 1, \ldots, n$, where the label $y_t \in \{0, 1\}$ and $y_t = 1$ marks a rare event, curve shifting with a shift of $k$ samples will
- Label the $k$ prior samples to a positive sample as one, i.e., set $y_{t-k}, \ldots, y_{t-1} \leftarrow 1$ if $y_t = 1$.
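To make this concrete, below is one possible pandas implementation of the relabeling step; the function name curve_shift and the shift size k = 2 are illustrative.

```python
import numpy as np
import pandas as pd

def curve_shift(df, k, label='y'):
    """Label the k samples preceding each positive sample as one.

    This relabels the transitional phase before a rare event so the
    model learns to recognize its onset.
    """
    shifted = df.copy()
    label_col = shifted.columns.get_loc(label)
    for loc in np.flatnonzero(df[label].to_numpy() == 1):
        # Set the k rows immediately before the positive row to 1.
        shifted.iloc[max(0, loc - k):loc, label_col] = 1
    return shifted

# Example: with k = 2, labels [0, 0, 0, 1, 0] become [0, 1, 1, 1, 0].
df = pd.DataFrame({'x': [10, 11, 13, 30, 12], 'y': [0, 0, 0, 1, 0]})
print(curve_shift(df, k=2)['y'].tolist())
```

Some implementations additionally drop the rows where the event itself occurs, so the model is not trained on samples from after the break has already begun.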