Statistical Foundations and Prediction Strategies

Building on our understanding of the significance of rare event problems, this section focuses on the formulation of such problems, outlining the statistical underpinnings, defining our objectives, and discussing the inherent challenges.

Underlying statistical process

First, we need to understand the statistical process behind a rare event problem. This helps in selecting an appropriate modeling approach.

The process and data commonalities in the rare event examples are:

  • Time series
  • Multivariate
  • Imbalanced binary labels

Consider our working example of a sheet-break problem. It’s from a continuous paper manufacturing process that generates a data stream.

This makes the underlying process a stochastic time series. A stochastic time series is a sequence of random variables ordered chronologically, where each data point represents an observation at a specific time and is influenced by random, unpredictable factors. These time series embody inherent uncertainty and variability, and statistical methods are frequently utilized to analyze and predict their future behavior. Examples include financial market prices, weather patterns, and certain economic data.


Additionally, this is multivariate data streamed from multiple sensors placed in different machine parts. These sensors collect the process variables, such as temperature, pressure, chemical dose, and many more.

So, at any time $t$, a vector of observations $x_t$ is recorded. Here, $x_t$ is a vector of length equal to the number of sensors, and $x_{it}$ is the reading of the $i$-th sensor at time $t$. Such a process is known as a multivariate time series.

In addition to the process variables $x_t$, a binary label $y_t$ denoting the status of the process is also available. A positive $y_t$ indicates an occurrence of a rare event. Also, the class distribution is imbalanced due to the rareness of positive $y_t$'s.

For instance, the labels in the sheet-break data denote whether the process is running normally ($y_t = 0$) or has a sheet break ($y_t = 1$). The samples with $y_t = 0$ and $y_t = 1$ are referred to as negatively and positively labeled data in the rest of the book. The former is the majority class and the latter the minority. Putting them together, we have an imbalanced multivariate stochastic time series process.

Mathematical representation

An imbalanced multivariate stochastic time series process is mathematically represented as:

$(y_t, x_t), \quad t = 1, 2, \ldots$

where $y_t \in \{0, 1\}$ with $\sum_t \mathbf{1}\{y_t = 1\} \ll \sum_t \mathbf{1}\{y_t = 0\}$, and $x_t \in \mathbb{R}^p$ with $p$ being the number of variables.
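As a sketch, such a process can be simulated with NumPy. The sensor count, series length, and event rate below are illustrative assumptions, not values from the sheet-break dataset:

```python
import numpy as np

rng = np.random.default_rng(42)

T, p = 10_000, 5    # time steps and number of sensors (assumed values)
event_rate = 0.005  # ~0.5% positive labels, mimicking a rare event

# x_t in R^p: one row of sensor readings per time step
x = rng.normal(size=(T, p))

# y_t in {0, 1}: heavily imbalanced binary labels
y = (rng.random(T) < event_rate).astype(int)

print(x.shape)                          # (10000, 5)
print(int(y.sum()), "positives out of", T)
```

Real sensor data would, of course, be autocorrelated rather than independent draws; this sketch only illustrates the shapes and the label imbalance.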

Problem definition

Rare event problems demand early detection or prediction to prevent the event or minimize its impact.

In literature, detection and prediction are considered different problems. However, early detection eventually becomes a prediction in the problems discussed here. For example, early detection of a condition that would lead to a sheet break is essentially predicting an imminent sheet break. This can, therefore, be formulated as a “prediction” problem.

An early prediction problem is predicting an event in advance. Suppose the event prediction is needed $k$ time units in advance. This $k$ should be chosen such that the prediction gives sufficient time to take action against the event.

Mathematical representation

Mathematically, this can be expressed as estimating the probability of $y_{t+k} = 1$ using the information at and until time $t$. This can be written as

$\text{Pr}[y_{t+k} = 1 \mid x_{t-}],$

where $x_{t-}$ denotes the observations at and before time $t$, that is, $x_{t-} = \{x_t, x_{t-1}, \ldots\}$.

The equation also shows that this is a classification problem. Therefore, prediction and classification are used interchangeably in this course.
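To frame the $k$-step-ahead prediction as a classification task, one common preparation step is to shift the labels so that the features at time $t$ are paired with the label at time $t+k$. A minimal sketch with pandas, using toy data and an assumed horizon $k = 2$:

```python
import pandas as pd

# toy frame: two sensor columns and a label column
df = pd.DataFrame({
    "x1": [0.1, 0.3, 0.2, 0.5, 0.4],
    "x2": [1.0, 0.9, 1.1, 1.2, 0.8],
    "y":  [0,   0,   0,   1,   0],
})

k = 2  # predict k time units in advance (assumed horizon)

# pair x_t with y_{t+k}; the last k rows have no future label and are dropped
df["y_future"] = df["y"].shift(-k)
df = df.dropna(subset=["y_future"]).astype({"y_future": int})

print(df[["x1", "x2", "y_future"]])
```

After this shift, a standard binary classifier trained on `(x1, x2) → y_future` is effectively predicting the event $k$ steps ahead.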

Objective

The objective is to “build a binary classifier to predict a rare event in advance.” To do that, appropriate loss function and accuracy measures are selected for predictive modeling.

Loss function

There are a variety of loss functions. Among them, binary cross-entropy loss is chosen here. Cross-entropy is intuitive and has theoretical properties that make it a good choice. Its gradient lessens the vanishing-gradient issue in deep learning networks.

Moreover, from the model-fitting standpoint, minimizing cross-entropy is equivalent to minimizing the Kullback-Leibler divergence between the data and model distributions, which means it yields an approximate estimate of the "true" underlying process distribution.

Mathematical representation

It’s defined as:

$\mathcal{L}(\theta) = -y_{t+k}\log(\text{Pr}[y_{t+k} = 1 \mid x_{t-}, \theta]) - (1 - y_{t+k})\log(1 - \text{Pr}[y_{t+k} = 1 \mid x_{t-}, \theta]),$

where $\theta$ denotes the model parameters.

Entropy is a measure of randomness: the higher the entropy, the greater the randomness. A highly random model is less predictable, so a model that behaves randomly will make poor predictions.

Consider an extreme output of an arbitrary model: an absolutely opposite prediction. For example, estimating $\text{Pr}[y = 1] = 0$ when $y = 1$. In such a case, the loss in the equation above will be:

$\mathcal{L} = -1 \cdot \log(0) - (1 - 1) \cdot \log(1 - 0)$

$\mathcal{L} = -1 \cdot (-\infty) - 0 \cdot 0$

$\mathcal{L} = +\infty$

On the other extreme, consider an oracle model that makes an absolutely true prediction, that is, $\text{Pr}[y = 1] = 1$ when $y = 1$. In this case, the cross-entropy loss becomes

$\mathcal{L} = -1 \cdot \log(1) - (1 - 1) \cdot \log(1 - 1)$

$\mathcal{L} = 0.$

During model training, an arbitrary model is taken as a starting point, so the loss is high at the beginning. The model then iteratively trains itself to lower the loss, driving the cross-entropy from $+\infty$ toward $0$.
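The two extremes above can be checked numerically. Since $\log(0)$ is undefined in floating point, a small epsilon clip stands in for probabilities of exactly $0$ or $1$, as most deep learning frameworks also do:

```python
import math

def bce(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy for a single sample, clipped to avoid log(0)."""
    p = min(max(p_pred, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

print(bce(1, 0.0))  # opposite prediction: very large loss (-> +inf as eps -> 0)
print(bce(1, 1.0))  # oracle prediction: loss is essentially 0
```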

Accuracy measures

Rare event problems have highly imbalanced class distribution. The traditional misclassification accuracy metric does not work here.

This is because more than 99% of our samples are negatively labeled. A model that predicts everything as negative, including all of the minority (less than 1%) positive samples, is still more than 99% accurate. So, a model that cannot predict any rare event would appear accurate. Similarly, the area under the ROC (Receiver Operating Characteristic) curve is unsuitable for such extremely imbalanced problems.

ROC is a graphical representation that illustrates the trade-off between the true positive rate (sensitivity) and the false positive rate (1 − specificity) across different threshold values.
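This accuracy paradox is easy to demonstrate: a model that never predicts the rare event still scores well above 99% accuracy. A sketch with an assumed ~0.5% event rate on synthetic labels:

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic labels with an assumed ~0.5% event rate
y_true = (rng.random(10_000) < 0.005).astype(int)
y_pred = np.zeros_like(y_true)  # a model that never predicts the rare event

accuracy = float((y_true == y_pred).mean())
print(f"accuracy = {accuracy:.4f}")  # high accuracy despite catching zero events
```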

In building a classifier, whenever the usual metrics are unsuitable, it is useful to fall back on the confusion matrix and look at accuracy measures derived from it. A confusion matrix layout is shown in the table below.

Confusion matrix

Note: In a classification context, the “Negative” class typically refers to the common or majority category, while the “Positive” class denotes the rare or minority event.

Actual negative or positive samples that are predicted as such go in the diagonal cells of the matrix as "True Negative" (TN) and "True Positive" (TP), respectively. The other two possibilities, an actual negative predicted as positive and vice versa, are denoted "False Positive" (FP) and "False Negative" (FN), respectively.

In rare event classifiers, the goal leans toward maximizing true positives while ensuring this does not lead to excessive false alerts. In light of this goal, the following accuracy measures are chosen and explained vis-à-vis the confusion matrix.

Recall

The percentage of positive samples correctly predicted as positive, which is

$recall = \frac{TP}{TP + FN}.$

It lies between $0$ and $1$. A high recall indicates the model's ability to predict the minority class accurately. A recall equal to one means the model could detect all rare events. However, this can also be achieved by a dummy model that predicts everything as a rare event. To counterbalance this, we also use the f1-score.

f1-score

It is a combination (harmonic mean) of precision and recall. Precision is the ratio of true positives to all predicted positives, $\frac{TP}{TP + FP}$; it lies between $0$ and $1$, the higher the better, and reflects the model's ability to achieve high true positives with low false positives. The f1-score, therefore, indicates the model's overall ability to predict most of the rare events with as few false alerts as possible. It is computed as:

$f1\text{-}score = \frac{2}{\left(\frac{TP}{TP+FN}\right)^{-1} + \left(\frac{TP}{TP+FP}\right)^{-1}}$

The score lies between 00 and 11, with the higher, the better. If we have the dummy model favoring high recall by predicting all samples as positive (a rare event), the f1-score will counteract it by getting close to zero.

False positive rate

Lastly, it is also critical to measure the false positive rate (fpr). It is the percentage of false alerts, which is:

$fpr = \frac{FP}{FP + TN}$

An excess of false alerts makes us insensitive to the predictions. It is, therefore, imperative to keep the fpr as close to zero as possible.
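The three measures can be computed directly from the confusion-matrix counts. A minimal sketch (scikit-learn's `recall_score` and `f1_score` would give the same results for recall and f1):

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def rare_event_metrics(y_true, y_pred):
    """Return (recall, f1, fpr) from the confusion-matrix counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return recall, f1, fpr

# dummy model that flags everything as a rare event:
# perfect recall, but f1 collapses and fpr hits 1
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [1] * 10
recall, f1, fpr = rare_event_metrics(y_true, y_pred)
print(recall, f1, fpr)
```

On this toy data, the dummy all-positive model attains recall of 1.0, but the f1-score and the false positive rate expose it, exactly the counterbalancing behavior described above.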

In the following lessons, we’ll explore methodologies and tools designed to tackle these challenges, moving from theoretical formulation to practical application.