Efficient Pooling Strategies

Explore how efficient summary statistics in pooling layers can optimize feature extraction in convolutional networks.

The strength of a convolutional network is its ability to simplify the feature extraction process. Pooling plays a critical role in this by removing extraneous information. A pooling operation condenses the features in a window into a summary statistic, so whether the relevant information is preserved or lost depends on the efficiency of that statistic.

What’s an efficient summary statistic?

A summary statistic is a construct from principles of data reduction. It summarizes a set of observations to preserve the largest amount of information as succinctly as possible.

Therefore, an efficient summary statistic is one that concisely contains the most information about a sample, such as the sample mean or maximum. Other statistics, like the sample skewness or sample size, do not contain as much relevant information and are therefore not efficient for pooling. This lesson lays out the theory of summary statistics in order to identify efficient statistics for pooling.

“An experimenter might wish to summarize the information in a sample by determining a few key features of the sample values. This is usually done by computing (summary) statistics—functions of the sample.” (Casella and Berger 2002)

Learning the dependence of pooling on the efficiency of summary statistics and the theory behind them is rewarding. It provides answers to questions like:

  • Currently, max-pool and average-pool are the most common. Could there be other equally or more effective pooling statistics?

  • Max-pool is found to be robust and, therefore, better than others in most problems. What is the cause of max pooling’s robustness?

  • Can more than one pooling statistic be used together? If so, how can the best combination of statistics be found?

This lesson goes deeper into the theory of extracting meaningful features in the pooling layer. In doing so, the above questions are answered. Moreover, the theory behind summary statistics also provides an understanding of appropriately choosing a single or a set of statistics for pooling.

Note: A pooling operation computes a summary statistic, and its efficacy relies on the efficiency of that statistic.

In the following, summary statistics applicable to pooling are explained in three categories:

  • Sufficient (minimal) statistics
  • Complete statistics
  • Ancillary statistics

Definitions

The feature map outputted by a convolutional layer is the input to a pooling layer. The feature map is a random variable

$$X = \{X_1, \ldots, X_n\}$$

where $n$ is the feature map size.

An observation of the random variable is denoted as:

$$x = \{x_1, \ldots, x_n\}.$$

Describing properties of random variables is beyond the scope of this course, but it suffices to know that their true underlying distribution and parameters are unknown. The distribution function of the random variable $X$, that is, its pdf or pmf (the probability density function or probability mass function for continuous or discrete distributions, respectively), is denoted as $f$. The distribution has an underlying unknown parameter $\theta$. Therefore, $\theta$ characterizes the observed $x$ and should be estimated.

A summary statistic of $f(X)$ is an estimate of $\theta$. The statistic is a function of the random variable denoted as $T(X)$ and computed as $T(x)$ from the sample observations. The sample mean, median, maximum, standard deviation and many more are examples of the function $T$. The goal is to determine $T$’s that contain the most information of the feature map, achieve the most data reduction, and are the most efficient. These $T$’s are the best choice for pooling in convolutional networks.
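
To make the role of the statistic concrete, the sketch below applies a few candidate statistics as pooling functions over non-overlapping windows of a small feature map. It is a minimal NumPy illustration, not a framework implementation: the helper name `pool2d`, the 2×2 window, and the toy values are all assumptions made for this example.

```python
import numpy as np

def pool2d(feature_map, stat, size=2):
    """Apply the statistic `stat` to each non-overlapping size x size window.

    `pool2d` is a hypothetical helper written for illustration only.
    """
    h, w = feature_map.shape
    # Drop trailing rows/columns that do not fill a complete window.
    h, w = h - h % size, w - w % size
    windows = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    # Group the two within-window axes together and flatten each window.
    windows = windows.transpose(0, 2, 1, 3).reshape(h // size, w // size, size * size)
    # `stat` plays the role of T: it reduces each window to one number.
    return stat(windows, axis=-1)

# Toy 4x4 feature map (values chosen arbitrarily for the example).
x = np.array([[1.0, 3.0, 2.0, 0.0],
              [5.0, 7.0, 4.0, 6.0],
              [0.0, 2.0, 9.0, 1.0],
              [4.0, 6.0, 3.0, 3.0]])

max_pooled = pool2d(x, np.max)        # MaxPool
avg_pooled = pool2d(x, np.mean)       # AvgPool
median_pooled = pool2d(x, np.median)  # a less common choice of T
std_pooled = pool2d(x, np.std)        # another candidate statistic

print(max_pooled)
print(avg_pooled)
```

Passing the statistic as an argument keeps the windowing logic separate from the choice of statistic, so swapping one candidate for another is a one-word change.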

Sufficient (minimal) statistics

The concept of sufficient statistics lays down the foundation of data reduction by summary statistics.

“A sufficient statistic for a distribution parameter $\theta$ is a statistic that, in a certain sense, captures all the information about $\theta$ contained in the sample.” (Casella and Berger 2002)

It is formally defined as follows.

Sufficient statistic

A statistic $T(X)$ is a sufficient statistic for $\theta$ if the sample conditional distribution $f(X \mid T(X))$ does not depend on $\theta$.

The definition can be interpreted as the conditional distribution of $X$ given $T(X)$, that is, $f(X \mid T(X))$, is independent of $\theta$. This implies that in the presence of the statistic $T(X)$, any remaining information in the underlying parameter $\theta$ is not required.

Note: A sufficient statistic can replace the distribution parameter $\theta$.

It is possible only if $T(X)$ contains all the information about $\theta$ available in $X$. Therefore, $T(X)$ becomes a sufficient statistic to represent the sample in place of $\theta$.

For example,

  • Mean: The sample mean,

    $$T(X) = \bar{X} = \frac{\sum_{i} X_i}{n},$$

    is a sufficient statistic for a sample from a normal or exponential distribution.

  • Maximum: The sample maximum,

    $$T(X) = X_{(n)},$$

    where $X_{(n)} = \max_i X_i$, $i = 1, \ldots, n$, is the $n$-th order statistic, is a sufficient statistic in a (truncated) uniform distribution or approximately in a Weibull distribution if its shape parameter is large.

Average pooling (AvgPool) and max pooling (MaxPool), the two most commonly used pooling methods, indirectly originate from these two statistics. Of the two, MaxPool is the more popular.
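
One way to see sufficiency numerically is to check that two different samples with the same mean give identical evidence about the normal mean. The sketch below does this under stated assumptions: the data are treated as iid normal with a known standard deviation of 1, and the function names, window values, and candidate means are hypothetical choices for illustration.

```python
import numpy as np

def normal_log_likelihood(x, mu, sigma=1.0):
    """Log-likelihood of an iid N(mu, sigma^2) sample x (sigma assumed known)."""
    n = x.size
    return (-0.5 * n * np.log(2.0 * np.pi * sigma**2)
            - np.sum((x - mu) ** 2) / (2.0 * sigma**2))

def log_likelihood_ratio(x, mu_a, mu_b):
    """How strongly the sample x favors mu_a over mu_b."""
    return normal_log_likelihood(x, mu_a) - normal_log_likelihood(x, mu_b)

# Two different pooling windows that share the same sample mean (2.5).
window_1 = np.array([1.0, 2.0, 3.0, 4.0])
window_2 = np.array([2.5, 2.5, 0.0, 5.0])

ratio_1 = log_likelihood_ratio(window_1, mu_a=1.0, mu_b=3.0)
ratio_2 = log_likelihood_ratio(window_2, mu_a=1.0, mu_b=3.0)

# The ratios agree even though the individual values differ: once the
# sample mean is known, the rest of the window adds nothing about mu.
print(np.isclose(ratio_1, ratio_2))
```

The individual values still affect each log-likelihood on its own, but that contribution does not involve the mean parameter and cancels in the ratio. This is the sense in which AvgPool loses nothing about the mean under these assumptions.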

Factorization theorem

Theorem 1. A statistic $T(X)$ is sufficient if and only if functions $g(t \mid \theta)$ and $h(x)$ can be found such that $f(x \mid \theta) = g(T(x) \mid \theta)\,h(x)$.
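
As a brief worked illustration of the factorization (not taken from the source text), consider the exponential and uniform cases mentioned earlier. The rate parameter $\lambda$, the indicator function $\mathbf{1}\{\cdot\}$, the support $x_i \ge 0$, and the notation $x_{(1)}$ for the sample minimum are assumptions introduced for this sketch.

```latex
% Exponential sample with rate \lambda (all x_i >= 0): the sample mean is sufficient.
f(x \mid \lambda) = \prod_{i=1}^{n} \lambda e^{-\lambda x_i}
  = \underbrace{\lambda^{n} e^{-\lambda n \bar{x}}}_{g(T(x) \mid \lambda)}
    \cdot \underbrace{1}_{h(x)},
  \qquad T(x) = \bar{x}.

% Uniform(0, \theta) sample: the sample maximum is sufficient.
f(x \mid \theta) = \prod_{i=1}^{n} \frac{1}{\theta} \mathbf{1}\{0 \le x_i \le \theta\}
  = \underbrace{\theta^{-n} \mathbf{1}\{x_{(n)} \le \theta\}}_{g(T(x) \mid \theta)}
    \cdot \underbrace{\mathbf{1}\{x_{(1)} \ge 0\}}_{h(x)},
  \qquad T(x) = x_{(n)}.
```

In both cases the parameter interacts with the data only through $T(x)$, which is what the theorem requires. The first factorization corresponds to AvgPool and the second to MaxPool.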

Proposition 1. If $X_1, \ldots, X_n$ are iid normally distributed $N(\mu, \sigma^2)$, the sample mean,

$$\bar{x} = \frac{\sum_{i}^{n} x_{i}}{n}$$

and sample variance

$$s^{2} = \frac{\sum_{i}^{n} (x_{i} - \bar{x})^{2}}{(n-1)}$$ ...