Fisher information is a statistical quantity that measures how much information an observed random variable carries about the unknown parameter of its distribution. A probability distribution may depend on several parameters; in that case, there is a separate Fisher information value for each parameter.
We can compute Fisher information using the formula shown below:

$$I(\theta) = \mathbb{E}\left[\left(\frac{\partial}{\partial \theta} \log f(X;\theta)\right)^{2}\right]$$

Here, $X$ is the observed random variable, $f(X;\theta)$ is its probability density (or mass) function under the parameter $\theta$, and $\frac{\partial}{\partial \theta} \log f(X;\theta)$ is called the score. The expectation is taken over $X$.

Alternatively, because the score has zero mean at the true parameter value, we can write Fisher information as the variance of the score:

$$I(\theta) = \mathrm{Var}\left(\frac{\partial}{\partial \theta} \log f(X;\theta)\right)$$
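To make these two equivalent forms concrete, here is a minimal Monte Carlo sketch in Python. It assumes, purely for illustration, that $X$ follows a Poisson distribution with mean $\lambda$, whose score is $x/\lambda - 1$ and whose Fisher information has the closed form $I(\lambda) = 1/\lambda$:

```python
import numpy as np

rng = np.random.default_rng(0)

lam = 4.0                        # illustrative "true" parameter value
samples = rng.poisson(lam, size=200_000)

# Score: derivative of the Poisson log-likelihood with respect to lambda,
# d/d(lambda) [x*log(lambda) - lambda - log(x!)] = x/lambda - 1
score = samples / lam - 1.0

print(np.mean(score**2))         # E[score^2] ~ 0.25
print(np.var(score))             # Var(score) ~ 0.25
print(1.0 / lam)                 # closed form 1/lambda = 0.25
```

The expectation of the squared score and the variance of the score agree, as the two formulas above promise.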
Initially, in most probabilistic applications, we have little information about how close to the truth the parameter values our model operates on are. Neural networks are an example: at the start, we have few clues about the right model parameters. We therefore initialize training with a reasonable approximation of the parameter values and refine it from there.
As a running example, let's consider a neuron trained to predict the number of fish in a pond from a set of input features.
Likelihood answers the question of how plausible a certain parameter value is, given a certain observed output.
We can quantify likelihood as follows: for a given parameter value $\theta$ and an observed output $y$,

$$L(\theta \mid y) = p(y \mid \theta)$$

For instance, let's suppose the predicted number of fish is some value $y$. Evaluating $L(\theta \mid y)$ for different candidate values of $\theta$ tells us which parameter values make that observation most plausible.
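As a concrete sketch, suppose (our modeling assumption; the text doesn't fix a distribution) that the fish count follows a Poisson distribution with mean $\theta$, and take $y = 5$ as a hypothetical observed count:

```python
import math

def poisson_likelihood(theta, y):
    """L(theta | y) = p(y | theta) under a Poisson model."""
    return theta**y * math.exp(-theta) / math.factorial(y)

y_observed = 5                       # hypothetical observed fish count
for theta in (2.0, 5.0, 8.0):
    print(theta, poisson_likelihood(theta, y_observed))
# theta = 5.0 yields the highest likelihood of the three for y = 5
```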
It is convenient to take the logarithm of the likelihood function, since it is easier to differentiate with respect to the parameter value. Recall that the goal of training is to reach the optimal point where the parameters are as close as possible to their true values.
Plotted against the parameter, this is a maximization problem: we need to find the point where the log-likelihood attains its maximum. For many common distributions, taking the log makes the objective concave, which the raw likelihood doesn't guarantee, so the maximum is easier to locate; a sketch follows below.
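Continuing the hypothetical Poisson model from above, a short sketch makes the maximization visible: evaluate the log-likelihood on a grid of parameter values and pick the maximizer. The constant $-\log(y!)$ term is dropped because it doesn't depend on $\theta$:

```python
import numpy as np

y = 5                                  # hypothetical observed fish count
thetas = np.linspace(0.5, 15.0, 500)

# Poisson log-likelihood in theta, with the constant -log(y!) term dropped
log_lik = y * np.log(thetas) - thetas

print(thetas[np.argmax(log_lik)])      # ~ 5.0: the maximum sits at theta = y
```

For this model the log-likelihood is concave in $\theta$, so the grid maximum is also the global one.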
When we take the first derivative of the log-likelihood with respect to $\theta$, we obtain what is known as the score function:

$$s(\theta) = \frac{\partial}{\partial \theta} \log L(\theta \mid y)$$

Conceptually, as per our example, for the value of $\theta$ that best explains the observed fish count, the score is zero: nudging the parameter in either direction no longer increases the log-likelihood. Away from that point, the sign of the score tells us in which direction the log-likelihood still rises.
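Under the same hypothetical Poisson model, the score is $y/\theta - 1$, and its sign shows which way the log-likelihood is still rising:

```python
def score(theta, y):
    """Derivative of the Poisson log-likelihood: y/theta - 1."""
    return y / theta - 1.0

y = 5                   # hypothetical observed fish count
print(score(2.0, y))    #  1.5   -> positive: log-likelihood still rising
print(score(5.0, y))    #  0.0   -> theta = y is the maximum-likelihood estimate
print(score(8.0, y))    # -0.375 -> negative: past the peak
```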
We can view the derivative of the log-likelihood, the score, as just another random variable that follows some probability distribution. Consequently, it also has a variance that we can compute.
Variance provides intuition into the spread associated with the rate of change of the log-likelihood with respect to $\theta$. As defined earlier, this variance of the score is precisely the Fisher information $I(\theta)$.
On the whole, a higher variance of the score means the log-likelihood is sharply peaked around the optimum, so an observation carries more information about the true parameter value; a nearly flat log-likelihood produces a score with low variance and tells us little. This is what makes Fisher information the basis of parameter tuning in techniques such as natural gradient descent.
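To sketch how this feeds into natural gradient descent, consider one-parameter updates under our hypothetical Poisson model, where $I(\theta) = 1/\theta$. Each natural-gradient step preconditions the ordinary gradient (the score) by the inverse Fisher information; the values here are illustrative assumptions, not from the text:

```python
y = 5.0                 # hypothetical observed fish count
theta = 2.0             # initial parameter guess
lr = 0.1                # learning rate

for _ in range(50):
    grad = y / theta - 1.0         # score: ordinary gradient of the log-likelihood
    fisher = 1.0 / theta           # Poisson Fisher information, I(theta) = 1/theta
    theta += lr * grad / fisher    # natural-gradient step: simplifies to lr * (y - theta)

print(theta)                       # ~ 5.0: converges to the maximum-likelihood estimate
```

Dividing by the Fisher information rescales each step according to how informative the current parameter region is, which is the core idea behind natural-gradient methods.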
Practically, Fisher information tells us how much an observation can reveal about the accuracy of a model's current parameters. It is therefore pivotal in deciding how the parameters should be tuned to fit the underlying distribution better.