Computing Continuous Posterior Distribution
In this lesson we will try answering the question: what is the continuous posterior when we are given an observation of the discrete result?
In the previous lesson, we posed and solved a problem in Bayesian reasoning involving only discrete distributions, and then proposed a variation on the problem in which the prior becomes a continuous distribution while the likelihood function still produces a discrete distribution.
Continuous Posterior Given an Observation of Discrete Result
The question is: what is the continuous posterior when we are given an observation of the discrete result?
More specifically, the problem we gave was: suppose we have a prior in the form of a process which produces random values between 0.0 and 1.0. We sample from that process and produce a coin that comes up heads with the given probability. We flip the coin; it comes up heads. What is the posterior distribution of coin probabilities?
Here’s one way to think about it: Suppose we stamp the probability of the coin coming up heads onto the coin. We mint and then flip a million of those coins once each. We discard all the coins that came up tails. The question is: what is the distribution of probabilities stamped on the coins that came up heads?
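To make that thought experiment concrete, here is a minimal simulation sketch. It assumes, purely for illustration, that the mint’s prior is uniform between 0.0 and 1.0 and uses System.Random as a stand-in sampler; the real prior we will use later is not uniform.

using System;
using System.Collections.Generic;

// Sketch of the “stamped coins” thought experiment.
// Assumption for illustration only: the prior is uniform on [0, 1].
var random = new Random();
var headsCoinProbabilities = new List<double>();
for (int i = 0; i < 1_000_000; i += 1)
{
    double p = random.NextDouble();             // stamp a probability onto the coin
    bool cameUpHeads = random.NextDouble() < p; // flip the coin once
    if (cameUpHeads)
        headsCoinProbabilities.Add(p);          // keep only the coins that came up heads
}
// A histogram of headsCoinProbabilities approximates the posterior given heads.

With a uniform prior the kept probabilities pile up towards 1.0, which matches the intuition that a coin which came up heads was more likely to have been biased towards heads.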
Let’s remind ourselves of Bayes’ Theorem. For prior P(A) and likelihood function P(B|A), the posterior P(A|B) is:

P(A|B) = P(A) P(B|A) / P(B)

Remembering of course that P(A|B) is logically a function that takes a B and returns a distribution of A, and similarly for P(B|A).
But so far we’ve only seen examples of Bayes’ Theorem for discrete distributions. Fortunately, it turns out that we can do almost the same arithmetic on our weighted non-normalized distributions and get the correct result.
A formal presentation of the continuous version of Bayes’ Theorem, and a proof that it is correct, would require some calculus and distract from where we want to go in this lesson, so we are just going to wave our hands here. Rest assured that we could put this on a solid theoretical foundation if we chose to.
Let’s think about this in terms of our type system. If we have a prior:
IWeightedDistribution<double> prior = // whatever;
and a likelihood:
Func<double, IWeightedDistribution<Result>> likelihood = // whatever;
then what we want is a function:
Func<Result, IWeightedDistribution<double>> posterior = // ???
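For the coin problem the likelihood has a concrete shape: given a bias d, it is just a weighted two-outcome distribution whose weight for Heads is d. Here is a sketch, using a hypothetical Flip<Result>.Distribution(heads, tails, probability) helper; any weighted two-outcome distribution would do.

// Sketch only: the likelihood for a coin with bias d towards heads.
// Flip<Result>.Distribution is a hypothetical helper; all that matters
// is that likelihood(d).Weight(Heads) comes out proportional to d.
Func<double, IWeightedDistribution<Result>> likelihood = d =>
    Flip<Result>.Distribution(Heads, Tails, d);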
Let’s suppose our result is Heads. The question is: what is posterior(Heads).Weight(d) equal to, for any d we care to choose? We just apply Bayes’ Theorem on the weights. That is, this expression should be equal to:
prior.Weight(d) * likelihood(d).Weight(Heads) / ???.Weight(Heads)
We have a problem; we do not have an IWeightedDistribution<Result> from which to get Weight(Heads) to divide through by. That is: we need to know what the probability is of getting Heads if we sample a coin from the mint, flip it, and do not discard anything.
We could estimate it by repeated computation. We could call:
likelihood(prior.Sample()).Sample()
a billion times; the fraction of them that are Heads is the weight of Heads overall.
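A brute-force sketch of that estimate, assuming the prior and likelihood declared earlier and an illustrative sample count, might look like this:

// Brute-force estimate of Weight(Heads); assumes the prior and likelihood
// declared earlier are in scope. The sample count here is illustrative.
int headsCount = 0;
const int sampleCount = 1_000_000;
for (int i = 0; i < sampleCount; i += 1)
{
    if (likelihood(prior.Sample()).Sample() == Heads)
        headsCount += 1;
}
double estimatedHeadsWeight = (double)headsCount / sampleCount;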
That sounds expensive though. Let’s give this some more thought.
Whatever the Weight(Heads) is, it is a positive constant, right? And we have already abandoned the requirement that weights have to be normalized so that the area under the curve is exactly 1.0.
Positive constants do not affect proportionality.
We do not need to compute the denominator at all to solve a continuous Bayesian inference problem; we just assume that the denominator is a positive constant, and so we can ignore it.
So posterior(Heads) must produce a distribution such that posterior(Heads).Weight(d) is proportional to:
prior.Weight(d) * likelihood(d).Weight(Heads)
But that is just a non-normalized weight function, and we already have the gear to produce a distribution when we are given a non-normalized weight function; we can use our Metropolis class from two lessons ago. It can take a non-normalized weight function, an initial distribution, and a proposal distribution, and produce a weighted distribution from it.
Notice that we don’t even need a distribution that we can sample from; all we need is its weight function.
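Here is a sketch of what that could look like. It assumes the Metropolis<double>.Distribution factory from that lesson takes exactly the three arguments just described, and it assumes a normal distribution helper is available to serve as the proposal; treat the exact names as illustrative.

// Sketch only: the posterior built from non-normalized weights via Metropolis.
// Assumes the Metropolis<T>.Distribution factory described above, plus a
// hypothetical Normal.Distribution(mean, stddev) helper for the proposal.
Func<Result, IWeightedDistribution<double>> posterior = r =>
    Metropolis<double>.Distribution(
        d => prior.Weight(d) * likelihood(d).Weight(r), // non-normalized weight
        prior,                                          // initial distribution
        d => Normal.Distribution(d, 1.0));              // proposal centred on d

Note that likelihood(d) is only ever consulted through its Weight method here, which is exactly the point made above.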
That was all very abstract, so let’s look at the example we proposed last time: a mint with poor quality control produces coins, each of which comes up heads with some particular probability; the distribution of those probabilities is our prior.
Therefore we’ll need a PDF that has zero weight for all values less than 0.0 or greater than 1.0. We don’t care if it is normalized or not.
Remember, this distribution represents the quality of coins that come from the mint, and the value produced by sampling this distribution is the bias of the coin towards heads, where 0.0 is “double-tailed” and 1.0 is “double-headed”.
Beta Distribution
Let’s choose a beta distribution.
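A non-normalized beta-shaped weight function is easy to write down directly; this sketch uses placeholder values for the alpha and beta parameters and is zero outside the unit interval, as required:

// Non-normalized Beta(alpha, beta) weight: x^(alpha-1) * (1-x)^(beta-1).
// Zero outside the unit interval; the default parameter values are
// illustrative placeholders, not a claim about the mint's real quality.
static double BetaWeight(double x, double alpha = 5.0, double beta = 5.0) =>
    (x <= 0.0 || x >= 1.0)
        ? 0.0
        : Math.Pow(x, alpha - 1.0) * Math.Pow(1.0 - x, beta - 1.0);

A weight function like this is exactly the sort of thing we can hand to Metropolis, together with an initial distribution and a proposal distribution, to produce a sampleable prior, and later the posterior built on top of it.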