Evaluation of Learning Algorithms Using ROC Curve
Understand the parameters of the Maximum Likelihood Estimation (MLE) and Maximum A-Posteriori Estimation (MAP) algorithms in Python, and learn how to use the ROC curve to evaluate their performance.
In this lesson, we will use the ROC curve to evaluate several parameter combinations, providing a quantitative measure of each learning algorithm's output. This analysis will help us understand not just the quality of the generated CPDs but also the predictive power and reliability of each algorithm.
First, we dive into the main parameters that we can use.
Bayes prior
The bayes_prior parameter in Bayesian estimation methods for learning the parameters of a Bayesian network refers to the type of prior distribution that is applied during the estimation process.
BDeu as a parameter value:
BDeu stands for "Bayesian Dirichlet equivalent uniform."
It is a type of prior that treats all outcomes of each CPD as equally likely a priori (hence the "uniform" part).
BDeu is designed to be equivalent across different network structures that encode the same assertions of conditional independence (which relates to the "equivalent" part).
The BDeu prior uses a parameter called the "equivalent sample size" which can be thought of as the weight given to the prior relative to the data. A larger equivalent sample size means the prior belief is stronger and influences the final estimate more, while a smaller equivalent sample size gives more weight to the observed data.
Using BDeu makes the Bayesian Estimator act like a MAP (Maximum A-Posteriori) estimator because it combines the likelihood of the observed data with the prior belief to arrive at the final parameter estimates. The MAP estimator does not just take the data at face value (as in the case of Maximum Likelihood Estimation) but adjusts the estimates based on the prior distribution. This can prevent overfitting to the observed data and help in situations where the data might be sparse or incomplete.
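To make this concrete, here is a minimal sketch using pgmpy (the two-node network and toy data are made up for illustration; in pgmpy's BayesianEstimator, the prior_type argument plays the role described for bayes_prior here), contrasting MLE with a BDeu-based, MAP-like estimate:

```python
import pandas as pd
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import MaximumLikelihoodEstimator, BayesianEstimator

# Hypothetical toy data for a two-node network: Rain -> WetGrass.
data = pd.DataFrame({
    "Rain":     [0, 0, 0, 1, 1, 1, 1, 1],
    "WetGrass": [0, 0, 1, 1, 1, 1, 1, 0],
})
model = BayesianNetwork([("Rain", "WetGrass")])

# MLE takes the observed frequencies at face value.
mle = MaximumLikelihoodEstimator(model, data)
print(mle.estimate_cpd("WetGrass"))

# The BDeu prior smooths the counts with 'equivalent_sample_size'
# imagined, uniformly distributed observations (MAP-like behavior).
bayes = BayesianEstimator(model, data)
print(bayes.estimate_cpd("WetGrass", prior_type="BDeu",
                         equivalent_sample_size=10))
```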
K2 as a parameter value:
K2 is a scoring function used for structure learning and not typically a prior in the same sense as BDeu.
However, in some implementations, setting bayes_prior to K2 for parameter learning can imply the use of a prior similar to the one used in the K2 structure learning algorithm. The K2 algorithm assumes that the attributes are ordered and that the probability of a particular structure depends only on this ordering. The K2 scoring function is based on a product of Dirichlet distributions, similar to a Bayesian estimator.
As a prior for parameter estimation, K2 would also imply a Dirichlet prior, but its configuration might differ from BDeu in how much weight it gives to different potential structures and in the equivalent sample size.
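In pgmpy, for instance, a K2-style prior can be requested the same way (this continues the sketch above and reuses its model and data); with prior_type="K2", every state receives a pseudo-count of 1, so no equivalent sample size needs to be chosen:

```python
# Continues the previous sketch: K2 assigns a pseudo-count of 1 to
# every state, a simple Dirichlet prior.
print(bayes.estimate_cpd("WetGrass", prior_type="K2"))
```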
Selecting bayes_prior="BDeu"
invokes the use of a prior that is designed to work equivalently across different network structures, with a certain amount of "imagined" data points (equivalent sample size) contributing to the estimation process. This approach helps to ensure that the parameter estimates are not purely driven by the observed data, which is especially important in cases where data might be limited or certain outcomes are rare. This is in line with the principles of MAP estimation, which adjusts the parameter estimates based on prior knowledge or belief.
Equivalent sample size
The equivalent_sample_size parameter in Bayesian estimation methods is a hyperparameter that influences the strength of the prior distribution in the estimation of the parameters of a Bayesian network.
Here's what it means:
Prior distribution: A prior distribution in Bayesian statistics represents prior beliefs or knowledge before observing any data. In the context of Bayesian networks, the prior distribution for a node's probabilities can be interpreted as hypothetical previous observations or counts that are combined with the actual data.
Equivalent sample size (ESS): The ESS represents how many hypothetical prior observations are considered when estimating the parameters. It's as if the dataset had an extra ESS number of observations, and these observations are distributed according to the prior belief about the parameters.
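As a small worked example of this "imagined counts" view (all numbers below are made up): suppose a binary node is observed in state 1 in 8 of 10 samples, and we use a uniform prior with ESS = 10, i.e., 5 imagined counts per state:

```python
# Hypothetical counts: 10 real observations, 8 of them in state 1.
n_state1, n_total = 8, 10
ess, n_states = 10, 2

# A uniform prior spreads the ESS evenly across the states.
alpha = ess / n_states  # 5 imagined counts per state

mle_estimate = n_state1 / n_total                    # 8 / 10  = 0.80
map_estimate = (n_state1 + alpha) / (n_total + ess)  # 13 / 20 = 0.65
print(mle_estimate, map_estimate)  # the prior pulls the estimate toward 0.5
```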
When you use a Bayesian estimator with a specified prior:
Small ESS: If the equivalent sample size is small relative to the actual data, the prior has less influence, and the data has more weight in determining the parameters. The extreme case, ESS = 0, corresponds to using only the data (equivalent to maximum likelihood estimation).
Large ESS: If the equivalent sample size is large, it means the prior beliefs have a substantial influence on the final parameter estimates. A very large ESS would make the actual data less significant, and the prior would dominate, pushing the parameter estimates towards the prior beliefs.
Therefore, the equivalent_sample_size acts as a balance between the prior distribution and the likelihood of the observed data. It allows the incorporation of domain knowledge or regularization into the parameter estimation process. In practice, the choice of ESS can have a considerable impact on the learned parameters, especially when dealing with small datasets or when attempting to make the model more robust against overfitting to the training data.
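Continuing the earlier pgmpy sketch, this trade-off is easy to observe by re-estimating the same CPD with different equivalent sample sizes (the values below are arbitrary):

```python
# Continues the earlier sketch: a tiny ESS stays close to the MLE,
# while a huge ESS lets the uniform prior dominate the estimate.
for ess in (1, 10, 1000):
    print(f"equivalent_sample_size = {ess}")
    print(bayes.estimate_cpd("WetGrass", prior_type="BDeu",
                             equivalent_sample_size=ess))
```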
Evaluation of the learning algorithm
Now that we know how to use some learning algorithms, it is time to evaluate them. For this, we can use the ROC curve and the AUC, applied to the same Bayesian network we have been working with.
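Since the lesson's exact network and dataset live in the code environment, the snippet below is only a self-contained sketch of the idea (the network, variable names, and synthetic data are hypothetical stand-ins): it fits a small network with a BDeu prior, scores each test row with the inferred probability of the positive class, and computes the ROC curve and AUC with scikit-learn.

```python
import numpy as np
import pandas as pd
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import BayesianEstimator
from pgmpy.inference import VariableElimination
from sklearn.metrics import roc_curve, auc

# Synthetic data for a hypothetical two-node network: Rain -> WetGrass.
rng = np.random.default_rng(0)
rain = rng.integers(0, 2, size=500)
wet = (rng.random(500) < np.where(rain == 1, 0.9, 0.2)).astype(int)
df = pd.DataFrame({"Rain": rain, "WetGrass": wet})
train, test = df.iloc[:400], df.iloc[400:]

# Fit the parameters with a BDeu prior (MAP-like estimation).
model = BayesianNetwork([("Rain", "WetGrass")])
model.fit(train, estimator=BayesianEstimator,
          prior_type="BDeu", equivalent_sample_size=10)

# Score each test row with P(WetGrass = 1 | evidence).
infer = VariableElimination(model)
scores = [
    infer.query(["WetGrass"], evidence={"Rain": int(r)},
                show_progress=False).values[1]
    for r in test["Rain"]
]

# Build the ROC curve and compute the AUC.
fpr, tpr, _ = roc_curve(test["WetGrass"], scores)
print("AUC:", auc(fpr, tpr))
```

The same loop can be repeated with a MaximumLikelihoodEstimator fit, or with different prior types and equivalent sample sizes, to compare the resulting ROC curves side by side.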