Computer Vision: Models, Learning, and Inference (2012)
Part I
Probability
Chapter 4
Fitting probability models
This chapter concerns fitting probability models to data $\{x_i\}_{i=1}^I$. This process is referred to as learning because we learn about the parameters θ of the model.¹ It also concerns calculating the probability of a new datum x* under the resulting model. This is known as evaluating the predictive distribution. We consider three methods: maximum likelihood, maximum a posteriori, and the Bayesian approach.
4.1 Maximum likelihood
As the name suggests, the maximum likelihood (ML) method finds the set of parameters under which the data $\{x_i\}_{i=1}^I$ are most likely. To calculate the likelihood function Pr(x_i|θ) at a single data point x_i, we simply evaluate the probability density function at x_i. Assuming each data point was drawn independently from the distribution, the likelihood function Pr(x_{1…I}|θ) for a set of points is the product of the individual likelihoods. Hence, the ML estimate of the parameters is

$$\hat{\theta} = \underset{\theta}{\operatorname{argmax}}\left[Pr(x_{1\ldots I}|\theta)\right] = \underset{\theta}{\operatorname{argmax}}\left[\prod_{i=1}^{I} Pr(x_i|\theta)\right], \tag{4.1}$$

where $\operatorname{argmax}_{\theta} f[\theta]$ returns the value of θ that maximizes the argument f[θ].
To evaluate the predictive distribution for a new data point x* (compute the probability that x* belongs to the fitted model), we simply evaluate the probability density function $Pr(x^*|\hat{\theta})$ using the ML fitted parameters $\hat{\theta}$.
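As a minimal illustrative sketch (assuming NumPy and SciPy are available; the function and variable names are hypothetical), the same recipe can be carried out numerically for a univariate normal model: take logs, sum over the data, and hand the negative log likelihood to a general-purpose optimizer. For the normal distribution this has the closed-form solution derived in Section 4.4.1; the numerical route is useful when no closed form exists.

```python
import numpy as np
from scipy import optimize, stats

# Synthetic data drawn from a normal with mu = 1, sigma^2 = 1 (as in Section 4.4).
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=1.0, size=10)

def negative_log_likelihood(params, data):
    """Negative log of prod_i Pr(x_i | mu, sigma); minimizing this maximizes the likelihood."""
    mu, log_sigma = params                 # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))

result = optimize.minimize(negative_log_likelihood, x0=[0.0, 0.0], args=(x,))
mu_ml, sigma_ml = result.x[0], np.exp(result.x[1])

# Predictive density: evaluate the pdf at a new datum x* with the fitted parameters.
x_star = 0.5
print(mu_ml, sigma_ml, stats.norm.pdf(x_star, loc=mu_ml, scale=sigma_ml))
```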
4.2 Maximum a posteriori
In maximum a posteriori (MAP) fitting, we introduce prior information about the parameters θ. From previous experience we may know something about the possible parameter values. For example, in a time-sequence the values of the parameters at time t tell us a lot about the possible values at time t+1, and this information would be encoded in the prior distribution.
As the name suggests, maximum a posteriori estimation maximizes the posterior probability Pr(θ|x_{1…I}) of the parameters:

$$\begin{aligned}
\hat{\theta} &= \underset{\theta}{\operatorname{argmax}}\left[Pr(\theta|x_{1\ldots I})\right]\\
&= \underset{\theta}{\operatorname{argmax}}\left[\frac{Pr(x_{1\ldots I}|\theta)\,Pr(\theta)}{Pr(x_{1\ldots I})}\right]\\
&= \underset{\theta}{\operatorname{argmax}}\left[\frac{\prod_{i=1}^{I}Pr(x_i|\theta)\,Pr(\theta)}{Pr(x_{1\ldots I})}\right],
\end{aligned}$$

where we have used Bayes' rule between the first two lines and subsequently assumed independence. In fact, we can discard the denominator as it is constant with respect to the parameters and so does not affect the position of the maximum, and we get

$$\hat{\theta} = \underset{\theta}{\operatorname{argmax}}\left[\prod_{i=1}^{I}Pr(x_i|\theta)\,Pr(\theta)\right].$$
Comparing this to the maximum likelihood criterion (Equation 4.1), we see that it is identical except for the additional prior term; maximum likelihood is a special case of maximum a posteriori where the prior is uninformative.
The predictive density (probability of a new datum x* under the fitted model) is again calculated by evaluating the pdf $Pr(x^*|\hat{\theta})$ using the MAP parameters $\hat{\theta}$.
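A minimal numerical sketch of MAP fitting along the same lines (again assuming SciPy; purely for illustration we place a normal prior on the mean alone, rather than the conjugate prior used in Section 4.4.2):

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=1.0, size=10)

def negative_log_posterior(params, data):
    """Negative of (log likelihood + log prior); minimizing this gives the MAP estimate."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    loglik = np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))
    # Illustrative prior on mu only: a belief that mu lies near zero.
    logprior = stats.norm.logpdf(mu, loc=0.0, scale=1.0)
    return -(loglik + logprior)

result = optimize.minimize(negative_log_posterior, x0=[0.0, 0.0], args=(x,))
print(result.x[0], np.exp(result.x[1]))  # MAP estimates of mu and sigma
```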
4.3 The Bayesian approach
In the Bayesian approach we stop trying to estimate single fixed values (point estimates) of the parameters θ and instead acknowledge the obvious: there may be many values of the parameters that are compatible with the data. We compute a probability distribution Pr(θ|x_{1…I}) over the parameters θ based on the data $\{x_i\}_{i=1}^I$ using Bayes' rule, so that

$$Pr(\theta|x_{1\ldots I}) = \frac{\prod_{i=1}^{I}Pr(x_i|\theta)\,Pr(\theta)}{Pr(x_{1\ldots I})}.$$
Evaluating the predictive distribution is more difficult in the Bayesian case because we have not estimated a single model but have instead found a probability distribution over possible models. Hence, we calculate

$$Pr(x^*|x_{1\ldots I}) = \int Pr(x^*|\theta)\,Pr(\theta|x_{1\ldots I})\,d\theta,$$
which can be interpreted as follows: the term Pr(x*|θ) is the prediction for a given value of θ. So, the integral can be thought of as a weighted sum of the predictions given by different parameters θ, where the weighting is determined by the posterior probability distribution Pr(θ|x1…I) over the parameters (representing our confidence that different parameter values are correct).
The predictive density calculations for the Bayesian, MAP, and ML cases can be unified if we consider the ML and MAP estimates to be special probability distributions over the parameters where all of the density is concentrated at $\hat{\theta}$. More formally, we can consider them as delta functions centered at $\hat{\theta}$. A delta function δ[z] is a function that integrates to one and that returns zero everywhere except at z = 0. We can now write

$$Pr(x^*|x_{1\ldots I}) = \int Pr(x^*|\theta)\,\delta[\theta-\hat{\theta}]\,d\theta = Pr(x^*|\hat{\theta}),$$
which is exactly the calculation we originally prescribed: we simply evaluate the probability of the data under the model with the estimated parameters.
4.4 Worked example 1: Univariate normal
To illustrate the above ideas, we will consider fitting a univariate normal model to scalar data $\{x_i\}_{i=1}^I$. Recall that the univariate normal model has pdf

$$Pr(x|\mu,\sigma^2) = \text{Norm}_x[\mu,\sigma^2] = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$$

and has two parameters, the mean μ and the variance σ². Let us generate I independent data points $\{x_i\}_{i=1}^I$ from a univariate normal with μ = 1 and σ² = 1. Our goal is to re-estimate these parameters from the data.
4.4.1 Maximum likelihood estimation
The likelihood Pr(x_{1…I}|μ, σ²) of the parameters {μ, σ²} for observed data $\{x_i\}_{i=1}^I$ is computed by evaluating the pdf for each data point separately and taking the product:

$$Pr(x_{1\ldots I}|\mu,\sigma^2) = \prod_{i=1}^{I}\text{Norm}_{x_i}[\mu,\sigma^2]. \tag{4.8}$$
Figure 4.1 Maximum likelihood fitting. The likelihood of the parameters for a single datapoint is the height of the pdf evaluated at that point (blue vertical lines). The likelihood of a set of independently sampled data is the product of the individual likelihoods. a) The likelihood for this normal distribution is low because the large variance means the height of the pdf is low everywhere. b) The likelihood for this normal distribution is even lower as the left-most datum is very unlikely under the model. c) The maximum likelihood solution is the set of parameters for which the likelihood is maximized.
Figure 4.2 The likelihood function for a fixed set of observed data is a function of the mean μ and variance σ² parameters. The plot shows that there are many parameter settings which might plausibly be responsible for the ten data points from Figure 4.1. A sensible choice for the “best” parameter setting is the maximum likelihood solution (green cross), which corresponds to the maximum of this function.
Obviously, the likelihood for some sets of parameters {μ, σ²} will be higher than others (Figure 4.1), and it is possible to visualize this as a 2D function of the mean μ and variance σ² (Figure 4.2). The maximum likelihood solution $\hat{\mu},\hat{\sigma}^2$ will occur at the peak of this surface so that

$$\hat{\mu},\hat{\sigma}^2 = \underset{\mu,\sigma^2}{\operatorname{argmax}}\left[\prod_{i=1}^{I}\text{Norm}_{x_i}[\mu,\sigma^2]\right].$$
In principle, we could maximize this by taking the derivative of Equation 4.8 with respect to μ and σ², equating the result to zero, and solving. In practice, however, the resulting equations are messy. To simplify things, we work instead with the logarithm of this expression (the log likelihood, L). Since the logarithm is a monotonic function (Figure 4.3), the position of the maximum in the transformed function remains the same. Algebraically, the logarithm turns the product of the likelihoods of the individual data points into a sum and so decouples the contribution of each. The ML parameters can now be calculated as

$$\hat{\mu},\hat{\sigma}^2 = \underset{\mu,\sigma^2}{\operatorname{argmax}}\left[\sum_{i=1}^{I}\log\text{Norm}_{x_i}[\mu,\sigma^2]\right]
= \underset{\mu,\sigma^2}{\operatorname{argmax}}\left[-0.5I\log 2\pi - 0.5I\log\sigma^2 - 0.5\sum_{i=1}^{I}\frac{(x_i-\mu)^2}{\sigma^2}\right].$$
To maximize, we differentiate this log likelihood L with respect to μ and equate the result to zero:

$$\frac{\partial L}{\partial\mu} = \sum_{i=1}^{I}\frac{x_i-\mu}{\sigma^2} = 0,$$
Figure 4.3 The logarithm is a monotonic transformation. If one point is higher than another, then it will also be higher after transformation by the logarithmic function. It follows that if we transform the surface in Figure 4.2 through the logarithmic function, the maximum will remain in the same position.
and rearranging, we see that

$$\hat{\mu} = \frac{1}{I}\sum_{i=1}^{I}x_i.$$
By a similar process, the expression for the variance can be shown to be

$$\hat{\sigma}^2 = \frac{1}{I}\sum_{i=1}^{I}(x_i-\hat{\mu})^2.$$
These expressions are hardly surprising, but the same idea can be used to estimate parameters in other distributions where the results are less familiar.
Figure 4.1 shows a set of data points and three possible fits to the data. The mean of the maximum likelihood fit is the mean of the data. The ML fit is neither too narrow (giving very low probabilities to the furthest data points from the mean) nor too wide (resulting in a flat distribution and giving low probability to all points).
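A short sketch of these closed-form estimates (assuming NumPy and SciPy; the variable names are hypothetical):

```python
import numpy as np
from scipy import stats

# Data assumed drawn from a normal with mu = 1, sigma^2 = 1 (Section 4.4).
rng = np.random.default_rng(1)
x = rng.normal(loc=1.0, scale=1.0, size=10)

# Closed-form maximum likelihood estimates (note the 1/I normalizer for the variance).
mu_ml = np.mean(x)
var_ml = np.mean((x - mu_ml) ** 2)

# Predictive density: evaluate the fitted pdf at a new datum x*.
x_star = 0.5
pred = stats.norm.pdf(x_star, loc=mu_ml, scale=np.sqrt(var_ml))
print(mu_ml, var_ml, pred)
```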
Least squares fitting
As an aside, we note that many texts discuss fitting in terms of least squares. Consider fitting just the mean parameter μ of the normal distribution using maximum likelihood. Manipulating the cost function, we find

$$\begin{aligned}
\hat{\mu} &= \underset{\mu}{\operatorname{argmax}}\left[\sum_{i=1}^{I}\log\text{Norm}_{x_i}[\mu,\sigma^2]\right]\\
&= \underset{\mu}{\operatorname{argmax}}\left[-\sum_{i=1}^{I}(x_i-\mu)^2\right]
= \underset{\mu}{\operatorname{argmin}}\left[\sum_{i=1}^{I}(x_i-\mu)^2\right],
\end{aligned}$$

a formulation in which we minimize the sum of squares. In other words, least squares fitting is equivalent to fitting the mean parameter of a normal distribution using the maximum likelihood method.
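A quick numerical check of this equivalence, as an illustrative sketch: minimizing the sum of squared residuals recovers the sample mean.

```python
import numpy as np
from scipy import optimize

x = np.array([0.7, 1.3, 0.2, 1.9, 1.1])   # any small dataset

# Least squares: minimize sum_i (x_i - mu)^2 over mu.
res = optimize.minimize_scalar(lambda mu: np.sum((x - mu) ** 2))

print(res.x, np.mean(x))   # the minimizer coincides with the data mean
```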
4.4.2 Maximum a posteriori estimation
Returning to the main thread, we now demonstrate maximum a posteriori fitting of the parameters of the normal distribution. The cost function becomes

$$\hat{\mu},\hat{\sigma}^2 = \underset{\mu,\sigma^2}{\operatorname{argmax}}\left[\prod_{i=1}^{I}Pr(x_i|\mu,\sigma^2)\,Pr(\mu,\sigma^2)\right], \tag{4.15}$$
where we have chosen a normal-scaled inverse gamma prior with parameters α, β, γ, δ (Figure 4.4), as this is conjugate to the normal distribution. The expression for the prior is

$$Pr(\mu,\sigma^2) = \text{NormInvGam}_{\mu,\sigma^2}[\alpha,\beta,\gamma,\delta]
= \frac{\sqrt{\gamma}}{\sigma\sqrt{2\pi}}\,\frac{\beta^{\alpha}}{\Gamma[\alpha]}\left(\frac{1}{\sigma^2}\right)^{\alpha+1}\exp\left[-\frac{2\beta+\gamma(\delta-\mu)^2}{2\sigma^2}\right].$$
The posterior distribution is proportional to the product of the likelihood and the prior (Figure 4.5), and has the highest density in regions that both agree with the data and were a priori plausible.
As in the ML case, it is easier to maximize the logarithm of Equation 4.15:

$$\hat{\mu},\hat{\sigma}^2 = \underset{\mu,\sigma^2}{\operatorname{argmax}}\left[\sum_{i=1}^{I}\log Pr(x_i|\mu,\sigma^2) + \log Pr(\mu,\sigma^2)\right].$$
Figure 4.4 Prior over normal parameters. a) A normal inverse gamma with α, β, γ = 1 and δ = 0 gives a broad prior distribution over univariate normal parameters. The magenta cross indicates the peak of this prior distribution. The blue crosses are five samples randomly drawn from the distribution. b) The peak and the samples can be visualized by plotting the normal distributions that they represent.
Figure 4.5 MAP inference for normal parameters. a) The likelihood function is multiplied by b) the prior probability to give a new function c) that is proportional to the posterior distribution. The maximum a posteriori (MAP) solution (cyan cross) is found at the peak of the posterior distribution. It lies between the maximum likelihood (ML) solution (green cross) and the maximum of the prior distribution (MP, magenta cross).
To find the MAP parameters, we substitute in these expressions, differentiate with respect to μ and σ², equate to zero, and rearrange to give

$$\hat{\mu} = \frac{\sum_{i=1}^{I}x_i + \gamma\delta}{I+\gamma}, \qquad
\hat{\sigma}^2 = \frac{\sum_{i=1}^{I}(x_i-\hat{\mu})^2 + 2\beta + \gamma(\delta-\hat{\mu})^2}{I + 3 + 2\alpha}.$$
The formula for the mean can be more easily understood if we write it as

$$\hat{\mu} = \frac{I\bar{x} + \gamma\delta}{I+\gamma},$$

where $\bar{x}$ is the mean of the observed data.
This is a weighted sum of two terms. The first term is the data mean and is weighted by the number of training examples I. The second term is δ, the value of μ favored by the prior, and is weighted by γ.
This gives some insight into the behavior of the MAP estimate (Figure 4.6). With a large amount of data, the first term dominates, and the MAP estimate is very close to the data mean (and the ML estimate). With intermediate amounts of data, the estimate $\hat{\mu}$ is a weighted sum of the prediction from the data and the prediction from the prior. With no data at all, the estimate is governed entirely by the prior. The hyperparameter (parameter of the prior) γ controls the concentration of the prior with respect to μ and hence determines the extent of its influence. Similar conclusions can be drawn about the MAP estimate of the variance.
Where there is only a single data point (Figure 4.6e-f), the data tell us nothing about the variance, and the maximum likelihood estimate $\hat{\sigma}^2$ is actually zero; the best fit is an infinitely thin and infinitely tall normal distribution centered on the one data point. This is unrealistic, not least because it accords the datum an infinite likelihood. However, MAP estimation remains valid because the prior ensures that sensible parameter values are chosen.
Figure 4.6 Maximum a posteriori estimation. a) Posterior distribution over parameters μ and σ². MAP solution (cyan cross) lies between ML (green cross) and the peak of the prior (purple cross). b) Normal distributions corresponding to MAP solution, ML solution and peak of prior. c-d) With fewer data points, the prior has a greater effect on the final solution. e-f) With only one data point, the maximum likelihood solution cannot be computed (you cannot calculate the variance of a single point). However, the MAP solution can still be calculated.
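A compact sketch of these MAP updates (the variable names are our own; the hyperparameters follow the α, β, γ, δ parameterization above):

```python
import numpy as np

def map_normal(x, alpha=1.0, beta=1.0, gamma=1.0, delta=0.0):
    """MAP estimates of a univariate normal's mean and variance under a
    normal-scaled inverse gamma prior with hyperparameters alpha, beta, gamma, delta."""
    x = np.asarray(x, dtype=float)
    I = x.size
    mu_map = (x.sum() + gamma * delta) / (I + gamma)
    var_map = (np.sum((x - mu_map) ** 2) + 2 * beta
               + gamma * (delta - mu_map) ** 2) / (I + 3 + 2 * alpha)
    return mu_map, var_map

# Even with a single datum the variance estimate stays positive, unlike the ML estimate.
print(map_normal([1.3]))
```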
4.4.3 The Bayesian approach
In the Bayesian approach, we calculate a posterior distribution Pr(μ, σ²|x_{1…I}) over possible parameter values using Bayes' rule,

$$Pr(\mu,\sigma^2|x_{1\ldots I}) = \frac{\prod_{i=1}^{I}Pr(x_i|\mu,\sigma^2)\,Pr(\mu,\sigma^2)}{Pr(x_{1\ldots I})}
= \frac{\kappa\,\text{NormInvGam}_{\mu,\sigma^2}[\tilde{\alpha},\tilde{\beta},\tilde{\gamma},\tilde{\delta}]}{Pr(x_{1\ldots I})}, \tag{4.20}$$
where we have used the conjugate relationship between likelihood and prior (Section 3.9) and κ is the associated constant. The product of the normal likelihood and normal-scaled inverse gamma prior creates a posterior over μ and σ² that is a new normal-scaled inverse gamma distribution, which can be shown to have parameters

$$\tilde{\alpha} = \alpha + \frac{I}{2}, \qquad \tilde{\gamma} = \gamma + I, \qquad
\tilde{\delta} = \frac{\gamma\delta + \sum_{i=1}^{I}x_i}{\gamma + I}, \qquad
\tilde{\beta} = \beta + \frac{\sum_{i=1}^{I}x_i^2}{2} + \frac{\gamma\delta^2}{2} - \frac{\left(\gamma\delta + \sum_{i=1}^{I}x_i\right)^2}{2(\gamma+I)}.$$
Figure 4.7 Bayesian predictions. a) Posterior probability distribution over parameters. b) Samples from posterior probability distribution correspond to normal distributions. c) The predictive distribution for the Bayesian case is the average of an infinite set of samples. Alternatively, we can think of choosing the parameters from a uniform distribution and computing a weighted average where the weights correspond to the posterior distribution.
Note that the posterior (the left-hand side of Equation 4.20) must be a valid probability distribution and integrate to one, so the constant κ from the conjugate product and the denominator on the right-hand side must exactly cancel to give

$$Pr(\mu,\sigma^2|x_{1\ldots I}) = \text{NormInvGam}_{\mu,\sigma^2}[\tilde{\alpha},\tilde{\beta},\tilde{\gamma},\tilde{\delta}].$$
Now we see the major advantage of using a conjugate prior: we are guaranteed a closed form expression for the posterior distribution over the parameters.
This posterior distribution represents the relative plausibility of various parameter settings μ and σ² having created the data. At the peak of the distribution is the MAP estimate, but there are many other plausible configurations (Figure 4.6).
When data are plentiful (Figure 4.6a), the parameters are well specified, and the probability distribution is concentrated. In this case, placing all of the probability mass at the MAP estimate is a good approximation to the posterior. However, when data are scarce (Figure 4.6c), many possible parameters might have explained the data and the posterior is broad. In this case approximation with a point mass is inadequate.
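A compact sketch of this conjugate update (following the expressions for α̃, β̃, γ̃, δ̃ above; variable names are our own):

```python
import numpy as np

def posterior_nig(x, alpha=1.0, beta=1.0, gamma=1.0, delta=0.0):
    """Posterior hyperparameters of the normal-scaled inverse gamma distribution
    after observing data x, using the conjugate update given above."""
    x = np.asarray(x, dtype=float)
    I = x.size
    alpha_t = alpha + I / 2.0
    gamma_t = gamma + I
    delta_t = (gamma * delta + x.sum()) / gamma_t
    beta_t = (beta + 0.5 * np.sum(x ** 2) + 0.5 * gamma * delta ** 2
              - (gamma * delta + x.sum()) ** 2 / (2.0 * gamma_t))
    return alpha_t, beta_t, gamma_t, delta_t

print(posterior_nig(np.random.default_rng(2).normal(1.0, 1.0, size=50)))
```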
Predictive density
For the maximum likelihood and MAP estimates, we evaluate the predictive density (the probability that a new data point x* belongs to the same model) by simply evaluating the normal pdf with the estimated parameters. For the Bayesian case, we compute a weighted average of the predictions for each possible parameter set, where the weighting is given by the posterior distribution over the parameters (Figures 4.6a-c and 4.7):

$$\begin{aligned}
Pr(x^*|x_{1\ldots I}) &= \iint Pr(x^*|\mu,\sigma^2)\,Pr(\mu,\sigma^2|x_{1\ldots I})\,d\mu\,d\sigma^2\\
&= \iint \text{Norm}_{x^*}[\mu,\sigma^2]\,\text{NormInvGam}_{\mu,\sigma^2}[\tilde{\alpha},\tilde{\beta},\tilde{\gamma},\tilde{\delta}]\,d\mu\,d\sigma^2\\
&= \iint \kappa\,\text{NormInvGam}_{\mu,\sigma^2}[\breve{\alpha},\breve{\beta},\breve{\gamma},\breve{\delta}]\,d\mu\,d\sigma^2.
\end{aligned}$$
Here we have used the conjugate relation for a second time. The integrand is now a constant (with respect to μ and σ²) multiplied by a probability distribution. Taking the constant outside the integral, we get

$$Pr(x^*|x_{1\ldots I}) = \kappa \iint \text{NormInvGam}_{\mu,\sigma^2}[\breve{\alpha},\breve{\beta},\breve{\gamma},\breve{\delta}]\,d\mu\,d\sigma^2 = \kappa,$$
Figure 4.8 a-c) Predictive densities for the MAP and Bayesian approaches with 50, 5, and 1 training examples. As the amount of training data decreases, the Bayesian prediction becomes less certain, but the MAP prediction is erroneously overconfident. d-f) This effect is even clearer on a log scale.
which follows because the integral of a pdf is one. It can be shown that the constant is given by

$$\kappa = \frac{1}{\sqrt{2\pi}}\,\frac{\sqrt{\tilde{\gamma}}\,\tilde{\beta}^{\tilde{\alpha}}}{\sqrt{\breve{\gamma}}\,\breve{\beta}^{\breve{\alpha}}}\,\frac{\Gamma[\breve{\alpha}]}{\Gamma[\tilde{\alpha}]},$$

where

$$\breve{\alpha} = \tilde{\alpha} + \frac{1}{2}, \qquad \breve{\gamma} = \tilde{\gamma} + 1, \qquad
\breve{\delta} = \frac{\tilde{\gamma}\tilde{\delta} + x^*}{\tilde{\gamma}+1}, \qquad
\breve{\beta} = \frac{(x^*)^2}{2} + \tilde{\beta} + \frac{\tilde{\gamma}\tilde{\delta}^2}{2} - \frac{(\tilde{\gamma}\tilde{\delta} + x^*)^2}{2(\tilde{\gamma}+1)}.$$
Here, we see the second advantage of using the conjugate prior; it means that the integral can be computed, and so we get a nice closed form expression for the predictive density.
Figure 4.8 shows the predictive distribution for the Bayesian and MAP cases, for varying amounts of training data. With plenty of training data, they are quite similar but as the amount of data decreases, the Bayesian predictive distribution has a significantly longer tail. This is typical of Bayesian solutions: they are more moderate (less certain) in their predictions. In the MAP case, erroneously committing to a single estimate of μ and σ2 causes overconfidence in our future predictions.
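To make this comparison concrete without relying on the closed-form constant, the following sketch approximates the Bayesian predictive density by a weighted average over a grid of μ and σ² values, with weights given by the posterior, and compares it with the MAP plug-in prediction. The prior hyperparameters and grid ranges are illustrative choices.

```python
import numpy as np
from scipy import stats

# Data and a broad prior over (mu, sigma^2); alpha, beta, gamma, delta as in Figure 4.4.
rng = np.random.default_rng(3)
x = rng.normal(1.0, 1.0, size=5)
alpha, beta, gamma, delta = 1.0, 1.0, 1.0, 0.0

# Grid over parameters.
mu = np.linspace(-2.0, 4.0, 200)
var = np.linspace(0.05, 6.0, 200)
M, V = np.meshgrid(mu, var)

# Unnormalized log posterior = log likelihood + log prior (constants dropped;
# the result is renormalized over the grid below).
loglik = np.sum(stats.norm.logpdf(x[:, None, None], loc=M, scale=np.sqrt(V)), axis=0)
logprior = (0.5 * np.log(gamma) - 0.5 * np.log(2 * np.pi * V) - (alpha + 1) * np.log(V)
            - (2 * beta + gamma * (delta - M) ** 2) / (2 * V))
logpost = loglik + logprior
post = np.exp(logpost - logpost.max())
post /= post.sum()

# Bayesian predictive: weighted average of Norm_x*[mu, var] under the posterior weights.
x_star = np.linspace(-3.0, 5.0, 9)
pred_bayes = np.array([np.sum(post * stats.norm.pdf(xs, loc=M, scale=np.sqrt(V)))
                       for xs in x_star])

# MAP plug-in prediction for comparison: evaluate the pdf at the posterior mode on the grid.
i, j = np.unravel_index(np.argmax(post), post.shape)
pred_map = stats.norm.pdf(x_star, loc=M[i, j], scale=np.sqrt(V[i, j]))

print(np.round(pred_bayes, 3))
print(np.round(pred_map, 3))
```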
Figure 4.9 a) Categorical probability distribution over six discrete values with parameters $\{\lambda_k\}_{k=1}^6$ where $\sum_{k=1}^6\lambda_k = 1$. This could be the relative probability of a biased die landing on its six sides. b) Fifteen observations $\{x_i\}_{i=1}^I$ randomly sampled from this distribution. We denote the number of times category k was observed by $N_k$, so that here the total number of observations is $\sum_{k=1}^6 N_k = 15$.
4.5 Worked example 2: Categorical distribution
As a second example, we consider discrete data $\{x_i\}_{i=1}^I$ where $x_i \in \{1,2,\ldots,6\}$ (Figure 4.9). This could represent observed rolls of a die with unknown bias. We will describe the data using a categorical distribution (a normalized histogram), where

$$Pr(x = k|\lambda_{1\ldots 6}) = \lambda_k.$$
For the ML and MAP techniques, we estimate the six parameters $\{\lambda_k\}_{k=1}^6$. For the Bayesian approach, we compute a probability distribution over the parameters.
4.5.1 Maximum likelihood
To find the maximum likelihood solution, we maximize the product of the likelihoods for each individual data point with respect to the parameters λ_{1…6}:

$$\hat{\lambda}_{1\ldots 6} = \underset{\lambda_{1\ldots 6}}{\operatorname{argmax}}\left[\prod_{i=1}^{I}Pr(x_i|\lambda_{1\ldots 6})\right]
= \underset{\lambda_{1\ldots 6}}{\operatorname{argmax}}\left[\prod_{k=1}^{6}\lambda_k^{N_k}\right],$$

where N_k is the total number of times we observed bin k in the training data. As before, it is easier to maximize the log probability, and we use the criterion

$$L = \sum_{k=1}^{6}N_k\log\lambda_k + \nu\left(\sum_{k=1}^{6}\lambda_k - 1\right),$$

where the second term uses the Lagrange multiplier ν to enforce the constraint $\sum_{k=1}^{6}\lambda_k = 1$ on the parameters. We differentiate L with respect to λ_k and ν, set the derivatives equal to zero, and solve for λ_k to obtain

$$\hat{\lambda}_k = \frac{N_k}{\sum_{m=1}^{6}N_m}.$$

In other words, $\hat{\lambda}_k$ is the proportion of times that we observed bin k.
4.5.2 Maximum a posteriori
To find the maximum a posteriori solution, we need to define a prior. We choose the Dirichlet distribution, as it is conjugate to the categorical likelihood. This prior over the six categorical parameters is hard to visualize, but samples can be drawn from it and examined (Figure 4.10a-e). The MAP solution is given by

$$\hat{\lambda}_{1\ldots 6} = \underset{\lambda_{1\ldots 6}}{\operatorname{argmax}}\left[\prod_{i=1}^{I}Pr(x_i|\lambda_{1\ldots 6})\,Pr(\lambda_{1\ldots 6}|\alpha_{1\ldots 6})\right],$$
which is again subject to the constraint that $\sum_{k=1}^{6}\lambda_k = 1$. As in the maximum likelihood case, this constraint is enforced using a Lagrange multiplier. The MAP estimate of the parameters can be shown to be

$$\hat{\lambda}_k = \frac{N_k + \alpha_k - 1}{\sum_{m=1}^{6}(N_m + \alpha_m - 1)},$$
where Nk is the number of times that observation k occurred in the training data. Note that if all the values αk are set to one, the prior is uniform and this expression reverts to the maximum likelihood solution (Equation 4.30).
4.5.3 Bayesian approach
In the Bayesian approach, we calculate a posterior distribution over the parameters:

$$Pr(\lambda_{1\ldots 6}|x_{1\ldots I}) = \frac{\prod_{i=1}^{I}Pr(x_i|\lambda_{1\ldots 6})\,Pr(\lambda_{1\ldots 6}|\alpha_{1\ldots 6})}{Pr(x_{1\ldots I})}
= \frac{\kappa\,\text{Dir}_{\lambda_{1\ldots 6}}[\tilde{\alpha}_{1\ldots 6}]}{Pr(x_{1\ldots I})},$$
where $\tilde{\alpha}_k = N_k + \alpha_k$. We have again exploited the conjugate relationship to yield a posterior distribution with the same form as the prior. The constant κ must again cancel with the denominator to ensure a valid probability distribution on the left-hand side. Samples from this distribution are shown in Figure 4.10f-j.
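A minimal sketch of this update: the posterior Dirichlet parameters are the prior parameters plus the observed counts, and samples analogous to Figure 4.10f-j can be drawn with NumPy. The particular counts below are hypothetical, chosen only to total 15 with category four unobserved.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical observed counts N_k for the six categories (15 observations in total).
counts = np.array([2, 2, 6, 0, 2, 3])
alpha_prior = np.ones(6)                     # uniform Dirichlet prior, alpha_k = 1

alpha_post = alpha_prior + counts            # conjugate update: alpha~_k = N_k + alpha_k
samples = rng.dirichlet(alpha_post, size=5)  # five plausible categorical distributions

print(alpha_post)
print(np.round(samples, 2))
```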
Predictive density
For the ML and MAP estimates we compute the predictive density (probability that a new data point x* belongs to the same model) by simply evaluating the categorical pdf with the estimated parameters. With the uniform prior ($\alpha_{1\ldots 6} = 1$) the MAP and ML predictions are identical (Figure 4.11a) and both are exactly proportional to the frequencies of the observed data.
Figure 4.10 a-e) Five samples drawn from Dirichlet prior with hyperparameters $\alpha_{1\ldots 6} = 1$. This defines a uniform prior, so each sample looks like a random unstructured probability distribution. f-j) Five samples from Dirichlet posterior. The distribution favors histograms where bin three is larger and bin four is small, as suggested by the data.
For the Bayesian case, we compute a weighted average of the predictions for each possible parameter set, where the weighting is given by the posterior distribution over the parameters, so that

$$Pr(x^*=k|x_{1\ldots I}) = \int Pr(x^*=k|\lambda_{1\ldots 6})\,Pr(\lambda_{1\ldots 6}|x_{1\ldots I})\,d\lambda_{1\ldots 6}
= \int \lambda_k\,\text{Dir}_{\lambda_{1\ldots 6}}[\tilde{\alpha}_{1\ldots 6}]\,d\lambda_{1\ldots 6}
= \int \kappa\,\text{Dir}_{\lambda_{1\ldots 6}}[\breve{\alpha}_{1\ldots 6}]\,d\lambda_{1\ldots 6}.$$
Here, we have again exploited the conjugate relationship to yield a constant multiplied by a probability distribution, and the integral is simply that constant, as the integral of a pdf is one. For this case, it can be shown that

$$Pr(x^*=k|x_{1\ldots I}) = \kappa = \frac{N_k + \alpha_k}{\sum_{m=1}^{6}(N_m + \alpha_m)}.$$
This is illustrated in Figure 4.11b. It is notable that once more the Bayesian predictive density is less confident than the ML/MAP solutions. In particular, it does not allot zero probability to observing x* = 4 despite the fact that this value was never observed in the training data. This is sensible; just because we have not drawn a 4 in 15 observations does not imply that it is inconceivable that we will ever see one. We may have just been unlucky. The Bayesian approach takes this into account and allots this category a small amount of probability.
Figure 4.11 Predictive distributions with $\alpha_{1\ldots 6} = 1$ for a) maximum likelihood / maximum a posteriori approaches and b) Bayesian approach. The ML/MAP approaches predict the same distribution that exactly follows the data frequencies. The Bayesian approach predicts a more moderate distribution and allots some probability to the case x = 4 despite having seen no training examples in this category.
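The following sketch compares the three predictive distributions for the die example; the counts are hypothetical but consistent with the situation described above (15 rolls, category four never observed).

```python
import numpy as np

# Hypothetical counts N_k for 15 die rolls in which a four was never observed.
counts = np.array([2, 2, 6, 0, 2, 3])
alpha = np.ones(6)                      # uniform Dirichlet prior, alpha_k = 1

# Maximum likelihood: relative frequencies.
pred_ml = counts / counts.sum()

# MAP with a Dirichlet prior: (N_k + alpha_k - 1) / sum_m (N_m + alpha_m - 1).
pred_map = (counts + alpha - 1) / np.sum(counts + alpha - 1)

# Bayesian predictive: (N_k + alpha_k) / sum_m (N_m + alpha_m).
pred_bayes = (counts + alpha) / np.sum(counts + alpha)

print(np.round(pred_ml, 3))      # ML and MAP coincide for the uniform prior ...
print(np.round(pred_map, 3))
print(np.round(pred_bayes, 3))   # ... but the Bayesian prediction gives x* = 4 nonzero probability
```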
Summary
We presented three ways to fit a probability distribution to data and to predict the probability of new points. Of the three methods discussed, the Bayesian approach is the most desirable: it does not require a point estimate of the (uncertain) parameters, and so no errors are introduced by committing to an inaccurate point estimate.
However, the Bayesian approach is only tractable when we have a conjugate prior, which makes it easy to calculate the posterior distribution over the parameters Pr(θ|x1…I) and also to evaluate the integral in the predictive density. When this is not the case, we will usually have to rely on maximum a posteriori estimates. Maximum likelihood estimates can be thought of as a special case of maximum a posteriori estimates in which the prior is uninformative.
Notes
For more information about the Bayesian approach to fitting distributions, consult Chapter 3 of Gelman et al. (2004). More information about Bayesian model selection (Problem 4.6), including an impassioned argument for its superiority as a method of hypothesis testing, can be found in MacKay (2003).
Problems
4.1 Show that the maximum likelihood solution for the variance σ² of the normal distribution is given by

$$\hat{\sigma}^2 = \frac{1}{I}\sum_{i=1}^{I}(x_i-\hat{\mu})^2.$$
4.2 Show that the MAP solutions for the mean μ and variance σ² of the normal distribution are given by

$$\hat{\mu} = \frac{\sum_{i=1}^{I}x_i + \gamma\delta}{I+\gamma}, \qquad
\hat{\sigma}^2 = \frac{\sum_{i=1}^{I}(x_i-\hat{\mu})^2 + 2\beta + \gamma(\delta-\hat{\mu})^2}{I + 3 + 2\alpha}$$

when we use the conjugate normal-scaled inverse gamma prior

$$Pr(\mu,\sigma^2) = \frac{\sqrt{\gamma}}{\sigma\sqrt{2\pi}}\,\frac{\beta^{\alpha}}{\Gamma[\alpha]}\left(\frac{1}{\sigma^2}\right)^{\alpha+1}\exp\left[-\frac{2\beta+\gamma(\delta-\mu)^2}{2\sigma^2}\right].$$
4.3 Taking Equation 4.29 as a starting point, show that the maximum likelihood parameters for the categorical distribution are given by

$$\hat{\lambda}_k = \frac{N_k}{\sum_{m=1}^{K}N_m},$$

where N_k is the number of times that category k was observed in the training data.
4.4 Show that the MAP estimate for the parameters $\{\lambda_k\}_{k=1}^{K}$ of the categorical distribution is given by

$$\hat{\lambda}_k = \frac{N_k + \alpha_k - 1}{\sum_{m=1}^{K}(N_m + \alpha_m - 1)}$$

under the assumption of a Dirichlet prior with hyperparameters $\{\alpha_k\}_{k=1}^{K}$. The terms N_k again indicate the number of times that category k was observed in the training data.
4.5 The denominator of Bayes' rule,

$$Pr(x_{1\ldots I}) = \int Pr(x_{1\ldots I}|\theta)\,Pr(\theta)\,d\theta,$$

is known as the evidence. It is a measure of how well the distribution fits, regardless of the particular values of the parameters. Find an expression for the evidence term for (i) the normal distribution and (ii) the categorical distribution, assuming conjugate priors in each case.
4.6 The evidence term can be used to compare models. Consider two sets of data $\mathcal{S}_1 = \{0.1, -0.5, 0.2, 0.7\}$ and $\mathcal{S}_2 = \{1.1, 2.0, 1.4, 2.3\}$. Let us pose the question of whether these two data sets came from the same normal distribution or from two different normal distributions.
Let model M1 denote the case where all of the data come from one normal distribution. The evidence for this model is

$$Pr(\mathcal{S}_1 \cup \mathcal{S}_2|M_1) = \int Pr(\mathcal{S}_1 \cup \mathcal{S}_2|\theta)\,Pr(\theta)\,d\theta,$$
where θ = {μ, σ²} contains the parameters of this normal distribution. Similarly, we will let M2 denote the case where the two sets of data belong to different normal distributions:

$$Pr(\mathcal{S}_1 \cup \mathcal{S}_2|M_2) = \int Pr(\mathcal{S}_1|\theta_1)\,Pr(\theta_1)\,d\theta_1\int Pr(\mathcal{S}_2|\theta_2)\,Pr(\theta_2)\,d\theta_2,$$

where θ₁ = {μ₁, σ₁²} and θ₂ = {μ₂, σ₂²}.
Now it is possible to compare the probability of the data under each of these two models using Bayes' rule:

$$Pr(M_1|\mathcal{S}_1 \cup \mathcal{S}_2) = \frac{Pr(\mathcal{S}_1 \cup \mathcal{S}_2|M_1)\,Pr(M_1)}{\sum_{m=1}^{2}Pr(\mathcal{S}_1 \cup \mathcal{S}_2|M_m)\,Pr(M_m)}.$$
Use this expression to compute the posterior probability that the two datasets came from the same underlying normal distribution. You may assume normal-scaled inverse gamma priors over θ, θ₁, and θ₂ with parameters α = 1, β = 1, γ = 1, δ = 0.
Note that this is (roughly) a Bayesian version of the two-sample t-test, but it is much neater: we get a posterior probability distribution over the two hypotheses rather than the potentially misleading p value of the t-test. The process of comparing evidence terms in this way is known as Bayesian model selection or the evidence framework. It is rather clever in that two normal distributions fitted with maximum likelihood will always explain the data at least as well as one; the additional parameters simply make the model more flexible. However, because we have marginalized these parameters away here, it is valid to compare the two models in the Bayesian case.
4.7 In the Bernoulli distribution, the likelihood Pr(x_{1…I}|λ) of the data $\{x_i\}_{i=1}^I$, where x_i ∈ {0, 1}, is

$$Pr(x_{1\ldots I}|\lambda) = \prod_{i=1}^{I}\lambda^{x_i}(1-\lambda)^{1-x_i}.$$
Find an expression for the maximum likelihood estimate of the parameter λ.
4.8 Find an expression for the MAP estimate of the Bernoulli parameter λ (see Problem 4.7), assuming a beta-distributed prior

$$Pr(\lambda) = \frac{\Gamma[\alpha+\beta]}{\Gamma[\alpha]\Gamma[\beta]}\lambda^{\alpha-1}(1-\lambda)^{\beta-1}.$$
4.9 Now consider the Bayesian approach to fitting Bernoulli data using a beta-distributed prior. Find expressions for (i) the posterior probability distribution over the Bernoulli parameter λ given observed data $\{x_i\}_{i=1}^I$ and (ii) the predictive distribution for new data x*.
4.10 Staying with the Bernoulli distribution, consider observing data 0,0,0,0 from four trials. Assuming a uniform beta prior (α = 1, β = 1), compute the predictive distribution using the (i) maximum likelihood, (ii) maximum a posteriori, and (iii) Bayesian approaches. Comment on the results.
___________________________________
1 Here we adopt the notation θ to represent a generic set of parameters when we have not specified the particular probability model.