Variational Inference: Gaussian Mixture model
Variational inference methods in Bayesian inference and machine learning are techniques for approximating intractable integrals. Most machine learning techniques involve finding point estimates (maximum likelihood estimation (MLE), maximum a posteriori (MAP) estimation) of the latent variables and parameters, which do not account for any uncertainty. To account for uncertainty beyond point estimates, an entire posterior distribution over the latent variables and parameters is approximated.
To approximate the posterior distribution, which involves intractable integrals, variational inference methods are applied to make the calculation tractable (i.e. to give it some kind of closed form mathematically), so that it does not require computing the marginal distribution of the observed variables, P(X).
The marginalisation over Z to calculate P(X) is intractable because the search space of Z is combinatorially large, so we seek an approximation using a variational distribution Q(Z) with its own variational parameters.
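Concretely, the intractable quantity is the evidence (shown here in integral form; for discrete Z it is a sum over all configurations):

```latex
p(X) = \int p(X, Z)\, dZ = \int p(X \mid Z)\, p(Z)\, dZ
```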
Variational Inference as an Optimisation problem
Since we are trying to approximate the true posterior distribution p(Z|X) with Q(Z), a natural measure of the dissimilarity between the true and the approximate posterior is the Kullback–Leibler divergence (KL divergence), which is the expectation, taken with respect to the approximate distribution, of the difference between the logarithms of the two distributions. Treating the KL divergence as an objective function, it should be minimised with respect to the variational parameters of Q(Z). However, computing the KL divergence directly would itself require the posterior P(Z|X), so the objective is instead re-expressed in terms of the Evidence Lower Bound (ELBO). The KL divergence and its relation to the ELBO can be written as
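In the standard form (following [2]), writing the ELBO as L(Q):

```latex
\mathrm{KL}\big(Q(Z) \,\|\, p(Z \mid X)\big)
  = \mathbb{E}_{Q}\!\left[ \log \frac{Q(Z)}{p(Z \mid X)} \right]
  = \log p(X) - \mathcal{L}(Q),
\qquad
\mathcal{L}(Q) = \mathbb{E}_{Q}\!\left[ \log \frac{p(X, Z)}{Q(Z)} \right]
```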
Here the log evidence log P(X) is constant with respect to Q(Z), so maximising the term L(Q) (the evidence lower bound) minimises the KL divergence. By choosing an appropriate family for Q(Z), L(Q) becomes tractable to compute and to maximise.
Mean field approximation
The mean-field approximation assumes that all the hidden variables are independent of each other, which simplifies the joint variational distribution of the hidden variables to a product of the marginal distributions of each hidden variable.
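In symbols, the mean-field factorisation reads (standard form, cf. [2]):

```latex
Q(Z) = \prod_{j=1}^{N} q_j(Z_j)
```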
where N is the number of hidden variables. It can be shown that the optimal distribution q*ⱼ(Zⱼ) for maximising the ELBO is given by
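That optimum takes the standard form (cf. [2]):

```latex
\log q_j^{*}(Z_j) = \mathbb{E}_{i \neq j}\!\left[ \log p(X, Z) \right] + \text{const.}
```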
which is the expectation of the log joint probability distribution of the observed and latent variables, taken over all the variables other than Zⱼ. Equations of this type can typically be simplified into functions of the hyper-parameters of the prior distributions and of expectations of the latent variables that are not currently being updated. Such equations create circular dependencies between the parameters of the distribution over the variable being updated and the expectations of the variables not being updated; this circularity will become clearer in the following section. This circular dependence among the parameters of the variational distributions suggests an iterative update procedure, much like the Expectation-Maximisation algorithm.
Variational inference in Gaussian mixture model
Consider the above Bayesian Gaussian mixture model in plate notation, where square plates denote the hyper-parameters, large circular plates denote latent variables, and filled-in objects denote known values.
Here we consider a finite Bayesian Gaussian mixture model with K components for a data-set of N observations, each of dimension D, with unknown parameters (mean vectors and precision matrices). For the graphical model to be in a Bayesian setting, we need to define priors (initial beliefs) on the unknown parameters: the component weights (π), the mean vectors (μ_{i=1…K}), the precision matrices (Λ_{i=1…K}), and a latent variable (Z = z₁,…,zₙ) for each observation in the data, indicating which component the observation belongs to¹.
- A K-dimensional symmetric Dirichlet distribution prior for the component weights, with each hyper-parameter set to α₀.
- A Wishart distribution prior on each component's precision matrix, with known hyper-parameters (W₀, ν₀).
- A categorical distribution for each latent variable, represented as a K-dimensional one-hot vector in which exactly one element is 1 and the rest are 0.
- A multivariate Gaussian distribution on the mean of each mixture component, with known hyper-parameters μ₀ and β₀.
The joint probability distribution of the variables can be written as
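Following the conditional structure of the model (cf. [2]):

```latex
p(X, Z, \pi, \mu, \Lambda)
  = p(X \mid Z, \mu, \Lambda)\, p(Z \mid \pi)\, p(\pi)\, p(\mu \mid \Lambda)\, p(\Lambda)
```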
where the factorisation of the joint probability distribution into individual distributions depends on the structure of the graphical model. The individual factors can be written as:
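In the standard notation of [2], with zₙₖ the entries of the one-hot indicator vectors:

```latex
p(Z \mid \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{\, z_{nk}}, \qquad
p(X \mid Z, \mu, \Lambda) = \prod_{n=1}^{N} \prod_{k=1}^{K}
  \mathcal{N}\!\big(x_n \mid \mu_k, \Lambda_k^{-1}\big)^{z_{nk}},

p(\pi) = \mathrm{Dir}(\pi \mid \alpha_0) = C(\alpha_0) \prod_{k=1}^{K} \pi_k^{\alpha_0 - 1},

p(\mu \mid \Lambda) = \prod_{k=1}^{K} \mathcal{N}\!\big(\mu_k \mid \mu_0, (\beta_0 \Lambda_k)^{-1}\big), \qquad
p(\Lambda) = \prod_{k=1}^{K} \mathcal{W}(\Lambda_k \mid W_0, \nu_0)
  = \prod_{k=1}^{K} B(W_0, \nu_0)\, |\Lambda_k|^{(\nu_0 - D - 1)/2}
    \exp\!\Big( -\tfrac{1}{2} \operatorname{Tr}\big(W_0^{-1} \Lambda_k\big) \Big)
```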
where C(α), the normalisation constant of the Dirichlet distribution, and B(W, ν), that of the Wishart distribution, are given by
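These take the standard forms (cf. [2], Appendix B):

```latex
C(\alpha) = \frac{\Gamma\!\big( \sum_{k} \alpha_k \big)}{\prod_{k} \Gamma(\alpha_k)}, \qquad
B(W, \nu) = |W|^{-\nu/2}
  \left( 2^{\nu D / 2}\, \pi^{D(D-1)/4}
  \prod_{i=1}^{D} \Gamma\!\Big( \frac{\nu + 1 - i}{2} \Big) \right)^{-1}
```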
where ν is the degrees of freedom, and D is both the dimension of the scale matrix of the Wishart distribution and the dimensionality of each data point.
We have to approximate the true posterior distribution P(Z, μ, Λ, π | X) with a variational distribution over the latent variables, i.e. with Q(Z, μ, Λ, π). The variational distribution can be factorised into individual factors according to the mean-field assumption.
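That is, splitting between Z and the parameters (cf. [2]):

```latex
Q(Z, \pi, \mu, \Lambda) = q(Z)\, q(\pi, \mu, \Lambda)
```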
Now, using the equation for the optimal variational distribution of each factor, let us first calculate Q*(Z): we focus only on the terms containing Z and take the expectation over the other latent variables, which gives:
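The corresponding optimality condition (cf. [2]):

```latex
\log q^{*}(Z)
  = \big\langle \log p(X, Z, \pi, \mu, \Lambda) \big\rangle_{\pi, \mu, \Lambda} + \text{const.}
  = \big\langle \log p(Z \mid \pi) \big\rangle_{\pi}
  + \big\langle \log p(X \mid Z, \mu, \Lambda) \big\rangle_{\mu, \Lambda} + \text{const.}
```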
where the ⟨ ⟩ symbol denotes the expectation over the latent variables³ indicated in its subscript. Substituting the expressions for p(X ∣ Z, μ, Λ) and p(Z ∣ π) into the above expression, we get
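which, in the standard form of [2], collects into

```latex
\log q^{*}(Z) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \log \rho_{nk} + \text{const.},
\qquad
q^{*}(Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} r_{nk}^{\, z_{nk}}
```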
where rₙₖ is expressed as
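A sketch of the standard expression (cf. [2]):

```latex
r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^{K} \rho_{nj}},
\qquad
\log \rho_{nk} = \mathbb{E}[\log \pi_k]
  + \tfrac{1}{2}\,\mathbb{E}\big[ \log |\Lambda_k| \big]
  - \tfrac{D}{2} \log(2\pi)
  - \tfrac{1}{2}\,\mathbb{E}_{\mu_k, \Lambda_k}\!\big[ (x_n - \mu_k)^{\mathsf T} \Lambda_k (x_n - \mu_k) \big]
```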
Thus the optimal q*(Z) is a product of categorical distributions, one for each latent variable, with parameters rₙₖ for k = 1…K.
Similarly, we now evaluate the optimal variational distributions of the other variables. According to our graphical model, we can express Q(π, μ, Λ) as the product of Q(π) and Q(μ, Λ). Then we can write
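Keeping only the terms involving π (standard form, cf. [2]):

```latex
\log q^{*}(\pi) = (\alpha_0 - 1) \sum_{k=1}^{K} \log \pi_k
  + \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \log \pi_k + \text{const.}
```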
Taking the exponential of both sides of the above equation, we recognise the form of a Dirichlet distribution, q*(π) ~ Dir(α), with parameters
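namely (cf. [2]):

```latex
q^{*}(\pi) = \mathrm{Dir}(\pi \mid \alpha), \qquad
\alpha_k = \alpha_0 + N_k, \qquad
N_k = \sum_{n=1}^{N} r_{nk}
```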
Finally, we can derive the optimal variational distribution q*(μₖ, Λₖ).
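Collecting the terms of the joint distribution that involve μₖ and Λₖ (standard form, cf. [2]):

```latex
\log q^{*}(\mu_k, \Lambda_k)
  = \log \mathcal{N}\!\big(\mu_k \mid \mu_0, (\beta_0 \Lambda_k)^{-1}\big)
  + \log \mathcal{W}(\Lambda_k \mid W_0, \nu_0)
  + \sum_{n=1}^{N} r_{nk} \log \mathcal{N}\!\big(x_n \mid \mu_k, \Lambda_k^{-1}\big)
  + \text{const.}
```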
After some mathematical manipulation, the variational distribution assumes the form of a Gaussian-Wishart distribution, which is given by:
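That is (cf. [2]):

```latex
q^{*}(\mu_k, \Lambda_k)
  = \mathcal{N}\!\big(\mu_k \mid m_k, (\beta_k \Lambda_k)^{-1}\big)\,
    \mathcal{W}(\Lambda_k \mid W_k, \nu_k)
```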
with parameters mₖ, βₖ, Wₖ, νₖ which are defined as
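In the standard form of [2] (with μ₀ and β₀ the prior hyper-parameters defined earlier):

```latex
\beta_k = \beta_0 + N_k, \qquad \nu_k = \nu_0 + N_k, \qquad
m_k = \frac{1}{\beta_k} \big( \beta_0 \mu_0 + N_k \bar{x}_k \big),

W_k^{-1} = W_0^{-1} + N_k S_k
  + \frac{\beta_0 N_k}{\beta_0 + N_k} (\bar{x}_k - \mu_0)(\bar{x}_k - \mu_0)^{\mathsf T},

\text{where} \quad
N_k = \sum_{n=1}^{N} r_{nk}, \qquad
\bar{x}_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk}\, x_n, \qquad
S_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk}\, (x_n - \bar{x}_k)(x_n - \bar{x}_k)^{\mathsf T}
```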
All the parameters defined above (mₖ, βₖ, Wₖ, νₖ) depend on the categorical distribution parameters rₙₖ, which in turn depend on these same parameters. To evaluate rₙₖ, the following expectations need to be evaluated:
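The standard expressions are (cf. [2]):

```latex
r_{nk} \propto \tilde{\pi}_k\, \tilde{\Lambda}_k^{1/2}
  \exp\!\Big( -\frac{D}{2\beta_k}
  - \frac{\nu_k}{2} (x_n - m_k)^{\mathsf T} W_k (x_n - m_k) \Big),

\log \tilde{\pi}_k \equiv \mathbb{E}[\log \pi_k]
  = \psi(\alpha_k) - \psi\!\Big( \textstyle\sum_{j=1}^{K} \alpha_j \Big),

\log \tilde{\Lambda}_k \equiv \mathbb{E}\big[ \log |\Lambda_k| \big]
  = \sum_{i=1}^{D} \psi\!\Big( \frac{\nu_k + 1 - i}{2} \Big) + D \log 2 + \log |W_k|,

\mathbb{E}_{\mu_k, \Lambda_k}\!\big[ (x_n - \mu_k)^{\mathsf T} \Lambda_k (x_n - \mu_k) \big]
  = \frac{D}{\beta_k} + \nu_k (x_n - m_k)^{\mathsf T} W_k (x_n - m_k)
```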
where ψ is the digamma function, the logarithmic derivative of the gamma function. The circular dependence of these variational parameters leads to an iterative update procedure that alternates between two steps and is guaranteed to converge with respect to maximisation of the ELBO, or equivalently minimisation of the KL divergence between the true posterior and the approximate posterior over the latent variables. The two alternating steps, sketched in code after the list below, are as follows:
- An E-step (expectation step) that computes the values of rₙₖ using the current values of all the other parameters.
- An M-step (maximisation/update step) that uses the new rₙₖ values to update all the other variational parameters.
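As a concrete illustration, the following is a minimal NumPy sketch of this alternating scheme. The function names (e_step, m_step, fit), the choices W₀ = I and ν₀ = D, the use of the data mean for μ₀, and the random initialisation of the responsibilities are illustrative assumptions and are not taken from the linked repository.

```python
import numpy as np
from scipy.special import digamma

def e_step(X, alpha, beta, m, W, nu):
    """Compute responsibilities r[n, k] from the current variational parameters."""
    N, D = X.shape
    K = alpha.shape[0]
    # E[ln pi_k] = psi(alpha_k) - psi(sum_j alpha_j)
    ln_pi = digamma(alpha) - digamma(alpha.sum())
    # E[ln |Lambda_k|] = sum_i psi((nu_k + 1 - i)/2) + D ln 2 + ln |W_k|
    ln_Lambda = np.array([
        digamma(0.5 * (nu[k] + 1 - np.arange(1, D + 1))).sum()
        + D * np.log(2.0) + np.linalg.slogdet(W[k])[1]
        for k in range(K)
    ])
    log_rho = np.empty((N, K))
    for k in range(K):
        diff = X - m[k]                                   # (N, D)
        quad = nu[k] * np.einsum('ni,ij,nj->n', diff, W[k], diff)
        log_rho[:, k] = (ln_pi[k] + 0.5 * ln_Lambda[k]
                         - 0.5 * D * np.log(2 * np.pi)
                         - 0.5 * (D / beta[k] + quad))
    log_rho -= log_rho.max(axis=1, keepdims=True)         # normalise in log space
    r = np.exp(log_rho)
    return r / r.sum(axis=1, keepdims=True)

def m_step(X, r, alpha0, beta0, m0, W0_inv, nu0):
    """Update (alpha_k, beta_k, m_k, W_k, nu_k) from the responsibilities."""
    N, D = X.shape
    Nk = r.sum(axis=0) + 1e-10                            # effective counts N_k
    xbar = (r.T @ X) / Nk[:, None]                        # weighted component means
    alpha = alpha0 + Nk
    beta = beta0 + Nk
    nu = nu0 + Nk
    m = (beta0 * m0 + Nk[:, None] * xbar) / beta[:, None]
    W = []
    for k in range(len(Nk)):
        diff = X - xbar[k]
        Sk = (r[:, k, None] * diff).T @ diff / Nk[k]      # weighted covariance S_k
        dm = (xbar[k] - m0)[:, None]
        Wk_inv = (W0_inv + Nk[k] * Sk
                  + (beta0 * Nk[k] / (beta0 + Nk[k])) * (dm @ dm.T))
        W.append(np.linalg.inv(Wk_inv))
    return alpha, beta, m, np.array(W), nu

def fit(X, K=6, alpha0=1e-3, beta0=1.0, nu0=None, n_iter=200, seed=0):
    """Alternate the two steps for a fixed number of iterations (illustrative)."""
    N, D = X.shape
    nu0 = float(D) if nu0 is None else nu0                # assumed: nu_0 = D
    m0 = X.mean(axis=0)                                   # assumed: mu_0 = data mean
    W0_inv = np.eye(D)                                    # assumed: W_0 = identity
    r = np.random.default_rng(seed).dirichlet(np.ones(K), size=N)
    for _ in range(n_iter):
        alpha, beta, m, W, nu = m_step(X, r, alpha0, beta0, m0, W0_inv, nu0)
        r = e_step(X, alpha, beta, m, W, nu)
    return r, (alpha, beta, m, W, nu)
```

In practice the loop would be terminated by monitoring the evidence lower bound, as described in the next section, rather than by a fixed iteration count.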
Evidence lower bound calculation
For the variational mixture of Gaussians, the lower bound is given by²:
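In the standard decomposition (cf. [2]):

```latex
\mathcal{L} =
    \mathbb{E}\big[ \log p(X \mid Z, \mu, \Lambda) \big]
  + \mathbb{E}\big[ \log p(Z \mid \pi) \big]
  + \mathbb{E}\big[ \log p(\pi) \big]
  + \mathbb{E}\big[ \log p(\mu, \Lambda) \big]
  - \mathbb{E}\big[ \log q(Z) \big]
  - \mathbb{E}\big[ \log q(\pi) \big]
  - \mathbb{E}\big[ \log q(\mu, \Lambda) \big]
```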
The various terms involved in the calculation of the bound are as follows:
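A sketch of these terms, following the standard formulation in [2] (with π̃ₖ and Λ̃ₖ the exponentiated expectations defined above, and x̄ₖ, Sₖ the responsibility-weighted means and covariances):

```latex
% sketch following the standard variational GMM bound in [2]
\mathbb{E}\big[ \log p(X \mid Z, \mu, \Lambda) \big]
  = \tfrac{1}{2} \sum_{k=1}^{K} N_k \Big\{ \log \tilde{\Lambda}_k - \frac{D}{\beta_k}
    - \nu_k \operatorname{Tr}(S_k W_k)
    - \nu_k (\bar{x}_k - m_k)^{\mathsf T} W_k (\bar{x}_k - m_k)
    - D \log(2\pi) \Big\}

\mathbb{E}\big[ \log p(Z \mid \pi) \big]
  = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \log \tilde{\pi}_k

\mathbb{E}\big[ \log p(\pi) \big]
  = \log C(\alpha_0) + (\alpha_0 - 1) \sum_{k=1}^{K} \log \tilde{\pi}_k

\mathbb{E}\big[ \log p(\mu, \Lambda) \big]
  = \tfrac{1}{2} \sum_{k=1}^{K} \Big\{ D \log \frac{\beta_0}{2\pi} + \log \tilde{\Lambda}_k
    - \frac{D \beta_0}{\beta_k}
    - \beta_0 \nu_k (m_k - \mu_0)^{\mathsf T} W_k (m_k - \mu_0) \Big\}
    + K \log B(W_0, \nu_0)
    + \frac{\nu_0 - D - 1}{2} \sum_{k=1}^{K} \log \tilde{\Lambda}_k
    - \tfrac{1}{2} \sum_{k=1}^{K} \nu_k \operatorname{Tr}\big( W_0^{-1} W_k \big)

\mathbb{E}\big[ \log q(Z) \big]
  = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \log r_{nk}

\mathbb{E}\big[ \log q(\pi) \big]
  = \log C(\alpha) + \sum_{k=1}^{K} (\alpha_k - 1) \log \tilde{\pi}_k

\mathbb{E}\big[ \log q(\mu, \Lambda) \big]
  = \sum_{k=1}^{K} \Big\{ \tfrac{1}{2} \log \tilde{\Lambda}_k
    + \tfrac{D}{2} \log \frac{\beta_k}{2\pi}
    - \tfrac{D}{2} - H\big[ q(\Lambda_k) \big] \Big\}
```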
where C(α) and B(W, ν) were defined earlier and H[q(Λₖ)] is the entropy of the Wishart distribution. The evidence lower bound is used as the convergence criterion, i.e. the algorithm is considered to have converged when there is no significant improvement in the bound between iterations.
Implementation of the above algorithm on a real-life data-set
Consider the scaled version of the Old Faithful data-set, which, when plotted, shows only two clusters present in the data.
On applying the above variational Expectation-Maximisation algorithm to the data-set with K = 6 components, suitable values of the hyper-parameters, and proper initialisation of the variational parameters, we see that the algorithm converges to only two components, as expected.
The above plots depict the cluster assignments of the data points at various iterations, in which the number of colours present within a plot indicates the total number of clusters. From the last plot it is evident that the algorithm converges with only two clusters. From the plot below we can see that the evidence lower bound converges, settling close to zero, after around 100 iterations, which shows that the algorithm was able to approximate the true posterior distribution of the latent variables with the variational distribution.
The code for the above implementation is available at:
Github URL: https://github.com/Ashu9808/Variational-inference-Gaussian-mixture-model
[1]: Variational Bayes methods (https://en.wikipedia.org/wiki/Variational_Bayesian_methods)
[2]: C. M. Bishop: Pattern Recognition and Machine Learning, Springer, 2006 (Ch. 9–11 & Appendix B)
[3]: http://sap.ist.i.kyoto-u.ac.jp/members/yoshii/lectures/pattern_recognition/2017/20170606-npb-gmm.pdf