GSoC is approaching its end. I am very glad to have had such a great experience this summer. I explored the classical machine learning models: Gaussian mixture models (GM), Bayesian Gaussian mixture models with variational inference (BGM), and Dirichlet process Gaussian mixtures (DPGM). The code and documentation are in PR4802.

Besides the code and documentation, I made some animations and IPython notebooks for these three models.

In conclusion, I finished the tasks in the proposal, but I didn't have time for the optional tasks, i.e., the incremental EM algorithm and different covariance estimators. In any case, after GSoC I will continue to contribute to the scikit-learn project.

My mentor gave me some useful advice after I finished all the code for BayesianGaussianMixture and DirichletProcessGaussianMixture. So in these two weeks, I fixed the style problems and wrote all the necessary test cases for BayesianGaussianMixture. I also made a visualization of the Gaussian mixture with variational inference for the four types of precision using matplotlib.animation, link.
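The notebook behind the link is not reproduced here, but the sketch below shows the general idea under a few assumptions of mine: it animates the fitted components with matplotlib.animation.FuncAnimation, and it uses the released sklearn.mixture.GaussianMixture with warm_start=True and max_iter=1 to take a snapshot after every EM step, whereas the real animation used the variational models and all four precision types.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from matplotlib.patches import Ellipse
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(200, 2) + [-3, 0], rng.randn(200, 2) + [3, 0]])

# warm_start=True with max_iter=1 makes each fit() call run one EM step.
gmm = GaussianMixture(n_components=2, covariance_type='full', max_iter=1,
                      warm_start=True, init_params='random', random_state=1)
snapshots = []
for _ in range(25):
    gmm.fit(X)
    snapshots.append((gmm.means_.copy(), gmm.covariances_.copy()))

fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], s=5, alpha=0.4)

def draw_frame(i):
    for patch in list(ax.patches):   # clear the previous frame's ellipses
        patch.remove()
    means, covs = snapshots[i]
    for mean, cov in zip(means, covs):
        vals, vecs = np.linalg.eigh(cov)             # ascending eigenvalues
        angle = np.degrees(np.arctan2(vecs[1, 0], vecs[0, 0]))
        ax.add_patch(Ellipse(mean, *(4 * np.sqrt(vals)),  # 2-sigma ellipse
                             angle=angle, fill=False))
    ax.set_title('EM iteration %d' % (i + 1))

anim = FuncAnimation(fig, draw_frame, frames=len(snapshots), interval=200)
plt.show()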

As a next step, I will write the test cases for DirichletProcessGaussianMixture and explore the optional tasks, namely incremental learning and other covariance estimators.

Week 8 and 9

In weeks 8 and 9, I implemented DirichletProcessGaussianMixture. Its behavior looks similar to BayesianGaussianMixture: both of them can infer the best number of components. DirichletProcessGaussianMixture took slightly more iterations than BayesianGaussianMixture to converge on the Old Faithful data set, around 60 iterations.

If we solve the Dirichlet process mixture by Gibbs sampling, we don't need to specify the truncation level $T$; the concentration parameter $\alpha$ is enough. On the other hand, with variational inference we still need to specify the maximal possible number of components, i.e., the truncation level.
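In my own notation (following the standard variational treatment of DP mixtures, not taken from the PR), the truncation enters through the stick-breaking representation of the mixing weights:

$$ \pi_k(\mathbf{v}) = v_k \prod_{j=1}^{k-1} (1 - v_j), \qquad v_k \sim \mathrm{Beta}(1, \alpha), $$

and the variational posterior simply fixes $q(v_T = 1) = 1$, so all components beyond $T$ get zero weight. A Gibbs sampler can work with the unbounded representation directly, which is why only $\alpha$ is needed there.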

At first, the lower bound of DirichletProcessGaussianMixture seemed a little strange. It was not always going up: when some clusters disappear, it goes down a little bit, then goes straight up again. I thought it was because the estimation of the parameters is ill-posed when those clusters have fewer data samples than features. I redid the mathematical derivation of the Dirichlet process mixture model and found it was actually a bug in the code for a very long equation.

I also finished the code of BayesianGaussianMixture for the 'tied', 'diag' and 'spherical' precision types.

My mentor pointed out the style problems in my code and docstrings. I knew the PEP8 convention, but had no idea there was also a convention for docstrings, PEP257. It took me a lot of time to fix these style problems.
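As an illustration (my own toy example, not code from the PR), a docstring in the numpydoc style that scikit-learn expects, which also satisfies PEP257, looks roughly like this:

def log_prob_per_component(X, means, precisions):
    """Estimate the log-probability of each sample under each component.

    Parameters
    ----------
    X : array-like, shape (n_samples, n_features)
        Input data.
    means : array-like, shape (n_components, n_features)
        Mean of each mixture component.
    precisions : array-like
        Precision (inverse covariance) of each component.

    Returns
    -------
    log_prob : array, shape (n_samples, n_components)
        Log-probability of each sample under each component.
    """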

Progress report 2

During the last 5 weeks (since progress report 1), I finished:

  1. GaussianMixture with four kinds of covariance
  2. Most test cases of GaussianMixture
  3. BayesianGaussianMixture with four kinds of covariance
  4. DirichletProcessGaussianMixture

Although I spent some time on unsuccessful attempts, such as decoupling the observation model and the hidden model into mixin classes and double-checking the DP equations, I did finish the most essential part of my project and did some visualization. In the following 4 weeks, I will finish all the test cases for BayesianGaussianMixture and DirichletProcessGaussianMixture, and do some optional tasks, such as different covariance estimators and incremental GMM.

In weeks 6 and 7, I coded BayesianGaussianMixture for the full covariance type. Now it runs smoothly on synthetic data and the Old Faithful data. Take a peek at the demo.

from sklearn.mixture.bayesianmixture import BayesianGaussianMixture as BGM

# X is a 2-D data array, e.g. the Old Faithful samples
bgm = BGM(n_init=1, n_iter=100, n_components=7, verbose=2,
          init_params='random', precision_type='full')
bgm.fit(X)

BayesianGaussianMixture on old-faithful dataset. n_components=6, alpha=1e-3

The demo repeats the experiment from PRML, page 480, Figure 10.6. Variational inference on the Bayesian GMM shows its capability of inferring the number of components automatically. It converged in 47 iterations.

The lower bound of the log-likelihood, a.k.a. the ELBO

The ELBO looks a little weird. It is not always going up: when some clusters disappear, the ELBO goes down a little bit, then goes straight up again. I think it is because the estimation of the parameters is ill-posed when those clusters have fewer data samples than features.
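For context (my summary of the standard variational Bayes argument in PRML chapter 10, not text from the PR), the quantity plotted is the evidence lower bound

$$ \mathcal{L}(q) = \mathbb{E}_q\!\left[\ln p(X, Z, \pi, \mu, \Lambda)\right] - \mathbb{E}_q\!\left[\ln q(Z, \pi, \mu, \Lambda)\right] \le \ln p(X), $$

and each exact coordinate update of $q$ can only increase $\mathcal{L}(q)$, so a genuine dip points to a numerical problem or a bug rather than expected behavior.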

The BayesianGaussianMixture has many more parameters than GaussianMixture; there are six parameters per component. It is not easy to keep so many functions and parameters under control, and the initial design of BaseMixture is also not great. I took a look at bnpy, which is a more elaborate implementation of variational inference for various mixture models. Although I don't need anything that complicated, its decoupling of the observation model, i.e. $X$, $\mu$, $\Lambda$, from the mixture model, i.e. $Z$, $\pi$, is quite nice. So I tried to use mixin classes to represent these two models. I split MixtureBase into three abstract classes, ObsMixin, HiddenMixin and MixtureBase(ObsMixin, HiddenMixin). I also implemented subclasses for the Gaussian mixture, ObsGaussianMixin(ObsMixin), MixtureMixin(HiddenMixin) and GaussianMixture(MixtureBase, ObsGaussianMixin, MixtureMixin), but Python did not let me do this because there was no consistent MRO. :-| I changed them back, but this unsuccessful experiment left me with a nicer base class, MixtureBase.
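The exact class layout from my branch is not reproduced here; the toy example below only illustrates the kind of C3 linearization failure that can block such a mixin split, namely when one base class ends up listed both beside and above another in a subclass's bases.

class ObsMixin:            # observation model: X, mu, Lambda
    pass

class HiddenMixin:         # hidden/mixture model: Z, pi
    pass

class MixtureBase(ObsMixin, HiddenMixin):
    pass

try:
    # ObsMixin is listed before its own subclass MixtureBase, so no method
    # resolution order can satisfy both orderings.
    class BrokenGaussianMixture(ObsMixin, MixtureBase):
        pass
except TypeError as exc:
    print(exc)  # "Cannot create a consistent method resolution order (MRO) ..."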

I also tried to use cached_property to store intermediate variables such as $\ln\pi$, $\ln\Lambda$ and the Cholesky decomposition of $W^{-1}$, but didn't get much benefit. It is almost the same as saving these variables as private attributes on the instance.
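A rough sketch of the comparison (not the PR code; cached_property here is functools.cached_property, which only landed in the standard library with Python 3.8, so the original experiment would have used a custom descriptor):

from functools import cached_property

import numpy as np
from scipy import linalg

class LazyChol:
    """Compute the Cholesky factor of W^{-1} lazily, on first access."""

    def __init__(self, W_inv):
        self._W_inv = W_inv

    @cached_property
    def chol(self):
        return linalg.cholesky(self._W_inv, lower=True)

class EagerChol:
    """Store the factor as a private attribute right after each update."""

    def __init__(self, W_inv):
        self._W_inv = W_inv
        self._chol = linalg.cholesky(W_inv, lower=True)

W_inv = np.array([[2.0, 0.3], [0.3, 1.0]])
print(LazyChol(W_inv).chol)
print(EagerChol(W_inv)._chol)

Since the parameters change at every VB iteration, the cached value has to be recomputed (or the cache cleared) each time anyway, which is why the two approaches end up performing about the same.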

The numerical issue comes from responsibilities that are extremely small: estimating resp * log(resp) then gives NaN. I simply avoid computing the term when resp < 10 * EPS. Still, the ELBO seems suspicious.
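A small sketch of the workaround (my own reconstruction), plus an alternative: scipy.special.xlogy returns 0 where its first argument is 0, instead of producing the 0 * log(0) = NaN that plain multiplication gives.

import numpy as np
from scipy.special import xlogy

EPS = np.finfo(float).eps
resp = np.array([1.0 - 1e-7, 1e-7, 0.0, 1e-300])

# Thresholded version: only evaluate resp * log(resp) where resp is non-negligible.
entropy = np.zeros_like(resp)
mask = resp > 10 * EPS
entropy[mask] = resp[mask] * np.log(resp[mask])

# Threshold-free alternative: xlogy(0, 0) is defined as 0.
entropy_alt = xlogy(resp, resp)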

The current implementation of VBGMM in scikit-learn cannot learn the correct parameters on the Old Faithful data.

VBGMM(alpha=0.0001, covariance_type='full', init_params='wmc',
   min_covar=None, n_components=6, n_iter=100, params='wmc',
   random_state=None, thresh=None, tol=0.001, verbose=0)
 

It effectively gives only one component. The weights_ are

 array([  7.31951611e-07,   7.31951611e-07,   7.31951611e-07,
         7.31951611e-07,   7.31951611e-07,   9.99996340e-01])
 

I also implemented DirichletProcessGaussianMixture, but currently it behaves much the same as BayesianGaussianMixture: both of them can infer the best number of components, and DirichletProcessGaussianMixture takes slightly more iterations than BayesianGaussianMixture. If we infer the Dirichlet process mixture by Gibbs sampling, we don't need to specify the truncation level; the concentration parameter $\alpha$ alone is enough. But with variational inference, we still need to give the model the maximal possible number of components, i.e., the truncation level $T$.

Week 5 began with a discussion about whether we should deprecate params. I fixed some bugs in the checking functions, the random number generator and one of the covariance updating methods. In the following days, I completed the main functions of GaussianMixture and all test cases, except for AIC, BIC and the sampling functions. The tests are somewhat challenging, since the current implementation in the master branch contains very old test cases imported from Weiss's implementation that never got improved. I simplified the test cases and wrote more tests not covered by the current implementation, such as covariance estimation, ground-truth parameter recovery, and various user-friendly warnings and errors.
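As an example of the kind of ground-truth recovery test meant here (a sketch written against the released sklearn.mixture.GaussianMixture API rather than the branch under review, so names may differ slightly):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
true_means = np.array([[-3.0, 0.0], [3.0, 0.0]])
X = np.vstack([rng.randn(500, 2) + m for m in true_means])

gmm = GaussianMixture(n_components=2, covariance_type='full',
                      random_state=0).fit(X)

# Sort the fitted means so the check is invariant to label permutation.
fitted = gmm.means_[np.argsort(gmm.means_[:, 0])]
assert np.allclose(fitted, true_means, atol=0.3)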

Next week, I will begin to code BayesianGaussianMixture.