Updated on July 1. Updated on June 24.

Here is the task check-list.

  1. Completes the derivation report.
  2. Adds new classes: one abstract class _BasesMixture and three derived classes GaussianMixture, BayesianGaussianMixture, and DirichletProcessGaussianMixture.
  3. Removes the numerical stability fixes for HMM. It seems that whenever there is a numerical issue, the code just adds 10*EPS to the computation. I think in some cases there is a better way to address the problem, such as normalizing the extremely small variables earlier, or we could simply remove 10*EPS, which is only needed for HMM (see the log-domain sketch after this list).
  4. Writes updating functions for BayesianGaussianMixtureModel and DirichletProcessGaussianMixtureModel according to the report.
  5. Provides methods that allow users to initialize the model with user-provided data
  6. Corrects k-means initialization. It is odd that when using k-means initialization, only the means are initialized; the weights and covariances are initialized by averaging.
  7. Writes several checking functions for the initialization data
  8. Adjusts the verbose messages. When verbose>1, it displays the log-likelihood and the time used on the same line as the Iteration x message.
  9. Simplifies fit_predict
  10. Adds warning for params!='wmc'
  11. Studies and contrasts the convergence of classical MLE / EM GMM with Bayesian GMM against the number of samples and the number of components
  12. Adds friendly warning and error messages, or addresses problems automatically where possible (e.g. random re-initialization of singular components)
  13. Adds examples that show how models can overfit by comparing the likelihood on training and validation sets (normalized by the number of samples), for instance by extending the BIC score example with a cross-validated likelihood plot
  14. Tests on 1-D data
  15. Tests on degenerate cases
  16. Adds AIC and BIC for VBGMM and DPGMM
  17. Old Faithful geyser data set
  18. Renames score_samples
  19. Adds integration tests
  20. [optional] add a partial_fit function for incremental / out-of-core fitting of (classical) GMM, for instance http://arxiv.org/abs/0712.4273
  21. [optional] ledoit_wolf covariance estimation
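
A note on item 3: instead of sprinkling 10*EPS into the computation, the E-step responsibilities can be normalized in the log domain, which removes the underflow problem at its source. The sketch below only illustrates that idea; the helper name normalize_responsibilities is my own placeholder, not code from the branch.

    import numpy as np
    from scipy.special import logsumexp

    def normalize_responsibilities(weighted_log_prob):
        """Normalize E-step responsibilities in the log domain.

        weighted_log_prob has shape (n_samples, n_components) and holds
        log(weight_k) + log N(x_n | mu_k, Sigma_k). Working in the log
        domain avoids underflow, so no 10*EPS fudge factor is needed.
        """
        log_norm = logsumexp(weighted_log_prob, axis=1, keepdims=True)
        log_resp = weighted_log_prob - log_norm
        return np.exp(log_resp), log_norm.ravel()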

The most important progress I have made is the derivation report, which includes the updating functions, the log-probability, and the predictive distribution for all three models, together with the implementation of the base class. Compared with the current scikit-learn math derivation documents, my report is consistent with PRML. It clearly shows that the updating functions of the three models share a lot of structure. We could abstract the common functions into the abstract base class _MixtureBase; the three models could then inherit from it and override the updating methods.
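
To make the shared structure concrete, here is a rough sketch (not the final design) of how the abstract base class could own the common fitting loop while each model overrides its own updates; the hook names _initialize, _e_step and _m_step are placeholders of mine.

    from abc import ABCMeta, abstractmethod

    class _MixtureBase(metaclass=ABCMeta):
        """Shared skeleton for the three mixture models."""

        def __init__(self, n_components=1, tol=1e-3, max_iter=100):
            self.n_components = n_components
            self.tol = tol
            self.max_iter = max_iter

        def fit(self, X):
            # the iteration pattern is identical for all three models
            self._initialize(X)
            lower_bound = -float('inf')
            for _ in range(self.max_iter):
                objective, resp = self._e_step(X)
                self._m_step(X, resp)
                prev_bound, lower_bound = lower_bound, objective
                if abs(lower_bound - prev_bound) < self.tol:
                    break
            return self

        @abstractmethod
        def _initialize(self, X):
            """Set up the initial parameters (or hyper-parameters)."""

        @abstractmethod
        def _e_step(self, X):
            """Return (objective value, responsibilities)."""

        @abstractmethod
        def _m_step(self, X, resp):
            """Model-specific update: maximum likelihood for GaussianMixture,
            variational posterior updates for the Bayesian variants."""

    class GaussianMixture(_MixtureBase):
        """Would override the hooks with classical EM (MLE) updates."""

    class BayesianGaussianMixture(_MixtureBase):
        """Would override the hooks with the VB updating functions from the report."""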

Next week I will finish the GaussianMixture model with the necessary testing functions.

Week 3 is off to a very exciting start. I finished the derivation of DPGMM, as well as the lower bound and the predictive probability for each model.

The difference between my derivation and the current document is that the current models assume a simpler approximation. The model defined in PRML is more accurate and provides more knobs. Both approximations appear in the literature, so maybe we should run some experiments to decide which one is better.

With regard to the new names for DPGMM and VBGMM, I think these two names are not suitable, just as calling an SVM "SMO" would be. The models are actually Bayesian GMM and Dirichlet process Bayesian GMM (DPGMM is the common shorthand), respectively; both are solved by variational inference. In other words, VBGMM is not a good name. The new names, I think, should convey "Bayesian GMM solved by VB" and "DP(B)GMM solved by VB".

I also took a close look at the code base. The code is not well maintained. The problems I am going to address are as follows.

  • decouple some large functions, such as _fit
  • use an abstract class and inheritance to reduce code redundancy
  • improve numerical stability. It seems that whenever there is a numerical issue, the code just adds EPS. I think in some places there is a better way to address the problem, such as normalizing the extremely small variables earlier.
  • write updating functions for BayesianGaussianMixtureModel and DirichletProcessGaussianMixtureModel
  • provide methods that allow users to initialize the model before fit
  • correct k-means initialization. It is odd that when using k-means initialization, only the means are initialized while the weights and covariances are initialized by averaging (see the sketch after this list).
  • write several checking functions for the initialization data
  • [optional] add a partial_fit function for incremental / out-of-core fitting of (classical) GMM, for instance http://arxiv.org/abs/0712.4273
  • [optional] ledoit_wolf covariance estimation
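
For the k-means initialization point above, a minimal sketch of the kind of fix I have in mind, assuming full covariance matrices (the helper name kmeans_init is hypothetical):

    import numpy as np
    from sklearn.cluster import KMeans

    def kmeans_init(X, n_components, random_state=None, reg_covar=1e-6):
        """Derive weights, means and per-cluster covariances from the k-means
        labels, instead of initializing only the means and averaging the rest."""
        labels = KMeans(n_clusters=n_components,
                        random_state=random_state).fit_predict(X)
        n_samples, n_features = X.shape
        weights = np.bincount(labels, minlength=n_components) / float(n_samples)
        means = np.array([X[labels == k].mean(axis=0)
                          for k in range(n_components)])
        # one covariance per cluster; the small ridge avoids singular matrices
        covariances = np.array([np.cov(X[labels == k].T)
                                + reg_covar * np.eye(n_features)
                                for k in range(n_components)])
        return weights, means, covariances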

In the last days of this week I implemented the structure of the new classes: _MixtureModelBase, GaussianMixtureModel, BayesianMixtureModel, and DirichletProcessMixtureModel. It gives us a big picture of the classes I am going to implement. I am looking forward to the feedback.

VBGMM

I finally finished writing all the derivations and equations for VBGMM in LaTeX. Download the derivation draft. It is crazy to write 12 pages of equations in a blog post, so I wrote them in a traditional LaTeX file. Typing math equations is always a pain: I have to be careful with mathbf, boldsymbol, and subscripts. It is boring and not cool to type \mathbf, \boldsymbol, \mu, \Lambda again and again; there are 440 occurrences of boldsymbol :|. So I created several snippets in Sublime for typing these LaTeX commands. I also learned some interesting advanced LaTeX techniques. There are many extremely long equations with 9 or 10 terms, and each term is either a frac, a horrible $\sum$, or a product of vectors. The environments split, align, aligned, and empheq are very helpful for typesetting them.

Well, these are not big deals. The most important thing is that there is no derivation for VBGMM in the current sklearn docs; I only found a doc with a derivation for DPGMM. So yes, I am done with VBGMM, provided no mistakes turn up after double-checking, and I am going to study DPGMM next. There is a small difference in the problem setting.

The setting in the current doc is not the same as the setting in PRML, and I think the difference will make the final updating functions differ from the current implementation.

The trick behind those derivations is "completing the square": identify the second-order and first-order terms in the equations, use the coefficients of those terms to "build" the probability density function we want, and then normalize it to absorb the remaining constants.
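
As a one-line illustration of the trick in the simplest scalar case (standard textbook form, not an excerpt from my report): if the log of a variational factor collects into a quadratic, the Gaussian can be read off from its coefficients,

    \ln q(\mu) = -\tfrac{1}{2}\, a\, \mu^{2} + b\, \mu + \mathrm{const}
    \quad\Longrightarrow\quad
    q(\mu) = \mathcal{N}\!\left(\mu \,\middle|\, \tfrac{b}{a},\, a^{-1}\right),

where the constant is absorbed by the normalization.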

GMM API

After a misguided attempt at deprecating the old GMM class, I created a new GaussianMixtureModel to keep the naming conventions and re-implemented the old GMM module to inherit from GaussianMixtureModel. The new GaussianMixtureModel has a reasonable score / score_samples API that is coherent with the other modules of sklearn. The new DensityMixin class implements score and serves as a mixin class for all current and future density estimators. The mixin class technique is cool; I had never heard of it before I dug into the sklearn code base.
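
To illustrate the mixin idea (a hypothetical sketch in the spirit of DensityMixin, not its actual implementation): any estimator that provides per-sample log-likelihoods via score_samples gets a consistent score for free.

    import numpy as np

    class DensityScoreMixin:
        """Hypothetical mixin: derives score from score_samples."""

        def score(self, X, y=None):
            # mean per-sample log-likelihood, the same convention for every
            # density estimator that mixes this class in
            return float(np.mean(self.score_samples(X)))

    class ToyGaussianDensity(DensityScoreMixin):
        """Minimal 1-D Gaussian density estimator, just to exercise the mixin."""

        def fit(self, X):
            self.mu_, self.var_ = np.mean(X), np.var(X)
            return self

        def score_samples(self, X):
            return -0.5 * (np.log(2 * np.pi * self.var_)
                           + (X - self.mu_) ** 2 / self.var_)

For example, ToyGaussianDensity().fit(x).score(x) returns the average log-likelihood of x without ToyGaussianDensity having to define score itself.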

Next Week

I hope I can finish the derivations of DPGMM and clean up the GMM API.

VBGMM

This week, I studied the variational inference for the GMM model described in Chapter 10 of Pattern Recognition and Machine Learning (PRML). I derived the updating functions of VBGMM with the “full” covariance type. There are so many equations. Download the file from Dropbox. I have also derived the updating functions for the other three covariance types, “tied”, “diag” and “sphere”, but I have not typed them into the LaTeX file yet.

I also studied adjustments to the GMM API. The discussions on issues #2473 and #4062 point out the inconsistency between score_samples and score. So I changed some functions and drafted a new API for them in an IPython notebook.

I am fortunate that my proposal on the Gaussian mixture model was accepted for Google Summer of Code 2015. I am very grateful to scikit-learn, the Python Software Foundation, and Google Summer of Code. As a PhD student in machine learning and data mining, I frequently process various kinds of data using Matlab, Python, and scikit-learn. Scikit-learn is a powerful and easy-to-use machine learning library for Python. Though I have only been using it for about one year, I cannot do without it in many of my projects now.

I first heard of GSoC in 2012, when my colleague pluskid participated in the Shogun project. The post he wrote about his experience is quite interesting and fun. After missing GSoC 2014 because of too many course projects, I began reading some of the scikit-learn code and learning git. Anyway, I am really looking forward to a wonderful journey this summer.

Introduction

This summer, I am focusing on the Gaussian mixture model and its two variants. Compared with the other two GSoC projects, my project looks a bit different, since it is more about fixing and refactoring than introducing new features. The following text is from my proposal.

The Gaussian mixture model (GMM) is a popular unsupervised clustering method. A GMM uses a linear combination of several Gaussian distributions to represent the probability distribution of the observations. In a GMM, given a fixed number of Gaussian components, a set of parameters is estimated to represent the distribution of the training data: the means, the covariances, and the coefficients of the linear combination. Expectation-Maximization (EM) is usually used to find the maximum-likelihood parameters of the mixture model. In each iteration, the E-step estimates the conditional distribution of the latent variables, and the M-step finds the model parameters that maximize the likelihood.
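
In the standard PRML notation (which I also follow in my report), the mixture density and the E-step responsibilities are

    p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\,
        \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
    \qquad
    \gamma(z_{nk}) =
        \frac{\pi_k\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
             {\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)},

and the M-step re-estimates, for example,

    N_k = \sum_{n=1}^{N} \gamma(z_{nk}), \qquad
    \boldsymbol{\mu}_k^{\mathrm{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, \mathbf{x}_n, \qquad
    \pi_k^{\mathrm{new}} = \frac{N_k}{N}.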

In the variational Bayesian Gaussian mixture model (VBGMM), the M-step is generalized into full Bayesian estimation, where the parameters are represented by posterior distributions instead of single point estimates as in maximum-likelihood estimation.

On the other hand, the Dirichlet process Gaussian mixture model (DPGMM) allows a mixture of infinitely many Gaussian distributions. It uses a Dirichlet process as a nonparametric prior on the distribution parameters, so the number of components can vary according to the data and does not have to be preset ahead of time. The simplest way to perform inference in a DPGMM is Markov chain Monte Carlo (MCMC), but it is generally slow to converge. In Blei's paper, truncated variational inference is proposed, which converges faster than MCMC.
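
For reference, the stick-breaking construction behind the truncated variational treatment (in its standard form, as in Blei and Jordan's paper) writes the mixture weights as

    \pi_k(\mathbf{v}) = v_k \prod_{j=1}^{k-1} (1 - v_j),
    \qquad v_k \sim \mathrm{Beta}(1, \alpha),

and the variational approximation truncates the stick at a level T by fixing q(v_T = 1) = 1.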

However, the current scikit-learn implementation suffers from interface incompatibility, incorrect model training, and incomplete testing, which hinders the wide use of these models.

Next

In the rest of the community bonding period, I will continue reading the related papers. The next post will be about the mathematical derivations. Stay tuned.