Gaussian Mixture with BIC/AIC

Describe the workflow you want to enable

Clustering with Gaussian mixture modeling frequently entails choosing the best model parameters, such as the number of components and the covariance constraint. This demonstration is very helpful to me, but I think it would be great to have a class like LassoLarsIC that does the job automatically.

Describe your proposed solution

Add a class (say, GaussianMixtureIC) that automatically selects the best GM model among a set of candidate models based on BIC or AIC. As mentioned above, the set of models would be parameterized by:

Initialization scheme, which could be random, k-means or agglomerative clusterings (as done in mclust, see below)
Covariance constraints
Number of components
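
As a rough illustration of the kind of sweep such a class would automate, here is a minimal sketch using scikit-learn's existing `GaussianMixture` and its `bic` method. The helper name `select_gmm_by_bic` and the toy data are assumptions made for this example, and the sketch does not cover the initialization schemes listed above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def select_gmm_by_bic(X, n_components_range=range(1, 6),
                      covariance_types=("spherical", "tied", "diag", "full"),
                      random_state=0):
    """Hypothetical helper: fit one model per setting, keep the lowest BIC."""
    best_model, best_bic = None, np.inf
    for cov_type in covariance_types:
        for k in n_components_range:
            gm = GaussianMixture(n_components=k, covariance_type=cov_type,
                                 random_state=random_state).fit(X)
            bic = gm.bic(X)  # scikit-learn convention: lower is better
            if bic < best_bic:
                best_model, best_bic = gm, bic
    return best_model, best_bic


# Toy data (an assumption for this sketch): two well-separated 2-D blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(8.0, 1.0, size=(200, 2))])

best_model, best_bic = select_gmm_by_bic(X)
print(best_model.n_components, best_model.covariance_type)
```

Swapping `gm.bic(X)` for `gm.aic(X)` would give the AIC-based variant.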

Additional context

mclust is an R package for GM modeling. The original publication and the most recent version have been cited in 2703 and 956 articles, respectively (Banfield & Raftery, 1993; Scrucca et al., 2016). It incorporates different initialization strategies (including agglomerative clustering) for the EM algorithm and enables automatic model selection via BIC across different combinations of clustering options (Scrucca et al., 2016).
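
For reference, the two criteria in their common lower-is-better form, where $k$ is the number of free parameters, $n$ the number of samples, and $\hat{L}$ the maximized likelihood (symbols chosen for this note, not taken from the issue):

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L},
\qquad
\mathrm{BIC} = k\ln n - 2\ln\hat{L}
```

This is the sign convention used by scikit-learn's `GaussianMixture.aic` and `GaussianMixture.bic`. mclust reports BIC with the opposite sign, so larger values are better there, which is worth keeping in mind when comparing numbers across the two packages.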

Top GitHub Comments

bdpedigo commented, Feb 11, 2021

Just to briefly clarify the mclust algorithm, loosely speaking (what we are proposing to implement here):

1. Run agglomerative clustering (with different options for linkage, affinity, etc.) to generate a set of initial labelings. The same run of agglomerative clustering can be reused for various values of n_components.
2. Fit GaussianMixtures using the various parameters. Roughly, this amounts to sweeping over {initializations} x {n_components} x {covariance types}.
3. Choose the best model based on BIC computed on the whole dataset.
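
The point about reusing one agglomerative run can be sketched with SciPy's hierarchical clustering routines: the merge tree is built once and then cut at several levels. This is only an illustration under assumed settings (Ward linkage, toy data), not a description of mclust's actual implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy data (an assumption for this sketch): three 2-D blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(100, 2)),
               rng.normal([6.0, 6.0], 1.0, size=(100, 2)),
               rng.normal([0.0, 6.0], 1.0, size=(100, 2))])

# One agglomerative run builds the full merge tree.
Z = linkage(X, method="ward")

# Cut the same tree at several levels: one flat labeling per candidate
# n_components. fcluster labels start at 1, so shift them to start at 0.
labelings = {k: fcluster(Z, t=k, criterion="maxclust") - 1
             for k in range(1, 6)}

for k, labels in labelings.items():
    print(k, np.bincount(labels))
```

Each labeling could then seed one GaussianMixture fit per covariance type.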

As far as we can tell, the above isn’t trivially accomplished with GridSearchCV for a few reasons (some of which were already mentioned above, but just repeating here for clarity):

Running agglomerative clustering with multiple different settings, then extracting the appropriate “flat” clustering for the right value of n_components is not hard, but does take a bit of code.
Computing the initial parameters given these clusterings also takes a bit of code, as GaussianMixture currently can only be initialized by the means, precisions, and weights and not by the responsibilities (e.g. cluster labels for each point like agglomerative gives us).
There is no cross-validation involved, meaning one would have to use the “dummy” cross-validation solution described above.
There are also some details about how mclust handles the covariance regularization which don’t lend themselves to naive grid search easily.
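
To make the second point concrete, here is a minimal sketch of converting a hard labeling into the `weights_init`, `means_init`, and `precisions_init` arguments that `GaussianMixture` accepts. The helper `init_params_from_labels` and its `reg_covar` ridge are assumptions made for this example; the ridge is a simple stand-in and is not how mclust regularizes covariances.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def init_params_from_labels(X, labels, reg_covar=1e-6):
    """Hypothetical helper: labels -> weights, means, full precisions."""
    classes = np.unique(labels)
    n, d = X.shape
    weights = np.empty(len(classes))
    means = np.empty((len(classes), d))
    precisions = np.empty((len(classes), d, d))
    for i, c in enumerate(classes):
        Xc = X[labels == c]
        weights[i] = len(Xc) / n
        means[i] = Xc.mean(axis=0)
        # Maximum-likelihood covariance plus a small ridge so that the
        # matrix stays invertible even for very small clusters.
        diff = Xc - means[i]
        cov = diff.T @ diff / len(Xc) + reg_covar * np.eye(d)
        precisions[i] = np.linalg.inv(cov)
    return weights, means, precisions


# Toy data and labels standing in for an agglomerative clustering result.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(150, 2)),
               rng.normal(7.0, 1.0, size=(150, 2))])
labels = np.repeat([0, 1], 150)

weights, means, precisions = init_params_from_labels(X, labels)
gm = GaussianMixture(n_components=2, covariance_type="full",
                     weights_init=weights, means_init=means,
                     precisions_init=precisions).fit(X)
print(gm.converged_, gm.bic(X))
```

Other covariance types would need `precisions_init` in the shape that type expects, which is part of the extra code the comment refers to.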

We are more than happy to discuss the details of how best to implement the above, should it be desired in sklearn. We think this functionality is (1) useful, given the success of mclust (for instance, mclust had 168k downloads last month, plus the >3600 citations mentioned above), and (2) not currently easy to run in sklearn with existing tools like GridSearchCV, for all of the reasons above. While it wouldn’t be impossible for a user to do it that way, there are enough steps involved (and the user would need to be fairly familiar with mclust already) that we thought a dedicated class wrapping up all of the above would be convenient and useful for the community.

NicolasHug commented, Feb 9, 2021

Happy to help! If the suggested snippet above suits your needs, perhaps we can close the issue?
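
The snippet referred to in this comment is not reproduced on this page. As an assumption about what a GridSearchCV-based approach with "dummy" cross-validation could look like, here is one possible sketch: a single split whose train and test sets are both the full dataset, scored by negative BIC. The scorer name `neg_bic` and the toy data are made up for this example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV


def neg_bic(estimator, X, y=None):
    """Scorer for GridSearchCV: it maximizes scores, so negate the BIC."""
    return -estimator.bic(X)


# Toy data (an assumption for this sketch): two well-separated 2-D blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(8.0, 1.0, size=(200, 2))])

# "Dummy" cross-validation: one split, train and test are both all of X.
all_idx = np.arange(len(X))
search = GridSearchCV(
    GaussianMixture(random_state=0),
    param_grid={
        "n_components": [1, 2, 3, 4],
        "covariance_type": ["spherical", "tied", "diag", "full"],
    },
    scoring=neg_bic,
    cv=[(all_idx, all_idx)],
).fit(X)

print(search.best_params_)
```

This covers the sweep over n_components and covariance types, but not the agglomerative initialization discussed in the other comment.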
