
Gaussian Mixture with BIC/AIC

See original GitHub issue

Describe the workflow you want to enable

Clustering with Gaussian mixture modeling frequently entails choosing the best model parameters, such as the number of components and the covariance constraint. This demonstration is very helpful to me, but I think it might be great to have a class like LassoLarsIC that does the job automatically.

Describe your proposed solution

Add a class (say GaussianMixtureIC, for example) that automatically selects the best GM model based on BIC or AIC among a set of models. As mentioned above, the set of models would be parameterized by:

  • Initialization scheme, which could be random, k-means or agglomerative clusterings (as done in mclust, see below)
  • Covariance constraints
  • Number of components
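The selection the proposal describes can already be done by hand with the public GaussianMixture API. A minimal sketch of that manual sweep (the toy data and parameter grid here are illustrative, not part of the proposal):

```python
# Manual BIC sweep over n_components x covariance_type, picking the
# lowest-BIC model -- the job the proposed class would automate.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Toy data: two well-separated Gaussian blobs in 2D.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

best_bic, best_gmm = np.inf, None
for covariance_type in ("full", "tied", "diag", "spherical"):
    for n_components in range(1, 5):
        gmm = GaussianMixture(
            n_components=n_components,
            covariance_type=covariance_type,
            random_state=0,
        ).fit(X)
        bic = gmm.bic(X)  # lower BIC is better
        if bic < best_bic:
            best_bic, best_gmm = bic, gmm

print(best_gmm.n_components, best_gmm.covariance_type)
```

Note this sweep only varies the model, not the initialization scheme; the agglomerative initializations from mclust are the part that takes extra code.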

Additional context

mclust is an R package for GM modeling. The original publication and the most recent version have been cited in 2703 and 956 articles, respectively (Banfield & Raftery, 1993; Scrucca et al., 2016). It incorporates different initialization strategies (including agglomerative clustering) for the EM algorithm and enables automatic model selection via BIC for different combinations of clustering options (Scrucca et al., 2016).

Issue Analytics

  • State: open
  • Created 3 years ago
  • Reactions: 6
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

2 reactions
bdpedigo commented, Feb 11, 2021

Just to briefly clarify the mclust algorithm, loosely speaking (this is what we are proposing to implement here):

  1. Run agglomerative clustering (with different options for linkage, affinity, etc.) to generate a set of initial labelings. The same run of agglomerative clustering can be reused for various levels of n_components.
  2. Fit GaussianMixtures using the various different parameters. Roughly, this amounts to sweeping over {initializations} x {n_components} x {covariance types}.
  3. Choose the best model based on BIC on the whole dataset.
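Step 1 above can be sketched with scipy's hierarchical clustering as a stand-in for sklearn's AgglomerativeClustering (an assumption for illustration): one linkage tree is built per setting, then cut at several levels, so the same run serves multiple values of n_components.

```python
# Sketch of step 1: build one linkage tree per agglomerative setting,
# then cut it at several n_components levels without re-running clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

init_labelings = {}
for method in ("ward", "average", "complete"):
    Z = linkage(X, method=method)  # one tree per linkage option
    for n_components in (1, 2, 3):
        labels = cut_tree(Z, n_clusters=n_components).ravel()
        init_labelings[(method, n_components)] = labels

# Each labeling would seed one GaussianMixture fit in step 2.
print(len(init_labelings))
```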

As far as we can tell, the above isn’t trivially accomplished with GridSearchCV for a few reasons (some of which were already mentioned above, but just repeating here for clarity):

  • Running agglomerative clustering with multiple different settings, then extracting the appropriate “flat” clustering for the right value of n_components is not hard, but does take a bit of code.
  • Computing the initial parameters given these clusterings also takes a bit of code, as GaussianMixture currently can only be initialized by the means, precisions, and weights and not by the responsibilities (e.g. cluster labels for each point like agglomerative gives us).
  • There is no cross-validation involved, meaning one would have to use the “dummy” cross-validation solution described above.
  • There are also some details about how mclust handles the covariance regularization which don’t lend themselves to naive grid search easily.
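The second bullet above (turning a "hard" labeling into initial GMM parameters) can be sketched as follows; `params_from_labels` is a hypothetical helper, not existing sklearn API, and the ridge regularizer is an assumption to keep the covariances invertible:

```python
# Hypothetical helper: convert cluster labels into the weights/means/
# precisions that GaussianMixture's *_init parameters accept.
import numpy as np
from sklearn.mixture import GaussianMixture

def params_from_labels(X, labels, reg=1e-6):
    """Per-cluster weights, means, and full-covariance precisions."""
    n_components = len(np.unique(labels))
    weights = np.bincount(labels) / len(labels)
    means = np.array([X[labels == k].mean(axis=0) for k in range(n_components)])
    precisions = np.array([
        np.linalg.inv(np.cov(X[labels == k], rowvar=False)
                      + reg * np.eye(X.shape[1]))
        for k in range(n_components)
    ])
    return weights, means, precisions

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
labels = (X[:, 0] > 3).astype(int)  # stand-in for an agglomerative labeling

weights, means, precisions = params_from_labels(X, labels)
gmm = GaussianMixture(
    n_components=2,
    covariance_type="full",
    weights_init=weights,
    means_init=means,
    precisions_init=precisions,
).fit(X)
print(gmm.converged_)
```

This is the "bit of code" the bullet refers to; it also has to be adapted per covariance type, since precisions_init expects a different shape for each.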

We are more than happy to talk about the details of how best to implement the above, should it be desired in sklearn. We do think that the functionality above is (1) useful, given the success of mclust (for instance, mclust had 168k downloads last month, plus the >3600 citations mentioned above), and (2) not currently easy to run in sklearn with the given tools like GridSearchCV, for all of the reasons above. While it wouldn’t be impossible for a user to do it that way, there are enough steps involved (and it would require the user to be pretty familiar with mclust already) that we thought a specific class wrapping up all of the above would be convenient and useful for the community.

0 reactions
NicolasHug commented, Feb 9, 2021

Happy to help! If the suggested snippet above suits your needs, perhaps we can close the issue?


