Gaussian Mixture with BIC/AIC
Describe the workflow you want to enable
Clustering with Gaussian mixture modeling frequently entails choosing the best model parameters, such as the number of components and the covariance constraint. This demonstration is very helpful to me, but I think it would be great to have a class like `LassoLarsIC` that does the job automatically.
Describe your proposed solution
Add a class (say `GaussianMixtureIC`, for example) that automatically selects the best GM model based on BIC or AIC among a set of models. As mentioned above, the set of models would be parameterized by:
- Initialization scheme, which could be random, k-means, or agglomerative clustering (as done in `mclust`, see below)
- Covariance constraints
- Number of components
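For illustration, here is a minimal sketch of the kind of sweep such a class might wrap, using only the existing `GaussianMixture` API (`bic`, `aic`, `covariance_type`). The helper name `select_gmm_by_ic` and its parameters are hypothetical and are not part of scikit-learn; the sketch covers only the number of components and the covariance constraint, not the initialization scheme.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


# Hypothetical helper (not a scikit-learn API): fit one model per
# (covariance_type, n_components) pair and keep the lowest criterion value.
def select_gmm_by_ic(X, n_components_range=range(1, 7),
                     covariance_types=("spherical", "tied", "diag", "full"),
                     criterion="bic", random_state=0):
    best_model, best_params, best_score = None, None, np.inf
    for covariance_type in covariance_types:
        for n_components in n_components_range:
            model = GaussianMixture(
                n_components=n_components,
                covariance_type=covariance_type,
                random_state=random_state,
            ).fit(X)
            # Lower BIC/AIC is better in scikit-learn's convention.
            score = model.bic(X) if criterion == "bic" else model.aic(X)
            if score < best_score:
                best_model, best_score = model, score
                best_params = (n_components, covariance_type)
    return best_model, best_params, best_score


# Toy data: two well-separated blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=1.0, size=(200, 2)),
    rng.normal(loc=(10.0, 10.0), scale=1.0, size=(200, 2)),
])
best_model, best_params, best_score = select_gmm_by_ic(X)
print(best_params, best_score)
```

The exhaustive double loop is the simplest possible design; a dedicated class could additionally expose the full table of criterion values for inspection.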
Additional context
`mclust` is a package in R for GM modeling. The original publication and the most recent version have been cited in 2703 and 956 articles, respectively (Banfield & Raftery, 1993; Scrucca et al., 2016). It incorporates different initialization strategies (including agglomerative clustering) for the EM algorithm and enables automatic model selection via BIC across different combinations of clustering options (Scrucca et al., 2016).
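As a rough illustration of agglomerative initialization in scikit-learn terms, the sketch below converts hard cluster labels into the `weights_init`, `means_init`, and `precisions_init` arguments that `GaussianMixture` accepts. It is not mclust's procedure: it assumes Ward-linkage `AgglomerativeClustering` as the source of labels and a full covariance model, and the helper name `fit_gmm_from_labels` is hypothetical.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.mixture import GaussianMixture


# Hypothetical helper (not a scikit-learn API): seed EM from hard labels
# by computing per-cluster weights, means, and precision matrices.
def fit_gmm_from_labels(X, labels, reg_covar=1e-6):
    classes = np.unique(labels)
    n_features = X.shape[1]
    # Cluster proportions; these sum to 1 by construction.
    weights = np.array([np.mean(labels == k) for k in classes])
    means = np.array([X[labels == k].mean(axis=0) for k in classes])
    precisions = []
    for k in classes:
        Xk = X[labels == k]
        # Maximum-likelihood covariance, regularized so it is invertible.
        cov = np.atleast_2d(np.cov(Xk, rowvar=False, bias=True))
        cov = cov + reg_covar * np.eye(n_features)
        precisions.append(np.linalg.inv(cov))
    gmm = GaussianMixture(
        n_components=len(classes),
        covariance_type="full",
        weights_init=weights,
        means_init=means,
        precisions_init=np.array(precisions),
        reg_covar=reg_covar,
    )
    return gmm.fit(X)


# Toy data: two well-separated blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=1.0, size=(200, 2)),
    rng.normal(loc=(10.0, 10.0), scale=1.0, size=(200, 2)),
])
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)
gmm = fit_gmm_from_labels(X, labels)
print(gmm.converged_, gmm.bic(X))
```

Because the components are ordered by sorted label value, component `i` of the fitted model starts from the cluster with the `i`-th smallest label.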
Top GitHub Comments
Just to briefly clarify the mclust algorithm, loosely speaking (what we are proposing to implement here):
- […] `n_components`.
- […] `GaussianMixture`s using the various different parameters. Roughly, this amounts to sweeping over {initializations} x {n_components} x {covariance types}.
As far as we can tell, the above isn’t trivially accomplished with `GridSearchCV` for a few reasons (some of which were already mentioned above, but just repeating here for clarity):
- […] `n_components` is not hard, but does take a bit of code.
- `GaussianMixture` currently can only be initialized by the means, precisions, and weights and not by the responsibilities (e.g. cluster labels for each point like agglomerative gives us).
We are more than happy to talk about details of how to best implement the above, should it be desired in sklearn. We do think that the functionality above is (1) useful, given the success of mclust (for instance mclust had 168k downloads last month, and the >3600 citations mentioned above), and (2) not currently easy to run in sklearn with the given tools like `GridSearchCV`, given all of the reasons above. While it wouldn’t be impossible for a user to do it that way, there are enough steps involved (and would require the user to be pretty familiar with mclust already) that we thought a specific class to wrap up all of the above would be convenient and useful for the community.
Happy to help! If the suggested snippet above suits your needs, perhaps we can close the issue?