RFC introduce methods to get and set estimators' state
Right now clone uses {get, set}_params to replicate an unfit estimator. These methods are designed to return estimators’ hyperparameters. At the moment, we have no way of getting the state of a fitted estimator in a non-pickle format.
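To make the distinction concrete, here is a minimal sketch of what clone and get_params do and do not carry over. The toy data and the choice of LogisticRegression are only for illustration; they are not part of the original report.

```python
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

# Tiny illustrative dataset (an assumption, chosen only to make fit() work).
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

fitted = LogisticRegression(C=0.5).fit(X, y)

# clone() builds a new estimator from get_params() only.
copy = clone(fitted)

print(copy.get_params()["C"])    # 0.5: hyperparameters are carried over
print(hasattr(fitted, "coef_"))  # True: the original holds fitted state
print(hasattr(copy, "coef_"))    # False: the clone is unfit
```

The fitted attributes (coef_, intercept_ and so on) are exactly the part that get_params does not expose.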
Pickle files are by design able to run arbitrary code, and therefore one should ideally only load a pickle file from a trusted source. This makes sharing and moving scikit-learn based estimators hard, which also introduces security issues when deploying ML models in production.
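As a harmless illustration of why loading an untrusted pickle is risky, the sketch below uses the standard __reduce__ hook to make pickle.loads evaluate an expression. The Payload class is hypothetical; a real attack would call something far less innocent than eval on arithmetic.

```python
import pickle


class Payload:
    """Hypothetical object whose pickle runs code when loaded."""

    def __reduce__(self):
        # Tells pickle: "to rebuild me, call eval('6 * 7')".
        # The callable and its arguments are chosen by whoever wrote the file.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading does not return a Payload; it returns whatever the callable returned.
result = pickle.loads(blob)
print(result)  # 42
```

Nothing in the loading code opted into running eval; the pickle stream itself decided that.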
Another issue with pickle files is that we kinda force people to use the same versions of the libraries they used to train the model and dump the pickle. This prevents people from updating their base Docker images when they’re deploying a model which was trained a while ago, and I’m not sure if we have good ways of letting them update their pickle files for a new version.
My proposal is to introduce {get, set}_state methods at the BaseEstimator level, to be able to persist and set the state of models in a more portable, secure, and backward compatible way. We can probably even just do JSON.
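A rough sketch of what such an API could look like, using only the standard library. Everything here is hypothetical: StatefulMixin, the layout of the state dict, and the toy ShrunkenMean estimator are assumptions made for illustration, not an existing or planned scikit-learn interface.

```python
import json


class StatefulMixin:
    """Hypothetical mixin; scikit-learn does not provide this."""

    def get_state(self):
        # Fitted attributes follow the trailing-underscore convention.
        fitted = {
            name: value
            for name, value in vars(self).items()
            if name.endswith("_") and not name.startswith("_")
        }
        return {
            "class": type(self).__name__,
            "params": self.get_params(),
            "fitted": fitted,
        }

    def set_state(self, state):
        if state["class"] != type(self).__name__:
            raise ValueError("state was produced by a different estimator class")
        self.set_params(**state["params"])
        for name, value in state["fitted"].items():
            setattr(self, name, value)
        return self


class ShrunkenMean(StatefulMixin):
    """Toy estimator: always predicts a shrunken mean of the training targets."""

    def __init__(self, shrinkage=0.0):
        self.shrinkage = shrinkage

    def get_params(self):
        return {"shrinkage": self.shrinkage}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        self.mean_ = (1.0 - self.shrinkage) * sum(y) / len(y)
        self.n_samples_ = len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


model = ShrunkenMean(shrinkage=0.5).fit([[0], [1], [2]], [1.0, 2.0, 3.0])

# The state is plain data, so it survives a JSON round trip.
payload = json.dumps(model.get_state())
restored = ShrunkenMean().set_state(json.loads(payload))

print(restored.predict([[5]]))  # [1.0]
```

Because json.loads only ever produces plain data, loading the payload cannot trigger code execution the way unpickling can. Real estimators hold NumPy arrays and nested estimators, so an actual implementation would also need a conversion step for those.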
Issue Analytics
- Created 2 years ago
- Reactions: 4
- Comments: 17 (15 by maintainers)
Top GitHub Comments
I share @thomasjpfan’s concerns about support for backward-compatibility.
Also, (and I’m not good at asking subtle questions): is HuggingFace planning on building a scikit-learn model hub?
Another use case for sklearn-to-sklearn persistence that is not vulnerable to arbitrary code injection à la pickle would be a public model auditing service. You would upload a trained scikit-learn pipeline and be able to run any Python-based auditing tools on it, based on either scikit-learn’s own inspection tools or third-party scikit-learn compatible tools such as SHAP, FACET, ELI5, interpretml, fairlearn and so on.