[RFC] Feedback after scikit-learn module privatization

I want to give some feedback on the scikit-learn module privatization, which we need to handle in imbalanced-learn.

Imports triggering some errors in imbalanced-learn

Here is the list of imports from scikit-learn that are currently failing:

  • from sklearn.ensemble.base import _set_random_states
  • from sklearn.ensemble.forest import _parallel_build_trees
  • from sklearn.metrics.classification import _check_targets
  • from sklearn.utils.testing import set_random_state -> not defined in __all__
  • from sklearn.utils.testing import assert_allclose_dense_sparse -> not defined in __all__
  • from sklearn.utils.testing import assert_no_warnings -> not defined in __all__

One could argue that the imports with a leading underscore are the contrib packages' own problem, but the failures with the testing module are more problematic. These imports fail because the functions are not listed in __all__ in _testing.py and are therefore not re-exported by testing.py, which calls from _testing import *.
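The star-import behaviour described above can be reproduced in isolation. The following is a minimal sketch, not scikit-learn code: the module names `_demo_testing` and `demo_testing` and the two functions are hypothetical stand-ins built in memory for _testing.py and testing.py.

```python
import sys
import types

# Hypothetical stand-in for a private module such as _testing.py.
private = types.ModuleType("_demo_testing")
exec(
    """
__all__ = ["assert_listed"]

def assert_listed():
    return "listed"

def assert_unlisted():
    return "unlisted"
""",
    private.__dict__,
)
sys.modules["_demo_testing"] = private

# Hypothetical stand-in for the public shim (testing.py) doing a star-import.
shim = types.ModuleType("demo_testing")
exec("from _demo_testing import *", shim.__dict__)

# Only names listed in __all__ are copied by the star-import.
print(hasattr(shim, "assert_listed"))    # True
print(hasattr(shim, "assert_unlisted"))  # False
```

Without the `__all__` definition, both functions would be re-exported, since neither name starts with an underscore.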

So for the previous imports, we might have 2 solutions:

  1. add the functions to __all__
  2. use the module-level __getattr__ from PEP 562, so that scikit-learn can still resolve the import at the old location instead of raising an ImportError.

Now, I want to emphasize why a package might use a function with a leading underscore. In imbalanced-learn, we implemented BalancedRandomForest, which takes some components from RandomForest. Basically, we need to resample the data and call _parallel_build_trees. As a developer, it makes no sense to reimplement this in our own codebase.

We can easily update master, release at the same time as scikit-learn, and catch the error if there is one. However, this might not be the case for other contrib packages.

Update for the contrib packages

On the contrib side, master can easily be updated by fixing the import. However, problems will start when people update scikit-learn and a contrib package has not made a follow-up release, or when a user updates scikit-learn without updating the contrib project. In short, there is no easy fix on the contrib side. Backward compatibility should be preserved in scikit-learn.

Conclusion

So I think that this issue is a blocker for the release. We should agree on the extent to which it is acceptable to break backward compatibility.

I think it is worth mentioning that we need to advertise the 0.22 RC to packages that extend scikit-learn, to get these changes right and not upset people who rely on scikit-learn. For instance, feedback from @rasbt (mlxtend) and from other developers of scikit-learn-contrib packages would be great.

ping @scikit-learn/core-devs

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 12 (12 by maintainers)

Top GitHub Comments

rth commented, Oct 31, 2019 (1 reaction)

E.g., a solution could be that if a certain private function or method stops working in a new scikit-learn release, the contrib package maintainer could

The problem here is more that some import paths that were public (but undocumented in the API reference) could have been broken. I think that if we can fix this with minimal effort in scikit-learn, we should. The problem affects not only contrib project maintainers: users may also hit errors once they update scikit-learn (and use contrib packages), which is not a great user experience.

In general, it would be great to have the next 0.22 release candidate widely tested.

rasbt commented, Oct 31, 2019 (1 reaction)

Good points, and thanks for thinking of us contrib-package developers 😃. I would not worry about these private modules too much as a blocker; they are clearly marked as private and come with a “use with care” implication. Developers of external/contrib packages are usually pretty code-savvy (compared to the majority of scikit-learn users) and could address issues related to changes in these via a “packagename/externals” submodule.

E.g., a solution could be that if a certain private function or method stops working in a new scikit-learn release, the contrib package maintainer could

  • adjust the code in the contrib version to work with the new version of the private function or method
  • move the old version of the private function/method into a “packagename/externals” submodule of the contrib package
  • handle multiple scikit-learn version compatibility via “if version <= … then; else” statements for the next 1-2 years until everyone has upgraded (and/or add a warning)
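A contrib package could implement the compatibility handling from the list above along these lines. This is only a sketch under stated assumptions: `import_first` is a hypothetical helper, and it probes import paths with try/except instead of comparing version numbers as the comment suggests. The demonstration uses the standard-library `math` module so that it runs without scikit-learn installed.

```python
import importlib


def import_first(attr, module_paths):
    """Return `attr` from the first module in `module_paths` that provides it."""
    errors = []
    for path in module_paths:
        try:
            return getattr(importlib.import_module(path), attr)
        except (ImportError, AttributeError) as exc:
            errors.append(f"{path}: {exc}")
    raise ImportError(f"cannot import {attr!r}; tried " + "; ".join(errors))


# Demonstration with the standard library: the first path does not exist,
# so the helper falls back to the second one.
sqrt = import_first("sqrt", ["no_such_module_for_demo", "math"])
print(sqrt(9.0))  # 3.0

# In a contrib package the call might look like this. The module paths are
# assumptions, and "mypackage.externals._forest" is a hypothetical vendored copy:
# _parallel_build_trees = import_first(
#     "_parallel_build_trees",
#     ["sklearn.ensemble._forest",
#      "sklearn.ensemble.forest",
#      "mypackage.externals._forest"],
# )
```

Trying the newest location first avoids touching a deprecated path when the new one is available, and the vendored copy in the externals submodule only comes into play when both scikit-learn locations fail.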