`_check_feature_names` raises UserWarning when accessing bagged estimators
Describe the bug

Calling `predict` on the individual estimators inside a fitted `BaggingRegressor` raises

```
UserWarning: X has feature names, but DecisionTreeRegressor was fitted without feature names
```

coming from `_check_feature_names`, even though the `BaggingRegressor` itself (and its base estimator, here `DecisionTreeRegressor`) takes the feature names into account while fitting.
Steps/Code to Reproduce
```python
from sklearn.ensemble import BaggingRegressor
import pandas as pd

df = pd.DataFrame(
    {
        "feature_name": [-12.32, 1.43, 30.01, 22.17],
        "target": [72, 55, 32, 43],
    }
)
X = df[["feature_name"]]
y = df["target"]

bagged_trees = BaggingRegressor()
bagged_trees.fit(X, y)

bagged_trees_predictions = bagged_trees.predict(X)  # raises no warning
bagged_trees.estimators_[0].predict(X)  # raises UserWarning
```
Expected Results
No warning should be thrown
Actual Results

```
/home/arturoamor/miniforge3/envs/scikit-learn-course/lib/python3.9/site-packages/sklearn/base.py:438: UserWarning: X has feature names, but DecisionTreeRegressor was fitted without feature names
  warnings.warn(

array([72., 55., 32., 32.])
```
Versions

```
System:
    python: 3.9.5 | packaged by conda-forge | (default, Jun 19 2021, 00:32:32) [GCC 9.3.0]
executable: /home/arturoamor/miniforge3/envs/scikit-learn-course/bin/python
   machine: Linux-5.13.0-1017-oem-x86_64-with-glibc2.31

Python dependencies:
          pip: 21.1.3
   setuptools: 49.6.0.post20210108
      sklearn: 1.0.1
        numpy: 1.21.0
        scipy: 1.7.0
       Cython: None
       pandas: 1.3.0
   matplotlib: 3.4.2
       joblib: 1.0.1
threadpoolctl: 2.1.0
```
Issue Analytics

- State:
- Created: 2 years ago
- Comments: 10 (10 by maintainers)
Top GitHub Comments
And now to 1.3 😃
I mentioned the problem to @ogrisel IRL, and he noted that this is still a bit annoying. It might not be a blocker for 1.0.2, though, but we could try to fix it.
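In the meantime, one way to sidestep the warning on the reproducer is to pass a plain ndarray to the inner estimator, since the sub-estimators were fitted on validated ndarray data without feature names. This is a minimal sketch, not a fix from the thread; `random_state=0` is added here only for determinism.

```python
import warnings

import pandas as pd
from sklearn.ensemble import BaggingRegressor

df = pd.DataFrame(
    {"feature_name": [-12.32, 1.43, 30.01, 22.17], "target": [72, 55, 32, 43]}
)
X, y = df[["feature_name"]], df["target"]

bagged_trees = BaggingRegressor(random_state=0).fit(X, y)

# The sub-estimators were fitted on a plain ndarray internally, so passing
# X.to_numpy() instead of the DataFrame avoids the feature-name
# consistency check in _check_feature_names entirely.
with warnings.catch_warnings():
    warnings.simplefilter("error", UserWarning)  # fail loudly if a warning fires
    preds = bagged_trees.estimators_[0].predict(X.to_numpy())
```

The trade-off is that you give up the name-consistency safety net, so the column order of the ndarray must match the order used at fit time.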
We have a couple of things to keep in mind here: `RandomForest` (or other ensemble methods) deliberately does not re-validate the data (e.g. for non-finite values) in each underlying estimator, because doing so per sub-estimator would be too costly. The data validation thus makes sense in the ensemble estimator indeed. However, we need to find a mechanism to work around the issue. One possibility would be to attach the metadata (`n_features_in_`, `feature_names_in_`, etc.) to each of the trees. We would also need to be careful with the bootstrap sampling so that the attached metadata still makes sense.
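The attach-the-metadata idea can be sketched by hand on the reproducer. Assuming `_check_feature_names` compares the incoming DataFrame columns against a `feature_names_in_` attribute (the convention in scikit-learn >= 1.0), copying that attribute from the ensemble onto each fitted tree silences the warning; `random_state=0` is added only for determinism.

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.ensemble import BaggingRegressor

df = pd.DataFrame(
    {"feature_name": [-12.32, 1.43, 30.01, 22.17], "target": [72, 55, 32, 43]}
)
X, y = df[["feature_name"]], df["target"]

bagged = BaggingRegressor(random_state=0).fit(X, y)

# Copy the fitted feature names onto each tree so that
# _check_feature_names sees matching names at predict time.
# Naive: with max_features < 1.0 each tree sees only a subset of the
# columns, so the names would have to be sliced per estimator.
for tree in bagged.estimators_:
    tree.feature_names_in_ = np.asarray(X.columns, dtype=object)

with warnings.catch_warnings():
    warnings.simplefilter("error", UserWarning)  # fail loudly if a warning fires
    preds = bagged.estimators_[0].predict(X)
```

This illustrates why the comment above flags the sampling: a real fix inside `BaggingRegressor` would have to account for per-estimator feature subsets, not just copy the ensemble-level names.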