AttributeError when using dask_ml.model_selection.kfold object
What happened:
I’m currently trying to create a pipeline for model training using LogisticRegression
and Nested cross-validation. I’ve got an unexpected AttributeError
during the pipeline execution.
Exception: AttributeError("'numpy.ndarray' object has no attribute 'chunks'")
What you expected to happen:
I wasn’t expecting that, since I double-checked that all the objects are dask.array instances. The following MWE shows what my pipeline looks like.
Minimal Complete Verifiable Example:
from typing import Tuple, Any
from dask.array.core import Array
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from dask_ml.model_selection import GridSearchCV, KFold
from dask_ml.linear_model import LogisticRegression
from dask_ml.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.base import is_classifier
from dask.distributed import Client, progress
import dask.array as da
import numpy as np
import joblib
from dask_ml.datasets import make_classification as dask_make_classification

# import warnings filter
from warnings import simplefilter

# ignore all future warnings
simplefilter(action="ignore", category=FutureWarning)


def fake_dataset() -> Tuple[Array, Array]:
    X, y = dask_make_classification(
        n_samples=1000,
        n_features=20,
        random_state=1,
        n_informative=10,
        n_redundant=10,
        chunks=1000 // 20,
    )
    return X, y


def train_model(X: Array, y: Array) -> None:
    n_outer_splits = 2
    n_inner_splits = 2
    param_grid = [
        {
            "classifier": [LogisticRegression()],
            "classifier__penalty": ["l1", "l2"],
            "classifier__C": np.logspace(-4, 4, 20),
            "classifier__solver": ["liblinear"],
        },
    ]
    # define the model
    pipeline = Pipeline([("classifier", LogisticRegression())])
    # XXX: check that is a proper model
    try:
        if not is_classifier(pipeline["classifier"]):
            raise Exception("Not valid classification algorithm")
    except Exception as e:
        print(f"Be aware of: {e}")
    finally:
        pass
    # set-up the nested cross-validation procedure
    cv_outer = KFold(n_splits=n_outer_splits, shuffle=True, random_state=1)
    # enumerate splits
    outer_results = list()
    for kth_fold, (train_ix, test_ix) in enumerate(cv_outer.split(X)):
        print(f"Running {kth_fold} Fold")
        # split data
        X_train, X_test = X[train_ix, :], X[test_ix, :]
        y_train, y_test = y[train_ix], y[test_ix]
        # setup inner cross-validation procedure
        cv_inner = KFold(n_splits=n_inner_splits, shuffle=True, random_state=1)
        # define search
        search = GridSearchCV(
            estimator=pipeline,
            param_grid=param_grid,
            scoring="accuracy",
            cv=cv_inner,
            refit=True,
        )
        with joblib.parallel_backend("dask"):
            result = search.fit(X_train, y_train)
    return None


if __name__ == "__main__":
    client = Client(
        processes=False, threads_per_worker=1, n_workers=4, memory_limit="10GB"
    )
    X, y = fake_dataset()
    train_model(X, y)
Anything else we need to know?:
Further debugging showed that the error comes from the fit operation; however, none of the attributes are np.ndarray objects.
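For readers unfamiliar with the message, here is a minimal sketch of what the exception is saying. It is not taken from the issue: the array shape and chunk sizes are arbitrary assumptions. A Dask array carries a chunks attribute, while the NumPy array produced by computing it does not, so code that expects chunks fails as soon as a Dask array has been converted somewhere along the way.

```python
import numpy as np
import dask.array as da

# A small Dask array (shape and chunking chosen only for illustration).
X = da.ones((100, 4), chunks=(25, 4))

# Computing it yields a plain NumPy array, which has no chunk metadata.
X_np = X.compute()

for name, obj in [("X", X), ("X_np", X_np)]:
    print(name, type(obj).__name__, "has chunks:", hasattr(obj, "chunks"))

# Accessing .chunks on the NumPy array reproduces the same kind of error.
try:
    X_np.chunks
except AttributeError as exc:
    print("AttributeError:", exc)
```

A check like hasattr(obj, "chunks") placed before and after each step of a pipeline is one way to find where the conversion happens.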
Environment:
- Dask version: 1.9.0
- Python version: 3.8
- Operating System: Ubuntu 20.04 LTS
- Install method (conda, pip, source): conda
Issue Analytics
- State:
- Created 2 years ago
- Comments:6 (2 by maintainers)
Top GitHub Comments
Thanks. You might want to try using the single-threaded scheduler (or at least not the distributed scheduler), which should give cleaner tracebacks.
I’m trying to narrow down where things are converted to numpy arrays.
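As a rough sketch of the suggestion above (the toy array is an assumption, not part of the issue), Dask's single-threaded scheduler can be selected either per call or for a whole block. Everything then runs in the calling thread, so exceptions surface with an ordinary Python traceback and a debugger can step into the failing task.

```python
import dask
import dask.array as da

# Toy data, used only to show how the scheduler is selected.
x = da.arange(10, chunks=5)

# Option 1: choose the single-threaded scheduler for one compute call.
total = (x ** 2).sum().compute(scheduler="synchronous")

# Option 2: choose it for every compute inside a block.
with dask.config.set(scheduler="synchronous"):
    mean = x.mean().compute()

print(total, mean)
```

Applied to the MWE, this would presumably also mean not creating the distributed Client and dropping the joblib.parallel_backend("dask") context while debugging.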
Hi everyone,
I have run the pytests. Some initially failed because dask.Array instances are mutable, unlike dask.Delayed, but this was easily fixed by calling id. However, I am unable to fix one of the failures.
The test_grid_search_dask_dataframe function repeatedly fails because “DataFrame.iloc only supports selecting columns. It must be used like df.iloc[:, column_indexer].” The issue arises when the _pandas_indexing function attempts to split the data frame along its rows into train and test parts.
In the past, this issue likely went undiscovered because, much like the implicit Dask to NumPy conversion leading to the initial issue, there is an implicit Dask to Pandas conversion. As such, if test_grid_search_dask_dataframe is run with a DataFrame whose total size exceeds the worker memory limit, then a loop of warnings and failures will occur.
@TomAugspurger, given your advanced experience with Dask, I think it would be better for you to decide how to handle this data frame issue. If you would like, I can push the changes that I have made so you can look at them.