AttributeError when using dask_ml.model_selection.kfold object

What happened: I'm trying to build a model-training pipeline using LogisticRegression and nested cross-validation, and I get an unexpected AttributeError during pipeline execution.

Exception: AttributeError("'numpy.ndarray' object has no attribute 'chunks'")

What you expected to happen: I wasn't expecting this, since I double-checked that all the objects are dask arrays. The following MWE shows what my pipeline looks like.

Minimal Complete Verifiable Example:

from typing import Tuple, Any
from dask.array.core import Array
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from dask_ml.model_selection import GridSearchCV, KFold
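# NOTE: this import shadows the scikit-learn LogisticRegression imported above,
# so every LogisticRegression() below refers to the dask_ml estimator.
from dask_ml.linear_model import LogisticRegression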
from dask_ml.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.base import is_classifier
from dask.distributed import Client, progress
import dask.array as da

import numpy as np
import joblib
from dask_ml.datasets import make_classification as dask_make_classification

# import warnings filter
from warnings import simplefilter

# ignore all future warnings
simplefilter(action="ignore", category=FutureWarning)


def fake_dataset() -> Tuple[Array, Array]:
    X, y = dask_make_classification(
        n_samples=1000,
        n_features=20,
        random_state=1,
        n_informative=10,
        n_redundant=10,
        chunks=1000 // 20,
    )
    return X, y


def train_model(X: Array, y: Array) -> None:
    n_outer_splits = 2
    n_inner_splits = 2
    param_grid = [
        {
            "classifier": [LogisticRegression()],
            "classifier__penalty": ["l1", "l2"],
            "classifier__C": np.logspace(-4, 4, 20),
            "classifier__solver": ["liblinear"],
        },
    ]
    # define the model
    pipeline = Pipeline([("classifier", LogisticRegression())])
    # sanity check: warn (without aborting) if the step is not a classifier
    if not is_classifier(pipeline["classifier"]):
        print("Be aware of: Not valid classification algorithm")
    # set-up the nested cross-validation procedure
    cv_outer = KFold(n_splits=n_outer_splits, shuffle=True, random_state=1)
    # enumerate splits
    outer_results = list()
    for kth_fold, (train_ix, test_ix) in enumerate(cv_outer.split(X)):
        print(f"Running {kth_fold} Fold")
        # split data
        X_train, X_test = X[train_ix, :], X[test_ix, :]
        y_train, y_test = y[train_ix], y[test_ix]
        # setup inner cross-validation procedure
        cv_inner = KFold(n_splits=n_inner_splits, shuffle=True, random_state=1)

        # define search
        search = GridSearchCV(
            estimator=pipeline,
            param_grid=param_grid,
            scoring="accuracy",
            cv=cv_inner,
            refit=True,
        )
        with joblib.parallel_backend("dask"):
            result = search.fit(X_train, y_train)

    return None


if __name__ == "__main__":
    client = Client(
        processes=False, threads_per_worker=1, n_workers=4, memory_limit="10GB"
    )

    X, y = fake_dataset()
    train_model(X, y)

Anything else we need to know?: Further debugging showed that the error comes from the fit operation; however, none of the attributes involved are np.ndarray objects.
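The `chunks` attribute named in the exception exists only on Dask arrays, so the error suggests the data was turned into NumPy somewhere before the splitter used it. The sketch below is not from the issue; it only illustrates that difference on a toy array, and the helper name `has_chunks` is made up for the example.

```python
import numpy as np
import dask.array as da


def has_chunks(arr) -> bool:
    """Hypothetical helper: True only for objects exposing a `chunks` attribute."""
    return hasattr(arr, "chunks")


# A lazy Dask array split into four row blocks of 25 rows each.
lazy = da.ones((100, 4), chunks=(25, 4))

# Computing materialises the data as a plain NumPy array.
eager = lazy.compute()

print(type(lazy).__name__, has_chunks(lazy))    # the Dask array keeps its chunk layout
print(type(eager).__name__, has_chunks(eager))  # the NumPy array has no `chunks`

# Accessing `.chunks` on the NumPy result raises the same kind of error as in the report.
try:
    eager.chunks
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

A check like this, dropped in just before the failing call, can help narrow down which step receives a NumPy array instead of a Dask one.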

Environment:

Dask version: 1.9.0
Python version: 3.8
Operating System: Ubuntu 20.04 LTS
Install method (conda, pip, source): conda

Issue Analytics

State:
Created 2 years ago
Comments: 6 (2 by maintainers)

Top GitHub Comments

TomAugspurger commented, Aug 19, 2021

Thanks. You might want to try the single-threaded scheduler (or at least avoid the distributed scheduler), which should give cleaner tracebacks.

I’m trying to narrow down where things are converted to numpy arrays.
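As a rough illustration of the single-threaded scheduler suggestion, the sketch below runs a toy computation with Dask's `"synchronous"` scheduler, once per call and once through `dask.config.set`. The array and variable names are made up for the example and are not part of the issue.

```python
import dask
import dask.array as da

# Toy array with the same overall shape as the MWE's dataset (1000 x 20).
x = da.ones((1000, 20), chunks=(50, 20))

# Option 1: choose the scheduler for a single compute call.
total = x.sum().compute(scheduler="synchronous")

# Option 2: set it for everything computed inside a block.
with dask.config.set(scheduler="synchronous"):
    mean = x.mean().compute()

print(total, mean)
```

In the MWE this would presumably also mean not creating the `Client` and dropping the `joblib.parallel_backend("dask")` block, since an active distributed client otherwise becomes the default scheduler. dask-ml's `GridSearchCV` also documents a `scheduler` argument, which may be the more direct switch here.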

On Aug 19, 2021, at 8:45 AM, Miguel Ángel Cárdenas @.***> wrote:

Sure thing @TomAugspurger

Running 0 Fold
/home/makquel/anaconda3/envs/model_professor_2/lib/python3.8/site-packages/dask/array/slicing.py:1080: PerformanceWarning: Increasing number of chunks by factor of 14
  p = blockwise(
distributed.worker - WARNING - Compute Failed
Function: cv_split
args: (KFold(n_splits=2, random_state=1, shuffle=True), array([[ 0.19495857, -0.8850293 , -0.40491901, …, 1.33906018, -1.25807389, 0.04787755],
       [ 0.25033412, 0.62342772, 0.43649497, …, -0.18664019, 2.01090793, -0.33272972],
       [ 0.07370212, -0.85365332, 1.06964283, …, -0.30508485, -1.17534124, 1.20492622],
       …,
       [ 1.38853841, 1.56361377, 0.34933573, …, -0.69068144, -1.10318088, 0.26826262],
       [ 0.7677924 , 0.19290652, 1.48708061, …, 0.53962508, -1.22606228, 0.11060196],
       [-0.65174484, 1.41971851, -0.00285028, …, -1.24113283, -1.30192431, -1.98192271]]), array([0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1
kwargs: {}
Exception: AttributeError("'numpy.ndarray' object has no attribute 'chunks'")

('score-441cbc92926e2852d676f4d612a0c376', 1, 1) has failed… retrying
distributed.worker - WARNING - Compute Failed
Function: cv_split
args: (KFold(n_splits=2, random_state=1, shuffle=True), array([[ 0.19495857, -0.8850293 , -0.40491901, …, 1.33906018, -1.25807389, 0.04787755],
       [ 0.25033412, 0.62342772, 0.43649497, …, -0.18664019, 2.01090793, -0.33272972],
       [ 0.07370212, -0.85365332, 1.06964283, …, -0.30508485, -1.17534124, 1.20492622],
       …,
       [ 1.38853841, 1.56361377, 0.34933573, …, -0.69068144, -1.10318088, 0.26826262],
       [ 0.7677924 , 0.19290652, 1.48708061, …, 0.53962508, -1.22606228, 0.11060196],
       [-0.65174484, 1.41971851, -0.00285028, …, -1.24113283, -1.30192431, -1.98192271]]), array([0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1
kwargs: {}
Exception: AttributeError("'numpy.ndarray' object has no attribute 'chunks'")

Would that be enough to diagnose?

ChadGueli commented, Apr 7, 2022

Hi everyone,

I have run the pytest suite. Some tests initially failed because dask.Array instances are mutable, unlike dask.Delayed, but this was easily fixed by calling id. However, I am unable to fix one of the failures.

The test_grid_search_dask_dataframe function repeatedly fails because “DataFrame.iloc only supports selecting columns. It must be used like df.iloc[:, column_indexer].” The issue arises when the _pandas_indexing function attempts to split the data frame along its rows into train and test parts.

In the past, this issue likely went undiscovered because, much like the implicit Dask-to-NumPy conversion behind the initial issue, there is an implicit Dask-to-pandas conversion. As a result, if test_grid_search_dask_dataframe is run with a DataFrame whose total size exceeds the worker memory limit, a loop of warnings and failures will occur.

@TomAugspurger, given your extensive experience with Dask, I think it would be better for you to decide how to handle this data frame issue. If you would like, I can push the changes I have made so you can look at them.
