Sklearn pipeline and cross_val_score don't work for some transformers
See original GitHub issue

Hi,
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from dask_ml.decomposition import PCA
from dask_ml.wrappers import ParallelPostFit
from dask_ml.preprocessing import StandardScaler

clf = ParallelPostFit(estimator=GradientBoostingClassifier(), scoring='accuracy')

# dataset is a pandas DataFrame with the label in the last column
pipe = make_pipeline(PCA(), clf)             # fails
pipe = make_pipeline(StandardScaler(), clf)  # works

mysc = cross_val_score(pipe, dataset.iloc[:, :-1], dataset.iloc[:, -1])
This works for pipe = make_pipeline(StandardScaler(), clf), but not for pipe = make_pipeline(PCA(), clf), which raises:

AttributeError: 'DataFrame' object has no attribute 'chunks'

If I use RobustScaler() instead of StandardScaler(), I get:

AttributeError: 'int' object has no attribute 'ndim'

How can I fix this problem?
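One workaround, if dask-specific parallelism is not strictly required for the transform step, is to use scikit-learn's own PCA inside the pipeline, which accepts pandas DataFrames and NumPy arrays directly. This is a minimal sketch with synthetic data (make_classification stands in for the original dataset, which is not shown in the issue):

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA  # sklearn's PCA, not dask_ml's
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the issue's dataset
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# All-sklearn pipeline: every step accepts in-memory arrays/DataFrames,
# so cross_val_score can pass fold slices without needing dask chunks.
pipe = make_pipeline(StandardScaler(), PCA(n_components=5),
                     GradientBoostingClassifier())
scores = cross_val_score(pipe, X, y, cv=3)
```

The underlying mismatch is that dask_ml.decomposition.PCA expects a dask Array with a .chunks attribute, while cross_val_score hands each step plain pandas/NumPy fold slices.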
Issue Analytics
- Created 4 years ago
- Comments: 7 (4 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
You should pass ndarrays to ParallelPostFit.fit if that's what the estimator (SVC in this case) is expecting.

On Sat, Nov 2, 2019 at 9:13 AM magehex notifications@github.com wrote:

@TomAugspurger Yes, I know that, but it is inconvenient. There also seems to be another bug in ParallelPostFit: I get the error TypeError: unhashable type: 'Array'. If I use GradientBoostingClassifier() instead, it works.
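The maintainer's suggestion (pass ndarrays, not DataFrames, to the wrapped estimator's fit) can be sketched without dask at all. The data here is synthetic and SVC is used only because the comment mentions it; the key line is the .to_numpy() conversion before fitting:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for a user's pandas DataFrame
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(4)])

clf = SVC()
# Convert the DataFrame to a plain ndarray before fitting, as the
# estimator (and ParallelPostFit wrapping it) expects array input.
clf.fit(df.to_numpy(), y)
preds = clf.predict(df.to_numpy())
```

When wrapping with ParallelPostFit, the same conversion applies: fit on the ndarray form of the training data rather than the DataFrame itself.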