Sklearn pipeline and cross_val_score don't work for some transformers

See original GitHub issue

Hi,

from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from dask_ml.decomposition import PCA
from dask_ml.wrappers import ParallelPostFit
from dask_ml.preprocessing import StandardScaler

clf = ParallelPostFit(estimator=GradientBoostingClassifier(), scoring='accuracy')

pipe = make_pipeline(PCA(), clf)              # fails with the error below
pipe = make_pipeline(StandardScaler(), clf)   # works

mysc = cross_val_score(pipe, dataset.iloc[:,:-1], dataset.iloc[:,-1])

This works for pipe = make_pipeline(StandardScaler(), clf) but not for pipe = make_pipeline(PCA(), clf). I get the error: AttributeError: 'DataFrame' object has no attribute 'chunks'

If I use RobustScaler() instead of StandardScaler(), I get: AttributeError: 'int' object has no attribute 'ndim'

How can I fix this problem?
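For context on the first traceback: dask-ml's PCA operates on dask arrays, which carry a chunks attribute describing their block layout, while a pandas DataFrame has no such attribute — which is exactly what the AttributeError is complaining about. A quick check (plain pandas, no dask required):

```python
import pandas as pd

# A pandas DataFrame is a single in-memory block: it has no
# dask-style chunking, so anything that reads .chunks will fail.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})
print(hasattr(df, "chunks"))
```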

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
TomAugspurger commented, Nov 2, 2019

You should pass ndarrays to ParallelPostFit.fit if that’s what the estimator (SVC in this case) is expecting.
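One way to follow this advice mechanically — a minimal sketch, where `to_ndarray` is a hypothetical helper (not part of dask-ml) that materializes a dask collection before handing it to the wrapped estimator:

```python
import numpy as np

def to_ndarray(X):
    # Hypothetical helper: dask arrays and dataframes expose .compute(),
    # which pulls the full collection into memory; in-memory inputs
    # pass straight through to np.asarray.
    if hasattr(X, "compute"):
        X = X.compute()
    return np.asarray(X)

# With an in-memory input the helper is a no-op apart from the coercion.
X_np = to_ndarray([[1.0, 2.0], [3.0, 4.0]])
print(X_np.shape)
```

With dask installed, `to_ndarray` applied to a dask array gives the plain ndarray that an in-memory estimator like SVC expects from ParallelPostFit.fit.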


0 reactions
magehex commented, Nov 2, 2019

@TomAugspurger Yes, I know that. This is inconvenient. But it seems there is also another bug in ParallelPostFit:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from dask_ml.wrappers import ParallelPostFit
from dask_ml.datasets import make_classification

X, y = make_classification(n_samples=1000, random_state=0, chunks=1000)

clf = ParallelPostFit(estimator=SVC(), scoring='accuracy')
#clf = ParallelPostFit(estimator=GradientBoostingClassifier(), scoring='accuracy')
clf.fit(X, y)

I get the error: TypeError: unhashable type: 'Array'

If I use GradientBoostingClassifier(), it works.
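For what it's worth, when the data fits in memory, swapping the dask-ml pieces for their plain scikit-learn counterparts sidesteps both errors, since every pipeline step then agrees on NumPy input. A minimal sketch with synthetic data (shapes, estimator, and fold count are illustrative):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(120, 5)
y = (X[:, 0] + 0.1 * rng.randn(120) > 0).astype(int)

# All-scikit-learn pipeline: this PCA happily accepts NumPy arrays,
# so cross_val_score runs without any dask chunking involved.
pipe = make_pipeline(StandardScaler(), PCA(n_components=3),
                     GradientBoostingClassifier(random_state=0))
scores = cross_val_score(pipe, X, y, cv=3)
print(len(scores))
```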
