check_estimator not valid for vectorizers

check_estimator is a powerful tool for automatically checking whether an estimator is compatible with scikit-learn.

However, it cannot be used with estimators whose input is not a numpy array of shape [n_samples, n_features] but a vector of objects (e.g. Vectorizers).

I was redirected here from my issue in sklearn-template. It would be really helpful if all the constant input used inside check_estimator could be collected as the attributes of an object, or produced as the return values of a function that the user of check_estimator defines according to a given specification.
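A minimal sketch of the idea above, assuming nothing about scikit-learn internals: the checks receive a user-defined function that returns the sample input instead of hard-coding 2D arrays. `ToyCountVectorizer`, `make_text_input`, `run_checks` and the two check functions are hypothetical names used only for illustration; none of them is part of scikit-learn.

```python
from collections import Counter


class ToyCountVectorizer:
    """Hypothetical stand-in for a text vectorizer (not scikit-learn's)."""

    def fit(self, docs, y=None):
        # Learn a sorted vocabulary from whitespace-separated tokens.
        tokens = {tok for doc in docs for tok in doc.split()}
        self.vocabulary_ = {tok: i for i, tok in enumerate(sorted(tokens))}
        return self

    def transform(self, docs):
        # Build one row of term counts per document.
        rows = []
        for doc in docs:
            counts = Counter(doc.split())
            rows.append([counts.get(tok, 0) for tok in self.vocabulary_])
        return rows

    def fit_transform(self, docs, y=None):
        return self.fit(docs).transform(docs)


def make_text_input():
    """User-defined input factory: the samples every check should use."""
    return ["the cat sat", "the dog barked", "the cat barked"]


def check_fit_returns_self(estimator, make_input):
    assert estimator.fit(make_input()) is estimator


def check_one_row_per_sample(estimator, make_input):
    X = make_input()
    assert len(estimator.fit(X).transform(X)) == len(X)


def run_checks(estimator_factory, make_input):
    """Run each check on a fresh estimator with the user-supplied input."""
    checks = [check_fit_returns_self, check_one_row_per_sample]
    for check in checks:
        check(estimator_factory(), make_input)
    return [check.__name__ for check in checks]


passed = run_checks(ToyCountVectorizer, make_text_input)
```

With this shape, supporting a new input type would only mean writing a different `make_input` function; the checks themselves stay unchanged.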

One could object that any Transformer requires as input a numpy array of shape [n_samples, n_features]. But a very common Transformer in sklearn, namely Tf-idf, takes input that is not an np.array.
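For illustration, a short example of the Tf-idf case, assuming scikit-learn is installed and default settings are used: TfidfVectorizer is fitted on a plain Python list of strings, and only its output has the (n_samples, n_features) shape. The two sample documents are made up for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The input is a plain Python list of strings, one per document,
# not a numeric array of shape (n_samples, n_features).
docs = ["the cat sat", "the dog barked"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# The output is a sparse matrix with one row per document and
# one column per learned vocabulary term.
n_samples, n_vocabulary = X.shape
vocabulary = sorted(vectorizer.vocabulary_)
```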

To sum up, since scikit-learn aims to be an increasingly general library that standardizes a template for developing machine learning packages in Python, it is essential that the automated consistency checks of its basic object, the Estimator, support various input formats.

Thank you for your attention!

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

1 reaction
jnothman commented, Oct 31, 2018

I think the point is that we are on top of this, but it is a slow and difficult change. Once our checks are more flexible, we can consider particular classes of variant estimators.

0 reactions
rth commented, Oct 30, 2018

rth how does #11622 solve this?

Well, given that e.g. check_estimator(CountVectorizer) fails, if individual tests were exposed, a crude first step could have been to at least skip those that fail for CountVectorizer and keep e.g. pickling checks etc.
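A rough sketch of that "skip what fails, keep the rest" approach, written with hypothetical names only (`ToyVectorizer`, `run_selected` and both check functions are illustrative and do not exist in scikit-learn). One check mimics the 2D numeric input assumption, the other is input-agnostic.

```python
class ToyVectorizer:
    """Hypothetical text transformer used only for this illustration."""

    def fit(self, docs, y=None):
        self.vocabulary_ = sorted({tok for doc in docs for tok in doc.split()})
        return self

    def transform(self, docs):
        return [[doc.split().count(tok) for tok in self.vocabulary_] for doc in docs]

    def fit_transform(self, docs, y=None):
        return self.fit(docs).transform(docs)


DOCS = ["the cat sat", "the dog barked"]


def check_accepts_2d_numeric_input(est):
    # Mimics a check that assumes (n_samples, n_features) numeric input;
    # it raises for ToyVectorizer because lists have no .split().
    est.fit([[0.0, 1.0], [1.0, 0.0]])


def check_fit_transform_consistency(est):
    # Input-agnostic: fit_transform should match fit followed by transform.
    combined = est.fit_transform(DOCS)
    assert est.fit(DOCS).transform(DOCS) == combined


def run_selected(est_factory, checks, skip=()):
    """Run each check on a fresh estimator, skipping those named in `skip`."""
    results = {}
    for check in checks:
        if check.__name__ in skip:
            results[check.__name__] = "skipped"
            continue
        check(est_factory())
        results[check.__name__] = "passed"
    return results


results = run_selected(
    ToyVectorizer,
    [check_accepts_2d_numeric_input, check_fit_transform_consistency],
    skip={"check_accepts_2d_numeric_input"},
)
```

A pickling round-trip check would slot into the same list, since it does not depend on the input format either.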

I agree, though, that estimator tags would be more general.

I think it is not about disabling checks but about transforming them. Looking at the other tests (in order to adapt them to my package), I found that most of the time the changes are all about changing the input.

Well, the problem is that this would add quite a lot of complexity. Most checks assume that the input is a 2D array and the output is 1D or 2D. Now we are considering text vectorizers; in the same way, one could say: I use ML on time series, or a recommendation algorithm, or metric learning, and my input data is not a 2D matrix (n_samples, n_features), but I would still like to use check_estimator to check consistency with the scikit-learn API to the extent possible (i.e. consistency between fit + predict / fit_predict, picklability, etc.). I hope estimator tags will help, but in general it doesn't sound that simple to generalize the current estimator checks to work in all possible use cases.

You could say that text vectorizers are in scope (because they are in scikit-learn) and the rest is not, as far as estimator checks are concerned, but I don’t see a fundamental difference between this and anything else that’s not the usual 2D array input.
