Cannot transform unknown length data/chunked data with a Pipeline (no ChunkedTransformer/ChunkedEstimator/ChunkedClassifier)
Description
Suppose you have more data than can fit in memory, split across separate files on disk. Frameworks like dask could be used to assemble them, but that is a huge amount of data copying for little benefit. The total data length could be calculated, but there is no real need to do that either. There is currently no way, with a Pipeline, to perform feature selection in chunks via partial fitting, followed by transforming and concatenating the chunks, and then fitting on the reasonably sized data that comes out of the transformation.
Furthermore, the transformers may each be applied to a subset of the features only. So 10000 features become 10 transformers with 1000 features each, and the best 5 features of each transformer are used, resulting in 50 features. This division of features among the transformers is also important due to memory constraints. Of course, ColumnTransformer already solves the task of dividing columns among transformers and choosing the best within each.
The assumption is that enough memory is available for 50 features over all the data, but that 1000 features over all the data will not fit in memory. I see no way a Pipeline can support chunked, unknown-length, load-on-demand data. Basically you want to partial_fit on the data, then concatenate while transforming it, then move on to the next stage of the pipeline, which could have more transformers and then an estimator. Only the very first transformer stage of the pipeline should need this special treatment, although in some cases the intermediate stages may need to be cached and redone with the same strategy.
Steps/Code to Reproduce
Say there are 1000 files, each with a varying number of rows (1000-10000) and 50000 features. Feature selection will reduce these 50000 features down to a mere 50 by choosing the best ones. It is assumed that, despite the numbers used, the data will fit into memory after feature selection but not before, so a full fit can only be done afterwards.
What you want to do (sketched in code below):
Stage 1: partial_fit over and over on the transformers for each chunk of data
Stage 2: transform each chunk of data with all the transformers, which perform the feature selection
Stage 3: concatenate the results
Stage 4: fit on the estimator
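A minimal sketch of these four stages, outside any Pipeline, might look like the following; paths, the .npz layout and the choice of IncrementalPCA (standing in for a partial_fit-capable feature reducer, since the univariate selectors do not implement partial_fit) are all illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA
from sklearn.linear_model import LogisticRegression

def iter_chunks(paths):
    # Hypothetical loader: each .npz file holds one chunk of rows ("X") and labels ("y").
    for path in paths:
        data = np.load(path)
        yield data["X"], data["y"]

reducer = IncrementalPCA(n_components=50)

# Stage 1: partial_fit over and over, one chunk at a time.
for X_chunk, _ in iter_chunks(paths):
    reducer.partial_fit(X_chunk)

# Stages 2 and 3: transform each chunk and concatenate the reduced output.
X_parts, y_parts = [], []
for X_chunk, y_chunk in iter_chunks(paths):
    X_parts.append(reducer.transform(X_chunk))
    y_parts.append(y_chunk)
X_reduced, y = np.vstack(X_parts), np.concatenate(y_parts)

# Stage 4: fit the final estimator on the now in-memory reduced data.
LogisticRegression(max_iter=1000).fit(X_reduced, y)
```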
This would seem to be an incredibly common need for data sets which are too large to fit in memory but which feature selection can reduce down to a reasonable size.
Solutions
Add ChunkedTransformer to scikit-learn. It would perform partial_fit in chunks, then transform in chunks and compile the result. I suppose ChunkedEstimator and ChunkedClassifier could also provide wrappers wherever partial_fit is available, or for predicting on overly large data. The input X format would need to be slightly different, an iterator of arrays rather than a single array, but the output would be the same. Caching in a Pipeline is already supported, so intermediate caching and loading is at least possible. These are simple classes to implement: they merely wrap an underlying Transformer/Estimator/Classifier, feed the partial_fit, transform or predict functions from an iterable, and compile the results. Unfortunately there are a good many places where the X parameter is assumed not to be a generic iterable, and it might take some refactoring to deal with that in places such as GridSearchCV.
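A rough sketch of what such a wrapper could look like (ChunkedTransformer here is hypothetical, not an existing scikit-learn class, and it assumes the chunk source can be iterated more than once):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ChunkedTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper: partial_fit an underlying transformer over an
    iterable of array chunks, then transform chunk by chunk and concatenate."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X_chunks, y=None):
        # X_chunks is an iterable of arrays rather than a single array.
        # It must be re-iterable (e.g. a list of memory-mapped arrays),
        # since fit and transform each walk through it once.
        for chunk in X_chunks:
            self.transformer.partial_fit(chunk)
        return self

    def transform(self, X_chunks):
        # Transform each chunk and compile the results into one array.
        return np.vstack([self.transformer.transform(chunk) for chunk in X_chunks])
```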
Another idea is for sklearn to support callback functions for X in models, which could load the data in chunks via iteration, but that is probably too massive a change, and fit_transform would still need to enumerate the data twice. Perhaps a class could do this by overriding the various __getitem__ methods to provide a virtualized view of the data; nonetheless the length is not known, so it would still have to fetch the data in chunks.
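For concreteness, a minimal sketch of what an iterable, load-on-demand view of X might look like (ChunkStream and the .npz layout are hypothetical); being re-iterable, it would also pair naturally with the wrapper sketched above:

```python
import numpy as np


class ChunkStream:
    """Hypothetical re-iterable view over chunk files: every iteration
    re-reads the files, so nothing is held in memory and no total row
    count is ever computed."""

    def __init__(self, paths):
        self.paths = paths

    def __iter__(self):
        for path in self.paths:
            yield np.load(path)["X"]  # assumed .npz files holding an "X" array
```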
Another is a partial_fit method which can be called repeatedly on the pipeline itself, together with a transform_fit method that automatically concatenates the results of the transform and fits the estimator. These two ideas are basically dead ends since they do not provide the granularity or flexibility needed for multiple stages of transformers in the pipeline. The first idea, adding Chunked wrappers, is the only real solution.
Is there something I am missing, or is this seemingly obvious task simply not possible with a pipeline? I would like to discretize my data in the transformers, have the transformers use a scoring algorithm to choose the best features once discretized, and then finally do the fitting, all on a large amount of data. Of course, there is another solution without a pipeline:
- partial_fit in a loop until the data is exhausted
- transform in a loop with concatenation until the data is exhausted
- pass to the estimator with fit
But now the benefit of using something like GridSearchCV to tune the feature selection and model together is missing.
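For contrast, with a hypothetical wrapper like the ChunkedTransformer sketched above, the composed version might look as follows; chunks is assumed to be a re-iterable sequence of array chunks and y an in-memory label vector, and, as noted earlier, tools like GridSearchCV would still need work before they could split such an X:

```python
from sklearn.pipeline import Pipeline
from sklearn.decomposition import IncrementalPCA
from sklearn.linear_model import LogisticRegression

# ChunkedTransformer is the hypothetical wrapper sketched above;
# IncrementalPCA stands in for a partial_fit-capable feature reducer.
pipe = Pipeline([
    ("reduce", ChunkedTransformer(IncrementalPCA(n_components=50))),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(chunks, y)  # chunks: re-iterable list of arrays; y: full label vector
```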
I’d be happy to see a PR implementing your idea, to see concretely how it works.
Better support for partial_fit has been on our roadmap, but from our perspective it's a very challenging one that touches a lot of places. But I'm happy to see a prototype of your idea.

dask delayed does seem to solve a good deal of what I have alluded to. Chunked data is not really needed in the known-length case, as dask solves that. The problem is mainly when computing the length requires evaluation, which undoes some of the benefit of using dask (or forces a huge amount of caching). The number of features should usually be determinable without computation rather than unknown (though theoretically it could be). And since nearly all models need all the features simultaneously and do not try to go through them in batches, and dask solves the memory constraint issue there, I will keep this as a rows-only issue, as that is the practical use case.
But as for unknown row counts, dask cannot get around the fact that computing the length might be undesirable.
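To make the contrast concrete: building a dask array from lazily loaded files needs every file's row count up front, so with unknown lengths those counts would have to be computed first, which is exactly the cost being avoided. In this sketch, manifest, the loader and the on-disk layout are all hypothetical:

```python
import numpy as np
import dask.array as da
from dask import delayed

@delayed
def load(path):
    # Hypothetical loader for one chunk file of shape (n_rows, 50000).
    return np.load(path)

# manifest: hypothetical list of (path, n_rows) pairs; the row counts must
# be known here, before any data is actually read.
parts = [
    da.from_delayed(load(path), shape=(n_rows, 50000), dtype="float64")
    for path, n_rows in manifest
]
X = da.concatenate(parts, axis=0)  # lazy; nothing is loaded until computed
```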
The issue here is that the model itself needs to understand that partial fitting as one phase and partial transforming as another eliminates the need to compute the data up front.
In this case, the computation is only done once over all the data in the fitting phase. Then in the transform phase only a single feature will be computed, as the others have been eliminated. The extra computation for that single feature should be cheap enough that caching all the data during dask graph building is simply not worth it.
If one does not care about writing large amounts of pre-computed data to the disk, then certainly dask is the way to go.
I will go ahead and write ChunkedTransformer, submit a PR for it so better discussion can ensue, and probably will still use dask to handle the scalability among the features.