Dealing with variable length inputs


Let’s assume we are working with variable-length inputs. One of the biggest strengths of tf.data.Dataset is the ability to pad each batch as it is produced.

But since scikit-learn’s API is mainly built around dataframes and arrays, incorporating this is hard. Obviously, you can pad everything up front, but this can be a huge waste of memory. I’m trying to work with the sklearn.pipeline.Pipeline object, and I thought to myself, "Alright, I’ll just create a custom transformer at the end of my pipeline, just before the model, and make it return a tf.data.Dataset object to plug into my model later." But this is not possible, since the .transform signature only accepts X and not y, while you need both to build a tf.data.Dataset.

So assume we have 4 features for each data point, and each has its own sequence length. For example, a data point might look like this:

sample_features = {'a': [1,2,3], 'b': [1,2,3,4,5], 'c': 1, 'd': [1,2]}
sample_label = 0

How can I manage this kind of dataset with scikit-learn + SciKeras?
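To make the memory argument concrete, here is a minimal sketch of padding per batch instead of padding the whole dataset. It is plain Python with no TensorFlow or scikit-learn involved, and it only imitates the idea behind tf.data.Dataset.padded_batch. The helper names (as_list, pad_batch, iter_padded_batches), the pad value of 0, and the second and third samples are all made up for illustration; only the first sample comes from the question.

```python
from typing import Any, Dict, Iterator, List, Sequence, Tuple

Sample = Dict[str, Any]


def as_list(value: Any) -> List[Any]:
    """Treat a scalar feature (like 'c') as a length-1 sequence."""
    return list(value) if isinstance(value, (list, tuple)) else [value]


def pad_batch(samples: Sequence[Sample], pad_value: int = 0) -> Dict[str, List[List[Any]]]:
    """Pad each feature to the longest sequence found in this batch only."""
    padded = {}
    for key in samples[0]:
        rows = [as_list(sample[key]) for sample in samples]
        width = max(len(row) for row in rows)
        padded[key] = [row + [pad_value] * (width - len(row)) for row in rows]
    return padded


def iter_padded_batches(
    features: Sequence[Sample],
    labels: Sequence[Any],
    batch_size: int,
    pad_value: int = 0,
) -> Iterator[Tuple[Dict[str, List[List[Any]]], List[Any]]]:
    """Yield (padded_features, labels) one batch at a time."""
    for start in range(0, len(features), batch_size):
        stop = start + batch_size
        yield pad_batch(features[start:stop], pad_value), list(labels[start:stop])


# First sample is the one from the question; the other two are invented.
features = [
    {"a": [1, 2, 3], "b": [1, 2, 3, 4, 5], "c": 1, "d": [1, 2]},
    {"a": [7], "b": [1, 2], "c": 0, "d": [1, 2, 3, 4]},
    {"a": [4, 5], "b": [9], "c": 1, "d": [3]},
]
labels = [0, 1, 0]

batches = list(iter_padded_batches(features, labels, batch_size=2))
```

Note that the generator needs both the features and the labels to build each batch, which is exactly what a transformer with a .transform(X) signature cannot receive.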

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 16 (8 by maintainers)

Top GitHub Comments

eliorc commented, Aug 22, 2021

@adriangb I’ve set myself a reminder and I’ll try to look at this next weekend.

EDIT

Just so I don’t leave this hanging: in the end I did not have time to see this through… I’m still interested in the incorporation of tf.data.Dataset in SciKeras, though 😄

adriangb commented, Jan 10, 2021

“the current two transformers” do you mean the feature and target encoders?

Yep

I’ll try to dig into the internals of scikeras maybe next weekend (weekend in Israel is Friday-Saturday)

Please enjoy your weekend! No rush.

It would save me some time if you could link me to the code that sits between the transformers and the scikeras model

Sure thing. The gist of it is that these are dependency injection points for users to insert custom data transformations. Calling BaseWrapper.fit instantiates and fits the transformers here. Adding another transformer just consists of adding some default transformers (sklearn.preprocessing.FunctionTransformer) and a couple of lines to instantiate and fit the transformer. I think the hardest part is going to be figuring out the signature of the transformer, since it’ll be non-standard (sklearn accepts only 1 parameter, we need 2 or a tuple).
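As a rough illustration of the "2 or a tuple" point, here is a sketch of what a tuple-based signature could look like. PairTransformer and pair_up are hypothetical names that exist in neither scikit-learn nor SciKeras; the class only borrows the fit/transform naming so that the single (X, y) argument stands out. It is plain Python, and the sample data is invented.

```python
from typing import Any, Callable, Tuple

Pair = Tuple[Any, Any]


class PairTransformer:
    """Hypothetical transformer whose fit/transform take one (X, y) tuple.

    Not part of scikit-learn or SciKeras; it only mimics the fit/transform
    naming so the non-standard signature is easy to see.
    """

    def __init__(self, func: Callable[[Any, Any], Any]):
        self.func = func

    def fit(self, data: Pair) -> "PairTransformer":
        X, y = data
        if len(X) != len(y):
            raise ValueError("X and y must have the same number of samples")
        self.n_samples_ = len(X)
        return self

    def transform(self, data: Pair) -> Any:
        # Unpack the tuple so the wrapped function sees X and y together.
        X, y = data
        return self.func(X, y)

    def fit_transform(self, data: Pair) -> Any:
        return self.fit(data).transform(data)


def pair_up(X, y):
    """Default for this sketch: one (features, label) pair per sample."""
    return list(zip(X, y))


# Invented variable-length data, passed as a single tuple argument.
X = [[1, 2, 3], [4], [5, 6]]
y = [0, 1, 0]
dataset = PairTransformer(pair_up).fit_transform((X, y))
```

Passing a tuple keeps the one-argument shape that scikit-learn-style transform methods expect, while still giving the function access to the targets. Whether SciKeras would prefer a tuple or two separate parameters is the open question in the comment above.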


