
validation_data parameter to fit with preprocessing

See original GitHub issue

It’s pretty common in keras to pass validation_data to fit to monitor how the loss behaves out of sample for each epoch during training.

I noticed that scikeras offers two solutions:

  1. pass validation_split in the initialization
  2. pass fit__validation_data in the initialization

Since I need to preprocess X and y before fitting, I cannot use validation_split, because that would cause data leakage. So I should opt for solution 2, but this makes the model stateful, since fit__validation_data is attached to the model instance.
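To make the leakage concern concrete, here is a minimal, dependency-free sketch (all names are illustrative; no scikeras API is used). Fitting a preprocessing step on the full dataset lets validation statistics influence the transform applied to the training data, which is exactly what validation_split cannot avoid:

```python
# Illustration of the leakage concern: computing preprocessing statistics
# (here, a simple mean) on the full dataset vs. on the training split only.

def mean(xs):
    return sum(xs) / len(xs)

data = [1.0, 2.0, 3.0, 100.0]  # the outlier ends up in the validation split
train, valid = data[:3], data[3:]

# Leaky: statistics computed on train + validation together.
leaky_mean = mean(data)

# Correct: statistics computed on the training split only,
# then reused to transform the validation split.
clean_mean = mean(train)

print(leaky_mean, clean_mean)  # 26.5 vs 2.0
```

With validation_split, Keras carves the validation set out *after* any upstream preprocessing has already seen it, so the only leak-free option is to preprocess the splits separately and pass the validation set in explicitly.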

Do you have any suggestion for this problem?

Many thanks

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 29 (14 by maintainers)

Top GitHub Comments

1 reaction
gioxc88 commented, Nov 20, 2020

@gioxc88 does the custom callback work for your use case?

This still leaves the question of what to do with this parameter. It seems like it would be better off in fit. That is what LightGBM does, and it has been at least proposed in sklearn (scikit-learn/scikit-learn#18748). The other option of course is to remove it completely (users could still use it, it would just be undocumented).

I am already using callbacks for other purposes and I am aware I could have used callbacks for this as well.

However, it requires a lot more effort, and the main drawback is that, to save the validation losses, I would need to create a new instance attribute on self.model, which is definitely not good practice.
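The state could instead live on the callback itself. Here is a sketch of that pattern: the per-epoch validation losses are kept on the callback object rather than as a new attribute on self.model. In real use this class would subclass keras.callbacks.Callback; the plain class below (illustrative, not scikeras API) only demonstrates the idea and runs without TensorFlow:

```python
# Sketch: a callback that records validation loss per epoch on itself,
# avoiding any new attributes on the model instance.

class ValidationLossHistory:
    def __init__(self):
        self.val_losses = []  # state lives on the callback, not the model

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        if "val_loss" in logs:
            self.val_losses.append(logs["val_loss"])

history = ValidationLossHistory()
# Keras would invoke on_epoch_end once per epoch with the logged metrics;
# we simulate three epochs here:
for epoch, loss in enumerate([0.9, 0.5, 0.3]):
    history.on_epoch_end(epoch, {"val_loss": loss})
print(history.val_losses)  # [0.9, 0.5, 0.3]
```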

As I said, as long as I can use fit__validation_data, even if undocumented, I am OK and the issue can be closed, because I would subclass KerasRegressor to modify the signature of fit to accept **fit_params.
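The subclassing idea can be sketched as follows. BaseRegressor is a minimal stand-in for scikeras.wrappers.KerasRegressor (used only so the example runs without TensorFlow installed); the point is that per-call arguments such as validation_data are forwarded through fit instead of being attached to the estimator at __init__ time:

```python
# Sketch of forwarding **fit_params through fit, as described above.

class BaseRegressor:
    """Stand-in for scikeras.wrappers.KerasRegressor (illustrative only)."""

    def fit(self, X, y, **kwargs):
        # In scikeras this would build the Keras model and call model.fit;
        # here we just record what was forwarded.
        self.fit_kwargs_ = kwargs
        return self

class ValidatingRegressor(BaseRegressor):
    def fit(self, X, y, **fit_params):
        # Accept per-call keyword arguments (e.g. validation_data) and
        # pass them straight through, keeping the estimator stateless.
        return super().fit(X, y, **fit_params)

reg = ValidatingRegressor()
reg.fit([[0.0]], [1.0], validation_data=([[1.0]], [2.0]))
```

This keeps the validation set a property of a single fit call rather than of the model instance, which is the statefulness concern raised earlier in the thread.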

The reason why I opened the issue is because I think that scikeras would benefit from this change.

Keeping track of the loss on the hold-out set vs. the training set is one of the most important aspects of any serious ML workflow involving neural networks, as is avoiding data leakage, which is very often disregarded. For this reason, passing validation_data to fit should be natural and very easy to do.

Unfortunately, I believe that both proposed solutions don't add any advantage over just changing the signature of fit to accept **fit_params. I can only see disadvantages in using either callbacks or for loops + partial_fit.

That being said, whatever you decide, thank you for considering my thoughts on this. I believe this is an excellent package!

Many thanks Gio

0 reactions
adriangb commented, Feb 16, 2021

Hi @gioxc88 , sorry to bother you again.

We are discussing implementing a feature that might help with your use case. This is not a replacement for **kwargs, merely another option for when those are not possible (grid search, cross validation, etc.). Would you be able to take a look at this example implementation and/or the DatasetTransformer section in these docs? Thank you!
