
validation_data parameter to fit with preprocessing

See original GitHub issue

It’s pretty common in keras to pass validation_data to fit to monitor how the loss behaves out of sample for each epoch during training.

I noticed that scikeras offers two solutions:

  1. pass validation_split in the initialization
  2. pass fit__validation_data in the initialization

Since I need to preprocess X and y before fitting, I cannot use validation_split, because that would cause data leakage. So I should opt for solution 2, but this makes the model stateful, since fit__validation_data is attached to the model instance.
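To make the leakage concern concrete, here is a minimal, dependency-free sketch (all names are illustrative; no scikeras API is used). Fitting a preprocessing step on the full dataset lets validation statistics influence the transform applied to the training data, which is exactly what validation_split cannot avoid:

```python
# Illustration of the leakage concern: computing preprocessing statistics
# (here, a simple mean) on the full dataset vs. on the training split only.

def mean(xs):
    return sum(xs) / len(xs)

data = [1.0, 2.0, 3.0, 100.0]  # the outlier ends up in the validation split
train, valid = data[:3], data[3:]

# Leaky: statistics computed on train + validation together.
leaky_mean = mean(data)

# Correct: statistics computed on the training split only,
# then reused to transform the validation split.
clean_mean = mean(train)

print(leaky_mean, clean_mean)  # 26.5 vs 2.0
```

With validation_split, Keras carves the validation set out *after* any upstream preprocessing has already seen it, so the only leak-free option is to preprocess the splits separately and pass the validation set in explicitly.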

Do you have any suggestion for this problem?

Many thanks

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 29 (14 by maintainers)

Top GitHub Comments

1 reaction
gioxc88 commented, Nov 20, 2020

@gioxc88 does the custom callback work for your use case?

This still leaves the question of what to do with this parameter. It seems like it would be better off in fit. That is what LightGBM does, and it has been at least proposed in sklearn (scikit-learn/scikit-learn#18748). The other option of course is to remove it completely (users could still use it, it would just be undocumented).

I am already using callbacks for other purposes and I am aware I could have used callbacks for this as well.

However, it requires a lot more effort, and the main drawback is that, to save the validation losses, I would need to create a new instance attribute on self.model, which is definitely not good practice.
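The state could instead live on the callback itself. Here is a sketch of that pattern: the per-epoch validation losses are kept on the callback object rather than as a new attribute on self.model. In real use this class would subclass keras.callbacks.Callback; the plain class below (illustrative, not scikeras API) only demonstrates the idea and runs without TensorFlow:

```python
# Sketch: a callback that records validation loss per epoch on itself,
# avoiding any new attributes on the model instance.

class ValidationLossHistory:
    def __init__(self):
        self.val_losses = []  # state lives on the callback, not the model

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        if "val_loss" in logs:
            self.val_losses.append(logs["val_loss"])

history = ValidationLossHistory()
# Keras would invoke on_epoch_end once per epoch with the logged metrics;
# we simulate three epochs here:
for epoch, loss in enumerate([0.9, 0.5, 0.3]):
    history.on_epoch_end(epoch, {"val_loss": loss})
print(history.val_losses)  # [0.9, 0.5, 0.3]
```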

As I said, as long as I can use fit__validation_data, even if undocumented, I am OK and the issue can be closed, because I would subclass KerasRegressor to modify the signature of fit to accept **fit_params.
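The subclassing idea can be sketched as follows. BaseRegressor is a minimal stand-in for scikeras.wrappers.KerasRegressor (used only so the example runs without TensorFlow installed); the point is that per-call arguments such as validation_data are forwarded through fit instead of being attached to the estimator at __init__ time:

```python
# Sketch of forwarding **fit_params through fit, as described above.

class BaseRegressor:
    """Stand-in for scikeras.wrappers.KerasRegressor (illustrative only)."""

    def fit(self, X, y, **kwargs):
        # In scikeras this would build the Keras model and call model.fit;
        # here we just record what was forwarded.
        self.fit_kwargs_ = kwargs
        return self

class ValidatingRegressor(BaseRegressor):
    def fit(self, X, y, **fit_params):
        # Accept per-call keyword arguments (e.g. validation_data) and
        # pass them straight through, keeping the estimator stateless.
        return super().fit(X, y, **fit_params)

reg = ValidatingRegressor()
reg.fit([[0.0]], [1.0], validation_data=([[1.0]], [2.0]))
```

This keeps the validation set a property of a single fit call rather than of the model instance, which is the statefulness concern raised earlier in the thread.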

The reason why I opened the issue is because I think that scikeras would benefit from this change.

Keeping track of the loss on the hold-out set vs. the training set is one of the most important aspects of any serious ML workflow involving neural networks, as is avoiding data leakage, which is very often disregarded. For this reason, passing validation_data to fit should be natural and very easy to do.

Unfortunately, I believe that both proposed solutions don't add any advantage over just changing the signature of fit to accept **fit_params. I can only see disadvantages in using either callbacks or for loops + partial_fit.

That being said, whatever you decide, thank you for considering my thoughts on this. I believe this is an excellent package!

Many thanks Gio

0 reactions
adriangb commented, Feb 16, 2021

Hi @gioxc88 , sorry to bother you again.

We are discussing implementing a feature that might help with your use case. This is not a replacement for **kwargs, merely another option for when those are not possible (grid search, cross validation, etc.). Would you be able to take a look at this example implementation and/or the DatasetTransformer section in these docs? Thank you!
