
tuner with max_model_size not skipping oversized models

See original GitHub issue

When using max_model_size, I am finding that the tuner repeatedly tries the same oversized model and then errors out. This contradicts the expected behavior: the warning message says the oversized model will be skipped.

Some dummy code to reproduce the issue:

from tensorflow import keras
from kerastuner.tuners import RandomSearch  # newer releases: from keras_tuner import RandomSearch

def modelbuilder(hp):
    # Single tunable hidden layer: width searched between 2 and 20 units.
    model = keras.Sequential()
    model.add(keras.layers.Dense(hp.Int('width_1', 2, 20, step=1, sampling='linear'),
                                 input_shape=[100], activation='linear', name='Dense_1'))
    model.add(keras.layers.Dense(1, activation='linear', name='output'))
    optimizer = keras.optimizers.Adam(learning_rate=0.01, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
    model.compile(loss='mse', optimizer=optimizer, metrics=['mse'])
    return model

...
tuner = RandomSearch(modelbuilder, objective='val_loss', max_trials=100,
                     max_model_size=1800, overwrite=True)
tuner.search(x=train_data, y=train_target, epochs=3,
             validation_data=(valid_data, valid_target), verbose=0)

This gives the following output:

[Trial complete] [Trial summary] Hp values: |-width_1: 11 |-Score: 0.08742512226104736 |-Best step: 0
[Trial complete] [Trial summary] Hp values: |-width_1: 12 |-Score: 0.09093187510967254 |-Best step: 0
[Trial complete] [Trial summary] Hp values: |-width_1: 8 |-Score: 0.09126158863306046 |-Best step: 0
[Trial complete] [Trial summary] Hp values: |-width_1: 14 |-Score: 0.10759384512901306 |-Best step: 0
[Trial complete] [Trial summary] Hp values: |-width_1: 6 |-Score: 0.08583792209625245 |-Best step: 0
[Trial complete] [Trial summary] Hp values: |-width_1: 10 |-Score: 0.09473302498459817 |-Best step: 0
[Trial complete] [Trial summary] Hp values: |-width_1: 3 |-Score: 0.13883694425225257 |-Best step: 0
[Trial complete] [Trial summary] Hp values: |-width_1: 13 |-Score: 0.08331702768802643 |-Best step: 0
[Trial complete] [Trial summary] Hp values: |-width_1: 7 |-Score: 0.10881250619888305 |-Best step: 0
[Trial complete] [Trial summary] Hp values: |-width_1: 17 |-Score: 0.09528512597084045 |-Best step: 0
[Warning] Oversized model: 2041 parameters – skipping
[Warning] Oversized model: 2041 parameters – skipping
[Warning] Oversized model: 2041 parameters – skipping
[Warning] Oversized model: 2041 parameters – skipping
[Warning] Oversized model: 2041 parameters – skipping
[Warning] Oversized model: 2041 parameters – skipping
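For reference (arithmetic added here for clarity; it is not in the original report): with 100 inputs, a hidden width of w gives 100w + w hidden-layer parameters plus w + 1 output-layer parameters, i.e. 102w + 1 in total. So width_1 = 20 yields exactly the 2041 parameters flagged in the warnings, and any width_1 ≥ 18 exceeds max_model_size=1800, which is consistent with width_1 = 17 being the largest trial that completed above. The bug is that the same oversized configuration is retried repeatedly instead of being skipped.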

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Comments: 11 (3 by maintainers)

Top GitHub Comments

5 reactions
bberlo commented, May 8, 2021

Hey everyone,

I have created a dirty fix that works with models built with either the Keras Sequential or Functional API (i.e. you should at least be able to call the Keras backend method count_params on your model's weights). It can be used while we wait for a general solution to be designed and developed.

The basic idea is to subclass a high-level tuner class (e.g. BayesianOptimization) and override a couple of methods inherited from its parent classes: the Tuner class's _build_and_fit_model method and the BaseTuner class's on_trial_end method.

import numpy as np
import tensorflow as tf

# Import paths for the kerastuner 1.0.x package layout; newer releases ship
# the same classes under the keras_tuner name instead.
from kerastuner.engine import trial as trial_module
from kerastuner.tuners import bayesian

CUSTOM_MAX_MODEL_SIZE = 1800  # the parameter cap from the reproduction above


class BayesianSearchEdit(bayesian.BayesianOptimization):
    """
    TO-DO: add custom max_model_size input param to class
    def __init__(self):
        pass
    """

    def on_trial_end(self, trial):
        """A hook called after each trial is run.

        # Arguments:
            trial: A `Trial` instance.
        """
        # Send status to Logger
        if self.logger:
            self.logger.report_trial_state(trial.trial_id, trial.get_state())

        # Only mark the trial COMPLETED if _build_and_fit_model below has not
        # already marked it INVALID.
        if not trial.get_state().get("status") == trial_module.TrialStatus.INVALID:
            self.oracle.end_trial(trial.trial_id, trial_module.TrialStatus.COMPLETED)

        self.oracle.update_space(trial.hyperparameters)
        # Display needs the updated trial scored by the Oracle.
        self._display.on_trial_end(self.oracle.get_trial(trial.trial_id))
        self.save()

    def _build_and_fit_model(self, trial, fit_args, fit_kwargs):
        model = self.hypermodel.build(trial.hyperparameters)
        model_size = self.maybe_compute_model_size(model)
        print("Considering model with size: {}".format(model_size))

        if model_size > CUSTOM_MAX_MODEL_SIZE:
            # Tell the oracle this trial is INVALID rather than COMPLETED.
            self.oracle.end_trial(trial.trial_id, trial_module.TrialStatus.INVALID)

            # Return a dummy History object with a deliberately poor val_loss
            # so the rest of the tuner loop has something well-formed to consume.
            dummy_history_obj = tf.keras.callbacks.History()
            dummy_history_obj.on_train_begin()
            dummy_history_obj.history.setdefault('val_loss', []).append(2.5)
            return dummy_history_obj

        return model.fit(*fit_args, **fit_kwargs)

    def maybe_compute_model_size(self, model):
        """Compute the size of a given model, if it has been built."""
        if model.built:
            params = [tf.keras.backend.count_params(p) for p in model.trainable_weights]
            return int(np.sum(params))
        return 0

This dirty fix essentially just skips over trials. Therefore, I advise increasing the number of trials to account for the skipped ones.
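For illustration, here is how the subclass might be wired in, reusing modelbuilder and the data variables from the reproduction at the top of this issue. The inflated max_trials value is an assumption added here to absorb the skipped trials; none of this snippet comes from the original comment:

# Hypothetical usage sketch, not from the original post: the subclass takes
# the same constructor arguments as the stock BayesianOptimization tuner.
tuner = BayesianSearchEdit(
    modelbuilder,
    objective='val_loss',
    max_trials=150,  # inflated budget to absorb trials marked INVALID
    overwrite=True,
)
tuner.search(x=train_data, y=train_target, epochs=3,
             validation_data=(valid_data, valid_target), verbose=0)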

Regards, Bram

Updates 04/05/2021 and 08/05/2021: To prevent Keras Tuner from crashing on GPU out-of-memory (OOM) exceptions, you can wrap the model.fit call in exception handling (only tested with TensorFlow 2.3.0 so far):

        # Inside _build_and_fit_model, replace the final return statement with:
        try:
            return model.fit(*fit_args, **fit_kwargs)
        except (tf.errors.ResourceExhaustedError, tf.errors.InternalError):
            # Mark the trial INVALID and hand back the same dummy History
            # object used in the oversized-model branch above.
            self.oracle.end_trial(trial.trial_id, trial_module.TrialStatus.INVALID)

            dummy_history_obj = tf.keras.callbacks.History()
            dummy_history_obj.on_train_begin()
            dummy_history_obj.history.setdefault('val_loss', []).append(2.5)
            return dummy_history_obj

These crashes can still happen (though less frequently once model size is checked) because manual model-size calculation methods carry an error margin. In addition to ResourceExhaustedError, InternalError also has to be handled when a tf.distribute strategy is used during training, because TensorFlow may raise either of the two when the GPU runs out of memory.

After running the Hyperband tuner for a large number of trials, I discovered that the line model = self.hypermodel.build(trial.hyperparameters) was raising a RuntimeError as a result of consecutive GPU OOM errors. This was fixed by removing global Keras callbacks from the search method call and instead adding them locally to the fit_kwargs argument inside _build_and_fit_model, e.g.:

        fit_kwargs["callbacks"].extend([
            tf.keras.callbacks.EarlyStopping(monitor="val_loss", min_delta=0, patience=5, restore_best_weights=True)
        ])
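The original comment does not explain why this helps, so the following is an assumption: callbacks passed globally to search are single objects shared by every trial, so any state they accumulate (such as the best-weights copy kept by EarlyStopping when restore_best_weights=True) survives from one trial to the next, whereas constructing them inside _build_and_fit_model gives each trial fresh instances.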
4 reactions
ghost commented, Apr 8, 2020

Looks like this may be the culprit behind the AutoKeras OOM crashes I've been facing. It is a huge problem for me at the moment. Any further updates regarding this issue? https://github.com/keras-team/autokeras/issues/1078

