Keras Tuner stopped at the end of the first epoch when training on TPU

This was discovered while testing https://www.kaggle.com/kivlichangoogle/jigsaw-multilingual-getting-started. Note that the issue only happens when TPU is enabled for the notebook.

When Keras Tuner is added, the search fails at the start of the second epoch, when saving/loading a checkpoint. I think the root cause is in TF and TPU, but a temporary workaround in Keras Tuner would be nice, e.g. disabling checkpointing if possible.

The stack trace is below:

---------------------------------------------------------------------------
UnimplementedError                        Traceback (most recent call last)
<ipython-input-14-a9734e750f48> in <module>
      6              verbose=1,
      7              validation_data=nonenglish_val_datasets['Combined'],
----> 8              validation_steps=100)

/opt/conda/lib/python3.7/site-packages/kerastuner/engine/base_tuner.py in search(self, *fit_args, **fit_kwargs)
    128 
    129             self.on_trial_begin(trial)
--> 130             self.run_trial(trial, *fit_args, **fit_kwargs)
    131             self.on_trial_end(trial)
    132         self.on_search_end()

/opt/conda/lib/python3.7/site-packages/kerastuner/engine/multi_execution_tuner.py in run_trial(self, trial, *fit_args, **fit_kwargs)
     94 
     95             model = self.hypermodel.build(trial.hyperparameters)
---> 96             history = model.fit(*fit_args, **copied_fit_kwargs)
     97             for metric, epoch_values in history.history.items():
     98                 if self.oracle.objective.direction == 'min':

/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
    817         max_queue_size=max_queue_size,
    818         workers=workers,
--> 819         use_multiprocessing=use_multiprocessing)
    820 
    821   def evaluate(self,

/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py in fit(self, model, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
    340                 mode=ModeKeys.TRAIN,
    341                 training_context=training_context,
--> 342                 total_epochs=epochs)
    343             cbks.make_logs(model, epoch_logs, training_result, ModeKeys.TRAIN)
    344 

/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py in run_one_epoch(model, iterator, execution_function, dataset_size, batch_size, strategy, steps_per_epoch, num_samples, mode, training_context, total_epochs)
    126         step=step, mode=mode, size=current_batch_size) as batch_logs:
    127       try:
--> 128         batch_outs = execution_function(iterator)
    129       except (StopIteration, errors.OutOfRangeError):
    130         # TODO(kaftan): File bug about tf function and errors.OutOfRangeError?

/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py in execution_function(input_fn)
     96     # `numpy` translates Tensors to values in Eager mode.
     97     return nest.map_structure(_non_none_constant_value,
---> 98                               distributed_function(input_fn))
     99 
    100   return execution_function

/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/util/nest.py in map_structure(func, *structure, **kwargs)
    566 
    567   return pack_sequence_as(
--> 568       structure[0], [func(*x) for x in entries],
    569       expand_composites=expand_composites)
    570 

/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/util/nest.py in <listcomp>(.0)
    566 
    567   return pack_sequence_as(
--> 568       structure[0], [func(*x) for x in entries],
    569       expand_composites=expand_composites)
    570 

/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py in _non_none_constant_value(v)
    128 
    129 def _non_none_constant_value(v):
--> 130   constant_value = tensor_util.constant_value(v)
    131   return constant_value if constant_value is not None else v
    132

/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/framework/tensor_util.py in constant_value(tensor, partial)
    820   """
    821   if isinstance(tensor, ops.EagerTensor):
--> 822     return tensor.numpy()
    823   if not is_tensor(tensor):
    824     return tensor

/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py in numpy(self)
    940     """
    941     # TODO(slebedev): Consider avoiding a copy for non-CPU or remote tensors.
--> 942     maybe_arr = self._numpy()  # pylint: disable=protected-access
    943     return maybe_arr.copy() if isinstance(maybe_arr, np.ndarray) else maybe_arr
    944 

/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py in _numpy(self)
    908       return self._numpy_internal()
    909     except core._NotOkStatusException as e:
--> 910       six.raise_from(core._status_to_exception(e.code, e.message), None)
    911 
    912   @property

/opt/conda/lib/python3.7/site-packages/six.py in raise_from(value, from_value)

UnimplementedError: File system scheme '[local]' not implemented (file: 'keras-tuner-dir/jigsaw-multilingual/trial_5a9ddcb2f29ca8ba91966d1ca1862a84/checkpoints/epoch_0/checkpoint_temp_59890bc9cb8a40159f0a31dc22a070ff')
	Encountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors.

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 4
  • Comments: 20 (6 by maintainers)

Top GitHub Comments

yixingfu commented, Aug 2, 2020 (1 reaction)

@yixingfu Sorry for the late reply, https://colab.research.google.com/drive/1tQuB6v-_b09lQQN7EkQeV9CzvndCO6SZ?usp=sharing

You need to use strategy = tf.distribute.TPUStrategy() instead of tf.distribute.experimental.TPUStrategy(). That being said, it looks like the checkpointing callbacks are indeed trying to save h5 files (which should solve the TPU local file problem), but for some reason the checkpoints are not actually being saved here. I may need to look into this a bit more, but I did get it working on Colab. I will share the example later after I edit it a bit.
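For readers who want to see what switching to tf.distribute.TPUStrategy looks like, here is a minimal sketch. It is not the example promised in the comment above. The helper name connect_tpu_strategy and its tpu parameter are made up for illustration; the TensorFlow calls are the usual TPU setup sequence in TF 2.3 and later. It only does something useful on a runtime with a TPU attached.

```python
def connect_tpu_strategy(tpu=None):
    """Connect to a TPU and return a tf.distribute.TPUStrategy.

    `tpu` is an illustrative parameter: None lets the resolver look up
    the TPU from the environment, as hosted notebooks usually allow.
    """
    # Imported inside the function so the helper can be defined on a
    # machine without TensorFlow; the call itself still needs TF and a TPU.
    import tensorflow as tf

    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu)
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    # Non-experimental class, as recommended in the comment above.
    return tf.distribute.TPUStrategy(resolver)


# Usage (requires a TPU runtime):
# strategy = connect_tpu_strategy()
# with strategy.scope():
#     model = build_model()  # build_model is a hypothetical user function
```

Creating the strategy does not by itself fix the checkpoint error; it only covers the API change the comment asks for.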

josephramon commented, May 24, 2022 (0 reactions)

To add …

I don’t use Google Colab, but Kaggle. Using TPU, I get that same error “File system scheme ‘[local]’ not implemented”, when the tuner tries to write the checkpoints on Kaggle’s working directory.

Since I don’t have a gs:// location, I “modified” the function that Keras Tuner calls to save checkpoints so that it can write to a local directory, which here is the Kaggle working directory. I used patch() to mock the function.

One important prerequisite: Keras Tuner must be version 1.1.2 or above.

Example:

from unittest.mock import patch

import tensorflow as tf

<your code>

# Replacement for keras_tuner.engine.tuner_utils.SaveBestEpoch.on_epoch_end.
# It is the existing method with `options=save_locally` added to each
# save_weights() call.

def new_on_epoch_end(self, epoch, logs=None):
    # Added: access the filesystem from the local host rather than from the
    # TPU worker. (tf.train.CheckpointOptions accepts the same
    # experimental_io_device argument.)
    save_locally = tf.saved_model.SaveOptions(experimental_io_device='/job:localhost')

    if not self.objective.has_value(logs):
        # Save on every epoch if metric value is not in the logs. Either no
        # objective is specified, or objective is computed and returned
        # after `fit()`.
        self.model.save_weights(self.filepath, options=save_locally)
        return
    current_value = self.objective.get_value(logs)
    if self.objective.better_than(current_value, self.best_value):
        self.best_value = current_value
        self.model.save_weights(self.filepath, options=save_locally)


with patch('keras_tuner.engine.tuner_utils.SaveBestEpoch.on_epoch_end', new_on_epoch_end):
    # Perform hypertuning. The arguments are the same as for the fit() method.
    tuner.search(
        X_train,
        y_train,
        epochs=num_of_epochs,
        validation_data=(X_valid, y_valid),
        callbacks=[early_stopping],
    )

<more of your code>

Since I used ‘with patch’, after all is done, it reverts back to the original code automatically.
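The automatic revert can be checked without Keras Tuner. The sketch below uses only the standard library; Saver and patched_on_epoch_end are made-up stand-ins for SaveBestEpoch and the replacement method, not part of any library.

```python
from unittest.mock import patch


class Saver:
    """Toy stand-in for a callback class such as SaveBestEpoch (hypothetical)."""

    def on_epoch_end(self, epoch):
        return f"original save for epoch {epoch}"


def patched_on_epoch_end(self, epoch):
    # Replacement method; takes `self` because it is set on the class.
    return f"patched save for epoch {epoch}"


saver = Saver()

with patch.object(Saver, "on_epoch_end", patched_on_epoch_end):
    inside = saver.on_epoch_end(0)  # the replacement is active here

after = saver.on_epoch_end(0)  # the original is restored once the block exits

print(inside)
print(after)
```

The same scoping applies to the string-target form of patch() used above: the replacement only exists for the duration of the with block, so code that runs afterwards sees the unmodified method.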
