Keras Tuner stops at the end of the first epoch when training on TPU
This was discovered while testing https://www.kaggle.com/kivlichangoogle/jigsaw-multilingual-getting-started. Note that this issue only happens when TPU is enabled for the notebook.
When Keras Tuner is added, the search process fails at the start of the second epoch, while saving/loading the checkpoint. I think the root cause is in TF and TPU, but a temporary workaround in Keras Tuner would be nice, e.g., disabling checkpointing if possible.
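(For context: the TPU workers cannot access the notebook's local filesystem, so pointing the tuner's `directory` at a gs:// bucket is one way to avoid the failing local checkpoint writes. A minimal sketch, with `build_model` and the bucket path as placeholders, assuming the kerastuner RandomSearch API:)

```python
import kerastuner as kt

# TPU workers can only access gs:// storage, so all tuner artifacts
# (including per-epoch checkpoints) must live in a GCS bucket rather
# than on the notebook's local disk.
tuner = kt.RandomSearch(
    build_model,                      # placeholder hypermodel function
    objective="val_accuracy",
    max_trials=10,
    directory="gs://your-bucket/keras-tuner-dir",  # placeholder bucket
    project_name="jigsaw-multilingual",
)
```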
The error stack is as follows:
---------------------------------------------------------------------------
UnimplementedError Traceback (most recent call last)
<ipython-input-14-a9734e750f48> in <module>
6 verbose=1,
7 validation_data=nonenglish_val_datasets['Combined'],
----> 8 validation_steps=100)
/opt/conda/lib/python3.7/site-packages/kerastuner/engine/base_tuner.py in search(self, *fit_args, **fit_kwargs)
128
129 self.on_trial_begin(trial)
--> 130 self.run_trial(trial, *fit_args, **fit_kwargs)
131 self.on_trial_end(trial)
132 self.on_search_end()
/opt/conda/lib/python3.7/site-packages/kerastuner/engine/multi_execution_tuner.py in run_trial(self, trial, *fit_args, **fit_kwargs)
94
95 model = self.hypermodel.build(trial.hyperparameters)
---> 96 history = model.fit(*fit_args, **copied_fit_kwargs)
97 for metric, epoch_values in history.history.items():
98 if self.oracle.objective.direction == 'min':
/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
817 max_queue_size=max_queue_size,
818 workers=workers,
--> 819 use_multiprocessing=use_multiprocessing)
820
821 def evaluate(self,
/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py in fit(self, model, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
340 mode=ModeKeys.TRAIN,
341 training_context=training_context,
--> 342 total_epochs=epochs)
343 cbks.make_logs(model, epoch_logs, training_result, ModeKeys.TRAIN)
344
/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py in run_one_epoch(model, iterator, execution_function, dataset_size, batch_size, strategy, steps_per_epoch, num_samples, mode, training_context, total_epochs)
126 step=step, mode=mode, size=current_batch_size) as batch_logs:
127 try:
--> 128 batch_outs = execution_function(iterator)
129 except (StopIteration, errors.OutOfRangeError):
130 # TODO(kaftan): File bug about tf function and errors.OutOfRangeError?
/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py in execution_function(input_fn)
96 # `numpy` translates Tensors to values in Eager mode.
97 return nest.map_structure(_non_none_constant_value,
---> 98 distributed_function(input_fn))
99
100 return execution_function
/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/util/nest.py in map_structure(func, *structure, **kwargs)
566
567 return pack_sequence_as(
--> 568 structure[0], [func(*x) for x in entries],
569 expand_composites=expand_composites)
570
/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/util/nest.py in <listcomp>(.0)
566
567 return pack_sequence_as(
--> 568 structure[0], [func(*x) for x in entries],
569 expand_composites=expand_composites)
570
/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py in _non_none_constant_value(v)
128
129 def _non_none_constant_value(v):
--> 130 constant_value = tensor_util.constant_value(v)
131 return constant_value if constant_value is not None else v
132
/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/framework/tensor_util.py in constant_value(tensor, partial)
820 """
821 if isinstance(tensor, ops.EagerTensor):
--> 822 return tensor.numpy()
823 if not is_tensor(tensor):
824 return tensor
/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py in numpy(self)
940 """
941 # TODO(slebedev): Consider avoiding a copy for non-CPU or remote tensors.
--> 942 maybe_arr = self._numpy() # pylint: disable=protected-access
943 return maybe_arr.copy() if isinstance(maybe_arr, np.ndarray) else maybe_arr
944
/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py in _numpy(self)
908 return self._numpy_internal()
909 except core._NotOkStatusException as e:
--> 910 six.raise_from(core._status_to_exception(e.code, e.message), None)
911
912 @property
/opt/conda/lib/python3.7/site-packages/six.py in raise_from(value, from_value)
UnimplementedError: File system scheme '[local]' not implemented (file: 'keras-tuner-dir/jigsaw-multilingual/trial_5a9ddcb2f29ca8ba91966d1ca1862a84/checkpoints/epoch_0/checkpoint_temp_59890bc9cb8a40159f0a31dc22a070ff')
Encountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors.
Top GitHub Comments
You need to use `strategy = tf.distribute.TPUStrategy()` instead of `tf.distribute.experimental.TPUStrategy()`. That being said, it looks like the checkpointing callbacks are indeed trying to save h5 files (which should solve the TPU local file problem), but for some reason the checkpoints are not actually being saved here. I may need to look into this a bit more, but I did get it working on Colab. I will share the example later after I edit it a bit. To add …
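(For reference, a minimal sketch of the non-experimental TPU setup this comment suggests, available in TF 2.3+; `build_model` is a placeholder, and `tpu=""` assumes the Kaggle/Colab environment supplies the TPU address:)

```python
import tensorflow as tf

# Resolve and initialize the TPU; tpu="" assumes the environment
# (Kaggle/Colab) provides the TPU address.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

# Non-experimental strategy, as suggested above (TF 2.3+).
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    model = build_model()  # placeholder: build and compile the model here
```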
I don’t use Google Colab, but Kaggle. Using TPU, I get that same error “File system scheme ‘[local]’ not implemented” when the tuner tries to write the checkpoints to Kaggle’s working directory.
Since I don’t have a gs:// location, I just “modified” the function that Keras Tuner calls to save checkpoints, so that writing to a local dir (the Kaggle working directory) is allowed. I used patch() to mock the function.
The first important thing is that Keras Tuner must be version 1.1.2 or above.
Example:
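(The comment’s original code block was not preserved here; below is a minimal sketch of the idea, assuming Keras Tuner 1.1.2+, where per-trial checkpoints are written by the `SaveBestEpoch` callback in `keras_tuner.engine.tuner_utils`. The patch target, the `experimental_io_device` trick, and the `tuner.search(...)` arguments are assumptions for illustration, not the commenter’s exact code:)

```python
import tensorflow as tf
from unittest.mock import patch

def _save_weights_locally(self, epoch, logs=None):
    # Assumed replacement for SaveBestEpoch.on_epoch_end: write the weights
    # with an explicit local I/O device so the TPU workers stop trying to
    # resolve the notebook's local path. Simplification: this saves after
    # every epoch instead of tracking only the best one.
    self.model.save_weights(
        self.filepath,
        options=tf.train.CheckpointOptions(
            experimental_io_device="/job:localhost"
        ),
    )

with patch(
    "keras_tuner.engine.tuner_utils.SaveBestEpoch.on_epoch_end",
    _save_weights_locally,
):
    # Placeholder search call; use your own datasets and fit arguments.
    tuner.search(train_dataset, epochs=10, validation_data=val_dataset)
```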
Since I used `with patch`, everything reverts back to the original code automatically after the block finishes.