Tuner intermittently failing
System information
- Have I specified the code to reproduce the issue (Yes, No): Yes
- Environment in which the code is executed (e.g., Local(Linux/MacOS/Windows), Interactive Notebook, Google Cloud, etc): Kubeflow through Vertex
- TensorFlow version: 2.7
- TFX Version: 1.6.1
- Python version: 3.7
- Python dependencies (from `pip freeze` output):
Describe the current behavior
Tuner intermittently failing.
Describe the expected behavior
It shouldn't fail.
Other info / logs
```
Error: Best HyperParameters: {'space': [{'class_name': 'Choice', 'config': {'name': 'learning_rate', 'default': 0.0001, 'conditions': [], 'values': [0.0001, 0.001, 0.01, 0.1, 0.2], 'ordered': True}}], 'values': {'learning_rate': 0.01}}
Error: Best Hyperparameters are written to gs://…/Tuner_Logistic_Regression_2754698840043945984/best_hyperparameters/best_hyperparameters.txt.
Error: Terminating chief oracle at PID: 16
Error: Terminating chief oracle at PID: 16
```
I found that approximately 1 in every 5 runs was failing with the above logs in Vertex. Looking into the issue further, I noticed a strange setup in my code:
My Vertex Tuner had num_parallel_trials set to 3, as below:
```python
return tfx.extensions.google_cloud_ai_platform.Tuner(
    module_file=model_trainer,
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    schema=schema,
    train_args=tfx.proto.TrainArgs(num_steps=train_num_steps),
    eval_args=tfx.proto.EvalArgs(num_steps=eval_num_steps),
    tune_args=tfx.proto.TuneArgs(
        # num_parallel_trials=3 means that 3 search loops are
        # running in parallel.
        num_parallel_trials=3),
    custom_config=custom_config).with_id(tuner_id)
```
But since I was just trying to keep processing time to a minimum while trying out TFX and Vertex, I had set my Tuner's max_trials to 2, i.e. less than num_parallel_trials:
```python
tuner = kt.RandomSearch(
    hypermodel=hypermodel,
    max_trials=2,
    hyperparameters=hyperparams,
    seed=123,
    allow_new_entries=False,
    objective=kt.Objective('val_binary_accuracy', 'max'),
    directory=fn_args.working_dir,
    project_name=project_name)
```
I've been able to stop the issue by increasing max_trials to 3, but ideally this wouldn't be necessary, or there would be some kind of warning or error describing the problem with the setup.
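A minimal sketch of the kind of up-front validation I'd expect. Note that `check_tune_args` is a hypothetical helper, not part of the TFX or KerasTuner APIs; the idea is simply to fail fast, with a clear message, when fewer trials than parallel workers are configured:

```python
# Hypothetical guard: validate tuner settings before building the component.
def check_tune_args(max_trials: int, num_parallel_trials: int) -> None:
    """Raise early when max_trials < num_parallel_trials.

    In that configuration some parallel workers have no trial to run,
    which appears to trigger the intermittent failure described above.
    """
    if max_trials < num_parallel_trials:
        raise ValueError(
            f"max_trials ({max_trials}) must be >= num_parallel_trials "
            f"({num_parallel_trials}); otherwise some parallel tuning "
            "workers have no trial to execute.")


# The failing setup from this issue would be rejected immediately:
check_tune_args(max_trials=3, num_parallel_trials=3)  # OK
try:
    check_tune_args(max_trials=2, num_parallel_trials=3)
except ValueError as e:
    print(f"rejected: {e}")
```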
Top GitHub Comments
To complete the issue, here is an example of code reproducing the problem. The failure is random and occurs roughly once every 6 pipeline runs with the code presented in this notebook.
Thank you for your help.
Hello @pindinagesh, could you take a look? @SylvainGavoille put together a reproducible example that fails about one in every 6 pipeline runs. Can you tell us whether this is an issue on the KerasTuner side or here? As it stands, the Python code fails on the TFX side.
Thanks,