Tuner intermittently failing
System information
- Have I specified the code to reproduce the issue (Yes, No): Yes
- Environment in which the code is executed (e.g., Local(Linux/MacOS/Windows), Interactive Notebook, Google Cloud, etc): Kubeflow through Vertex
- TensorFlow version: 2.7
- TFX Version: 1.6.1
- Python version: 3.7
- Python dependencies (from `pip freeze` output):
Describe the current behavior
Tuner intermittently failing.
Describe the expected behavior
It shouldn't fail.
Other info / logs
```
Error: Best HyperParameters: {'space': [{'class_name': 'Choice', 'config': {'name': 'learning_rate', 'default': 0.0001, 'conditions': [], 'values': [0.0001, 0.001, 0.01, 0.1, 0.2], 'ordered': True}}], 'values': {'learning_rate': 0.01}}
Error: Best Hyperparameters are written to gs://…/Tuner_Logistic_Regression_2754698840043945984/best_hyperparameters/best_hyperparameters.txt.
Error: Terminating chief oracle at PID: 16
Error: Terminating chief oracle at PID: 16
```
I found that approximately 1 in every 5 runs was failing with the above logs in Vertex. Looking into the issue further, I noticed a strange setup in my code:
My Vertex Tuner had num_parallel_trials set to 3, as below:
```python
return tfx.extensions.google_cloud_ai_platform.Tuner(
    module_file=model_trainer,
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    schema=schema,
    train_args=tfx.proto.TrainArgs(num_steps=train_num_steps),
    eval_args=tfx.proto.EvalArgs(num_steps=eval_num_steps),
    tune_args=tfx.proto.TuneArgs(
        # num_parallel_trials=3 means that 3 search loops are
        # running in parallel.
        num_parallel_trials=3),
    custom_config=custom_config).with_id(tuner_id)
```
But since I was just trying to keep processing time to a minimum while trying out TFX and Vertex, I had set my Tuner's max_trials to 2, i.e. less than num_parallel_trials:
```python
tuner = kt.RandomSearch(
    hypermodel=hypermodel,
    max_trials=2,
    hyperparameters=hyperparams,
    seed=123,
    allow_new_entries=False,
    objective=kt.Objective('val_binary_accuracy', 'max'),
    directory=fn_args.working_dir,
    project_name=project_name)
```
I've been able to stop the issue by increasing max_trials to 3, but ideally this wouldn't be necessary, or there would be some kind of warning or error describing the problem with the setup.
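A minimal sketch of the kind of up-front validation I'd expect. Note that `check_tune_args` is a hypothetical helper, not part of the TFX or KerasTuner APIs; the idea is simply to fail fast, with a clear message, when fewer trials than parallel workers are configured:

```python
# Hypothetical guard: validate tuner settings before building the component.
def check_tune_args(max_trials: int, num_parallel_trials: int) -> None:
    """Raise early when max_trials < num_parallel_trials.

    In that configuration some parallel workers have no trial to run,
    which appears to trigger the intermittent failure described above.
    """
    if max_trials < num_parallel_trials:
        raise ValueError(
            f"max_trials ({max_trials}) must be >= num_parallel_trials "
            f"({num_parallel_trials}); otherwise some parallel tuning "
            "workers have no trial to execute.")


# The failing setup from this issue would be rejected immediately:
check_tune_args(max_trials=3, num_parallel_trials=3)  # OK
try:
    check_tune_args(max_trials=2, num_parallel_trials=3)
except ValueError as e:
    print(f"rejected: {e}")
```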
Top GitHub Comments
To complete the issue, here is an example of code reproducing the problem. The failure is random and occurs roughly once every 6 pipeline runs with the code presented in this notebook.
Thank you for your help.
Hello @pindinagesh, could you take a look? @SylvainGavoille put together a reproducible example that fails about one in every 6 pipeline runs. Can you tell us whether this is an issue on the KerasTuner side or here? As it stands, the Python code fails on the TFX side.
Thanks,