TPE suggestion service failing to provide proposals for new trials
/kind bug
What happened:
I am running an experiment with the tpe algorithm, a maxTrialCount of 10,000, and the following search space:

[
  { "feasibleSpace": { "max": "1", "min": "0" }, "name": "r1", "parameterType": "double" },
  { "feasibleSpace": { "max": "1", "min": "0" }, "name": "r2", "parameterType": "double" },
  { "feasibleSpace": { "max": "1", "min": "0" }, "name": "r3", "parameterType": "double" }
]

The experiment successfully runs 2,057 trials but then fails to propose new ones.
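For reference, the relevant parts of the Experiment spec look roughly like the sketch below. This is a reconstruction from the details above, not the reporter's actual manifest: the metadata is illustrative, and the objective, trialTemplate, and parallelTrialCount sections are omitted.

```yaml
# Sketch of the Experiment pieces described above (v1alpha3 API group, as used
# in this deployment); objective and trialTemplate omitted for brevity.
apiVersion: kubeflow.org/v1alpha3
kind: Experiment
metadata:
  name: tpe-example        # illustrative name
  namespace: kubeflow
spec:
  maxTrialCount: 10000
  algorithm:
    algorithmName: tpe
  parameters:
    - name: r1
      parameterType: double
      feasibleSpace:
        min: "0"
        max: "1"
    - name: r2
      parameterType: double
      feasibleSpace:
        min: "0"
        max: "1"
    - name: r3
      parameterType: double
      feasibleSpace:
        min: "0"
        max: "1"
```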
Looking at the suggestion pod logs, I see:

ERROR:grpc._server:Exception calling application: list index out of range
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/grpc/_server.py", line 434, in _call_behavior
    response_or_iterator = behavior(argument, context)
  File "/opt/katib/pkg/suggestion/v1alpha3/hyperopt/service.py", line 38, in GetSuggestions
    new_assignments = self.base_service.getSuggestions(trials, request.request_number)
  File "/opt/katib/pkg/suggestion/v1alpha3/hyperopt/base_service.py", line 220, in getSuggestions
    vals = new_trials[i]['misc']['vals']
IndexError: list index out of range
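The traceback shows getSuggestions indexing into new_trials by position after asking for request_number new trials. A minimal, hypothetical simplification of that pattern (not the actual Katib code) reproduces the same error when the optimizer returns fewer trial documents than were requested:

```python
# Hypothetical simplification of the indexing pattern seen in the traceback:
# if the optimizer hands back fewer new trial documents than request_number,
# indexing past the end of the list raises IndexError.
request_number = 3
new_trials = [  # suppose only one suggestion came back this round
    {"misc": {"vals": {"r1": [0.42], "r2": [0.11], "r3": [0.97]}}},
]

for i in range(request_number):
    vals = new_trials[i]["misc"]["vals"]  # IndexError when i == 1
    print(vals)
```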
What did you expect to happen:
I expected suggestions for the next 1,000 trials to be proposed. Since the parameters are continuous doubles, it seems very unlikely that the search space could actually be exhausted after ~2,000 trials.
Environment: Running on Google Kubernetes Engine (GKE)
- Kubeflow version (kfctl version): 1.0.0
- Kubernetes version (kubectl version):
  Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-16T11:56:40Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"darwin/amd64"}
  Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.10-gke.42", GitCommit:"42bef28c2031a74fc68840fce56834ff7ea08518", GitTreeState:"clean", BuildDate:"2020-06-02T16:07:00Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}
Top GitHub Comments
Thanks for the reply @andreyvelich! Yes, I see multiple restarts for the experiment and the explanation sounds reasonable. I’ll try your fixes. Let me know if there are any settings you would still be interested in seeing.
@gaocegege I think it's because we specify default memory, CPU, and disk storage for the hyperopt suggestion pod. When the Experiment is large, hyperopt might need more memory to execute.
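A possible workaround, sketched below rather than taken from the original thread, is to raise the suggestion container's resource limits in the katib-config ConfigMap. This assumes a Katib release whose katib-config supports per-algorithm resource overrides; the exact keys, namespace, image path, and resource values shown here are illustrative and vary between versions.

```yaml
# Hypothetical excerpt of the katib-config ConfigMap. The "resources" block
# overrides the defaults the controller applies to the tpe (hyperopt)
# suggestion pod; image and values are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: katib-config
  namespace: kubeflow
data:
  suggestion: |-
    {
      "tpe": {
        "image": "gcr.io/kubeflow-images-public/katib/v1alpha3/suggestion-hyperopt",
        "resources": {
          "requests": { "cpu": "500m", "memory": "1Gi" },
          "limits":   { "cpu": "2",    "memory": "4Gi" }
        }
      }
    }
```

After editing the ConfigMap (for example with kubectl edit configmap katib-config -n kubeflow), the new limits should apply to suggestion pods created for new Experiments; an already-running suggestion deployment may need to be recreated to pick them up.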