Grid Search stuck when parallelTrialCount < maxTrialCount
/kind bug
What steps did you take and what happened:
I created an experiment to perform a grid search, set maxTrialCount equal to the size of the grid, and set parallelTrialCount < maxTrialCount.
The first set of parallel trials completes successfully, but at some point after that the experiment gets stuck. The suggestion shows requested > assigned and has the following warning:
Warning ReconcileError 4s (x12 over 31s) suggestion-controller The response contains unexpected trials
What did you expect to happen: Experiment to execute trials for each point in the parameter space and then complete.
Anything else you would like to add:
I am running this image for the katib-controller: gcr.io/kubeflow-images-public/katib/v1alpha3/katib-controller:latest (I also tested with gcr.io/kubeflow-images-public/katib/v1alpha3/katib-controller:v0.8.0)
This issue is likely related to https://github.com/kubeflow/katib/issues/1494
If I set parallelTrialCount >= maxTrialCount (enabling all runs to execute immediately) the experiment succeeds.
Environment:
- Kubeflow version (kfctl version): v1.0.1
- Minikube version (minikube version): N/A
- Kubernetes version: (use kubectl version): v1.18.17-gke.100
- OS (e.g. from /etc/os-release):
experiment.yaml:
apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: sid
  labels:
    controller-tools.k8s.io: "1.0"
  name: minimal-grid
spec:
  objective:
    type: maximize
    goal: 999999
    objectiveMetricName: value
  resumePolicy: Never
  algorithm:
    algorithmName: grid
  parallelTrialCount: 5
  maxTrialCount: 15
  maxFailedTrialCount: 3
  parameters:
    - name: --param_1
      parameterType: int
      feasibleSpace:
        min: "0"
        max: "15"
  trialTemplate:
    goTemplate:
      rawTemplate: |-
        apiVersion: batch/v1
        kind: Job
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          template:
            spec:
              containers:
              - name: {{.Trial}}
                image: <IMAGE>
                command:
                - "python3"
                - "return_param_value.py"
                {{- with .HyperParameters}}
                {{- range .}}
                - "{{.Name}}={{.Value}}"
                {{- end}}
                {{- end}}
              restartPolicy: Never
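The contents of return_param_value.py are not included in this report. As a rough idea of what such a trial script could look like, here is a hypothetical sketch. It assumes the script receives `--param_1=<int>` from the template above and that the metrics collector reads `name=value` lines from stdout. Both the function name and the output format are assumptions, not taken from the original image.

```python
import argparse
import sys


def metric_line(argv):
    """Parse --param_1 and echo it back as the objective metric (hypothetical)."""
    parser = argparse.ArgumentParser()
    # The experiment names the parameter "--param_1", so the trial command
    # receives an argument of the form "--param_1=<value>".
    parser.add_argument("--param_1", type=int, required=True)
    args = parser.parse_args(argv)
    # "value" matches objectiveMetricName in the experiment spec.
    return f"value={args.param_1}"


# Only run when arguments are actually passed, as they are in the trial job.
if __name__ == "__main__" and len(sys.argv) > 1:
    print(metric_line(sys.argv[1:]))
```

A script like this finishes almost immediately, so the behavior of the trial itself should not matter for reproducing the issue.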
Result of kubectl describe suggestion <suggestion-name>:
Name:         minimal-grid
Namespace:    sid
Labels:       controller-tools.k8s.io=1.0
Annotations:  <none>
API Version:  kubeflow.org/v1alpha3
Kind:         Suggestion
Metadata:
  Creation Timestamp:  2021-05-18T05:58:39Z
  Generation:          9
  Managed Fields:
    API Version:  kubeflow.org/v1alpha3
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
        f:labels:
          .:
          f:controller-tools.k8s.io:
        f:ownerReferences:
      f:spec:
        .:
        f:algorithmName:
        f:requests:
      f:status:
        .:
        f:conditions:
        f:startTime:
        f:suggestionCount:
        f:suggestions:
    Manager:    katib-controller
    Operation:  Update
    Time:       2021-05-18T05:58:56Z
  Owner References:
    API Version:           kubeflow.org/v1alpha3
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Experiment
    Name:                  minimal-grid
    UID:                   90e538b2-5858-45d2-af0e-38c0837d6a8d
  Resource Version:        69731263
  Self Link:               /apis/kubeflow.org/v1alpha3/namespaces/sid/suggestions/minimal-grid
  UID:                     3875d508-b3ee-4426-88a7-1b307c381249
Spec:
  Algorithm Name:  grid
  Requests:        13
Status:
  Conditions:
    Last Transition Time:  2021-05-18T05:58:39Z
    Last Update Time:      2021-05-18T05:58:39Z
    Message:               Suggestion is created
    Reason:                SuggestionCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2021-05-18T05:58:54Z
    Last Update Time:      2021-05-18T05:58:54Z
    Message:               Deployment is ready
    Reason:                DeploymentReady
    Status:                True
    Type:                  DeploymentReady
    Last Transition Time:  2021-05-18T05:58:55Z
    Last Update Time:      2021-05-18T05:58:55Z
    Message:               Suggestion is running
    Reason:                SuggestionRunning
    Status:                True
    Type:                  Running
  Start Time:        2021-05-18T05:58:39Z
  Suggestion Count:  8
  Suggestions:
    Name:  minimal-grid-dq75pd5j
    Parameter Assignments:
      Name:   --param_1
      Value:  0
    Name:     minimal-grid-jsl78k9b
    Parameter Assignments:
      Name:   --param_1
      Value:  1
    Name:     minimal-grid-nbhlss7f
    Parameter Assignments:
      Name:   --param_1
      Value:  2
    Name:     minimal-grid-p87q65dk
    Parameter Assignments:
      Name:   --param_1
      Value:  3
    Name:     minimal-grid-dt2s5jd5
    Parameter Assignments:
      Name:   --param_1
      Value:  4
    Name:     minimal-grid-jxtrdnk9
    Parameter Assignments:
      Name:   --param_1
      Value:  6
    Name:     minimal-grid-l2t2gpc8
    Parameter Assignments:
      Name:   --param_1
      Value:  9
    Name:     minimal-grid-4dpdlj2v
    Parameter Assignments:
      Name:   --param_1
      Value:  13
Events:
  Type     Reason          Age               From                   Message
  ----     ------          ----              ----                   -------
  Warning  ReconcileError  0s (x8 over 15s)  suggestion-controller  The response contains unexpected trials
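To make the gap in the output above easier to see, here is a small sketch that compares the requested and assigned counts and lists the grid values that were passed over. The numbers are copied from the describe output. The only assumption is that the grid steps through the integer range one value at a time.

```python
# Values copied from the `kubectl describe suggestion` output above.
requests = 13          # Spec -> Requests
suggestion_count = 8   # Status -> Suggestion Count
assigned = [0, 1, 2, 3, 4, 6, 9, 13]  # --param_1 values, in the order listed

# Trials that were requested from the suggestion but never assigned.
outstanding = requests - suggestion_count

# Assumption: the grid advances in steps of 1, so any integer between the
# lowest and highest assigned value that is absent was skipped.
skipped = [
    v for v in range(min(assigned), max(assigned) + 1) if v not in assigned
]

print(f"outstanding requests: {outstanding}")
print(f"grid values skipped so far: {skipped}")
```

With these numbers the sketch reports 5 outstanding requests and skipped values 5, 7, 8, 10, 11 and 12. The first five assignments are contiguous, and the gaps appear only in the later ones.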
Top GitHub Comments
I agree. This is a bug which leaks out certain trials.
/area release