Debugging RuntimeError: cholesky_cpu: For batch 2: U(77,77) is zero, singular U.
This is an issue related to https://github.com/facebook/Ax/issues/228, https://github.com/facebook/Ax/issues/99, https://github.com/facebook/Ax/issues/308, and perhaps some others as well, but I'm mainly interested in two things:
- Is there a way to debug what is causing the failure? I think it is related to bad conditioning of the underlying GP model, but I'm not sure how to confirm this (see the sketch after this list).
- When given a set of parameters for a trial generated from the model, is there a way to sample repeatedly in the neighborhood and return a mean and variance for the trial?
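For the first question, the kind of check I have in mind is below: a minimal sketch that computes the condition number of each GP's train-train kernel matrix. It assumes the fitted BoTorch model can be reached as ax_client.generation_strategy.model.model.model; that attribute path is an assumption and may differ across Ax versions.
import numpy as np
import torch

# Assumed access path to the fitted BoTorch model; may differ across Ax versions.
botorch_model = ax_client.generation_strategy.model.model.model
# A multi-metric experiment usually yields a ModelListGP; otherwise treat it as a single GP.
gps = getattr(botorch_model, "models", [botorch_model])

for i, gp in enumerate(gps):
    with torch.no_grad():
        X = gp.train_inputs[0]
        K = gp.covar_module(X).evaluate()  # train-train kernel matrix (observation noise not added)
        cond = np.linalg.cond(K.cpu().numpy())
    print(f"GP {i}: {X.shape[-2]} training points, kernel condition number(s): {cond}")
A very large condition number here would support the ill-conditioning hypothesis.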
Using the service API, my experiment is set up as follows:
ax_client.create_experiment(
    name="test_dct_slpEnrg",
    parameters=[
        {
            "name": "w1",
            "type": "range",
            "value_type": "float",
            "bounds": [1.0e-1, 1.0e2],
        },
        {
            "name": "w2",
            "type": "range",
            "value_type": "float",
            "bounds": [1.0, 1.0e2],
        },
        {
            "name": "w3",
            "type": "range",
            "value_type": "float",
            "bounds": [1.0e-3, 1.0],
        },
        {
            "name": "w4",
            "type": "range",
            "value_type": "int",
            "bounds": [10, 20],
        },
        {
            "name": "w5",
            "type": "range",
            "value_type": "int",
            "bounds": [2, 20],
        },
    ],
    objective_name="Tc2_slpEnrg",
    minimize=True,
    parameter_constraints=["w4 >= w5", "w2 - w1 >= 0"],
    outcome_constraints=["slp_speed <= 3", "engn_trq >= 0.001"],
    choose_generation_strategy_kwargs={
        "num_initialization_trials": num_init,
        "winsorize_botorch_model": True,
        "winsorization_limits": (0.0, 0.3),
    },
)
The sampled parameters are fed into an evaluation function which internally runs an optimization routine that either converges and outputs valid Tc2_slpEnrg, slp_speed, and engn_trq values, or fails to converge. A valid result is indicated by slp_speed <= 3, which I have also placed as an outcome_constraint. I was unsure how to deal with parameter values that were invalid (non-convergence), as discussed in https://github.com/facebook/Ax/issues/372.
Currently, my approach is as follows: during the initial Sobol steps, I use abandon_trial for values which do not converge, and after the Sobol steps, in order to discourage the model from sampling nearby parameters which turned out to be invalid, I set the objective to a high value of 3000, which is not extreme but is very unlikely to occur normally. A condensed sketch of this handling follows.
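Here, evaluate_fn and total_trials are hypothetical stand-ins for my actual evaluation routine and budget, and num_init is the same value passed to choose_generation_strategy_kwargs; this is only a sketch of the loop, not the exact code.
PENALTY = 3000.0  # penalty objective value for non-convergent evaluations

for i in range(total_trials):
    params, trial_index = ax_client.get_next_trial()
    results = evaluate_fn(params)  # dict of metric name -> (mean, sem)
    converged = results["slp_speed"][0] <= 3.0

    if not converged and i < num_init:
        # During the Sobol phase, drop the point entirely.
        ax_client.abandon_trial(trial_index)
        continue
    if not converged:
        # After the Sobol phase, keep the point but penalize the objective
        # to discourage the model from sampling nearby.
        results["Tc2_slpEnrg"] = (PENALTY, 0.0)
    ax_client.complete_trial(trial_index, raw_data=results)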
I think this is the main reason the instability is occurring: nearby values can be very noisy, and the objective can jump between roughly 1000 and 3000 despite very small changes in the parameters. This is why I'd like to sample from a small neighborhood around the generated trial parameters and compute a mean to return as the value. I'm unsure whether Ax supports this or whether it's something I would need to set up through BoTorch.
However, I have also tried to abandon these parameter values (during the GPEI step) and I would still run into these errors, so I am unsure what the actual issue is and how to resolve it.
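For reference, the neighborhood-averaging idea mentioned above would look roughly like the following. This is only a sketch: evaluate_fn is again a hypothetical stand-in for my objective evaluation, only float parameters are perturbed, and parameter bounds/constraints are not re-checked.
import numpy as np

def neighborhood_mean_sem(params, evaluate_fn, n_samples=5, rel_radius=0.01, seed=None):
    # Evaluate at `params` plus small random perturbations of the float
    # parameters, then return the mean and SEM of the objective values.
    # NOTE: this sketch does not re-check parameter bounds or constraints.
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(n_samples):
        perturbed = dict(params)
        for name, value in params.items():
            if isinstance(value, float):
                perturbed[name] = value * (1.0 + rel_radius * rng.standard_normal())
        values.append(evaluate_fn(perturbed))
    values = np.asarray(values, dtype=float)
    return float(values.mean()), float(values.std(ddof=1) / np.sqrt(n_samples))

# Usage idea: report the averaged objective together with its SEM, so the GP
# can model the observation noise instead of treating each value as exact.
# mean, sem = neighborhood_mean_sem(params, evaluate_fn)
# raw_data would then include {"Tc2_slpEnrg": (mean, sem), ...other metrics...}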
Here is a snippet of the trace when the error occurs. Note that I periodically print the best parameter values so far, since the run fails completely once this RuntimeError occurs:
[INFO 09-04 11:32:32] ax.service.ax_client: Generated new trial 630 with parameters {'w1': 19.57, 'w2': 72.48, 'w3': 0.77, 'w4': 18, 'w5': 8}.
[INFO 09-04 11:32:33] ax.service.ax_client: Completed trial 630 with data: {'Tc2_slpEnrg': (1087.14, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.51, 0.0), 'engn_trq': (12.7, 0.0)}.
Completed 125 of 500 trials
[INFO 09-04 11:32:36] ax.service.ax_client: Generated new trial 631 with parameters {'w1': 49.14, 'w2': 59.68, 'w3': 0.27, 'w4': 17, 'w5': 9}.
[INFO 09-04 11:32:37] ax.service.ax_client: Completed trial 631 with data: {'Tc2_slpEnrg': (1082.28, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.46, 0.0), 'engn_trq': (12.64, 0.0)}.
Completed 126 of 500 trials
[INFO 09-04 11:32:40] ax.service.ax_client: Generated new trial 632 with parameters {'w1': 37.7, 'w2': 59.59, 'w3': 0.09, 'w4': 19, 'w5': 9}.
[INFO 09-04 11:32:42] ax.service.ax_client: Completed trial 632 with data: {'Tc2_slpEnrg': (1084.8, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.39, 0.0), 'engn_trq': (12.47, 0.0)}.
Completed 127 of 500 trials
[INFO 09-04 11:32:45] ax.service.ax_client: Generated new trial 633 with parameters {'w1': 71.5, 'w2': 82.59, 'w3': 0.27, 'w4': 20, 'w5': 15}.
Did not converge: (3.869655369315524, 0.0). Setting value to 3000
[INFO 09-04 11:32:48] ax.service.ax_client: Completed trial 633 with data: {'Tc2_slpEnrg': (3000, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (3.87, 0.0), 'engn_trq': (12.63, 0.0)}.
Completed 128 of 500 trials
[INFO 09-04 11:32:51] ax.service.ax_client: Generated new trial 634 with parameters {'w1': 45.01, 'w2': 66.33, 'w3': 0.15, 'w4': 17, 'w5': 9}.
[INFO 09-04 11:32:52] ax.service.ax_client: Completed trial 634 with data: {'Tc2_slpEnrg': (1072.64, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.86, 0.0), 'engn_trq': (12.65, 0.0)}.
Completed 129 of 500 trials
[INFO 09-04 11:32:56] ax.service.ax_client: Generated new trial 635 with parameters {'w1': 53.86, 'w2': 58.84, 'w3': 0.06, 'w4': 18, 'w5': 8}.
[INFO 09-04 11:32:57] ax.service.ax_client: Completed trial 635 with data: {'Tc2_slpEnrg': (1087.49, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (1.98, 0.0), 'engn_trq': (12.64, 0.0)}.
Completed 130 of 500 trials
[INFO 09-04 11:33:00] ax.service.ax_client: Generated new trial 636 with parameters {'w1': 43.28, 'w2': 67.07, 'w3': 0.29, 'w4': 19, 'w5': 9}.
[INFO 09-04 11:33:01] ax.service.ax_client: Completed trial 636 with data: {'Tc2_slpEnrg': (1083.72, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.41, 0.0), 'engn_trq': (12.47, 0.0)}.
Completed 131 of 500 trials
Best params: {'w1': 71.1208035618998, 'w2': 82.38559271674603, 'w3': 0.23882054011337459, 'w4': 20, 'w5': 15} {'slp_speed': 2.3737070108366964, 'engn_trq': 12.243898660289364, 'Tc2_slpEnrg': 1030.4849595920462, 'max_abs_Jerk': 4.059934646446776}
Completed 131 of 500 trials
[INFO 09-04 11:33:04] ax.service.ax_client: Generated new trial 637 with parameters {'w1': 27.6, 'w2': 64.84, 'w3': 0.2, 'w4': 16, 'w5': 9}.
[INFO 09-04 11:33:05] ax.service.ax_client: Completed trial 637 with data: {'Tc2_slpEnrg': (1075.64, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.97, 0.0), 'engn_trq': (12.57, 0.0)}.
Completed 132 of 500 trials
[INFO 09-04 11:33:09] ax.service.ax_client: Generated new trial 638 with parameters {'w1': 44.71, 'w2': 62.99, 'w3': 0.25, 'w4': 18, 'w5': 8}.
[INFO 09-04 11:33:10] ax.service.ax_client: Completed trial 638 with data: {'Tc2_slpEnrg': (1089.75, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.05, 0.0), 'engn_trq': (12.64, 0.0)}.
Completed 133 of 500 trials
[INFO 09-04 11:33:13] ax.service.ax_client: Generated new trial 639 with parameters {'w1': 20.52, 'w2': 70.79, 'w3': 0.8, 'w4': 17, 'w5': 9}.
Did not converge: (3.0800594621335904, 0.0). Setting value to 3000
[INFO 09-04 11:33:14] ax.service.ax_client: Completed trial 639 with data: {'Tc2_slpEnrg': (3000, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (3.08, 0.0), 'engn_trq': (12.74, 0.0)}.
Completed 134 of 500 trials
[INFO 09-04 11:33:18] ax.service.ax_client: Generated new trial 640 with parameters {'w1': 36.74, 'w2': 60.21, 'w3': 0.43, 'w4': 15, 'w5': 9}.
Did not converge: (86.95642479778826, 0.0). Setting value to 3000
[INFO 09-04 11:33:18] ax.service.ax_client: Completed trial 640 with data: {'Tc2_slpEnrg': (3000, 0.0), 'max_abs_Jerk': (2.26, 0.0), 'slp_speed': (86.96, 0.0), 'engn_trq': (70.0, 0.0)}.
Completed 135 of 500 trials
[INFO 09-04 11:33:22] ax.service.ax_client: Generated new trial 641 with parameters {'w1': 13.41, 'w2': 66.27, 'w3': 0.18, 'w4': 17, 'w5': 9}.
[INFO 09-04 11:33:23] ax.service.ax_client: Completed trial 641 with data: {'Tc2_slpEnrg': (1073.76, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.88, 0.0), 'engn_trq': (12.65, 0.0)}.
Completed 136 of 500 trials
[INFO 09-04 11:33:27] ax.service.ax_client: Generated new trial 642 with parameters {'w1': 28.77, 'w2': 66.15, 'w3': 0.22, 'w4': 16, 'w5': 9}.
[INFO 09-04 11:33:28] ax.service.ax_client: Completed trial 642 with data: {'Tc2_slpEnrg': (1074.72, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.98, 0.0), 'engn_trq': (12.57, 0.0)}.
Completed 137 of 500 trials
[INFO 09-04 11:33:32] ax.service.ax_client: Generated new trial 643 with parameters {'w1': 25.92, 'w2': 67.46, 'w3': 0.79, 'w4': 17, 'w5': 9}.
Did not converge: (3.105967542260089, 0.0). Setting value to 3000
[INFO 09-04 11:33:33] ax.service.ax_client: Completed trial 643 with data: {'Tc2_slpEnrg': (3000, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (3.11, 0.0), 'engn_trq': (12.74, 0.0)}.
Completed 138 of 500 trials
[INFO 09-04 11:33:37] ax.service.ax_client: Generated new trial 644 with parameters {'w1': 37.94, 'w2': 66.57, 'w3': 0.32, 'w4': 18, 'w5': 8}.
[INFO 09-04 11:33:38] ax.service.ax_client: Completed trial 644 with data: {'Tc2_slpEnrg': (1086.47, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.48, 0.0), 'engn_trq': (12.61, 0.0)}.
Completed 139 of 500 trials
[INFO 09-04 11:33:41] ax.service.ax_client: Generated new trial 645 with parameters {'w1': 38.19, 'w2': 65.75, 'w3': 0.19, 'w4': 18, 'w5': 9}.
[INFO 09-04 11:33:43] ax.service.ax_client: Completed trial 645 with data: {'Tc2_slpEnrg': (1072.85, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.8, 0.0), 'engn_trq': (12.71, 0.0)}.
Completed 140 of 500 trials
[INFO 09-04 11:33:47] ax.service.ax_client: Generated new trial 646 with parameters {'w1': 37.22, 'w2': 65.46, 'w3': 0.23, 'w4': 17, 'w5': 8}.
[INFO 09-04 11:33:47] ax.service.ax_client: Completed trial 646 with data: {'Tc2_slpEnrg': (1085.76, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.56, 0.0), 'engn_trq': (12.53, 0.0)}.
Completed 141 of 500 trials
Best params: {'w1': 71.1208035618998, 'w2': 82.38559271674603, 'w3': 0.23882054011337459, 'w4': 20, 'w5': 15} {'slp_speed': 2.373707491338672, 'engn_trq': 12.243896324181325, 'Tc2_slpEnrg': 1030.4849537960213, 'max_abs_Jerk': 4.059934784356625}
Completed 141 of 500 trials
Traceback (most recent call last):
File "/home/mlab/gitRepo/cvt_opt/cvt_bayes_opt/dct_service_debug.py", line 170, in <module>
trial_params, trial_index = ax_client.get_next_trial()
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/service/ax_client.py", line 275, in get_next_trial
trial = self.experiment.new_trial(generator_run=self._gen_new_generator_run())
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/service/ax_client.py", line 865, in _gen_new_generator_run
experiment=self.experiment
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/modelbridge/generation_strategy.py", line 376, in gen
keywords=get_function_argument_names(model.gen),
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/modelbridge/base.py", line 626, in gen
model_gen_options=model_gen_options,
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/modelbridge/array.py", line 238, in _gen
target_fidelities=target_fidelities,
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/modelbridge/torch.py", line 260, in _model_best_point
target_fidelities=target_fidelities,
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/torch/botorch.py", line 458, in best_point
target_fidelities=target_fidelities,
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/torch/botorch_defaults.py", line 353, in recommend_best_observed_point
options=model_gen_options,
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/model_utils.py", line 296, in best_observed_point
options=options,
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/model_utils.py", line 399, in best_in_sample_point
f, cov = as_array(model.predict(X_obs))
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/torch/botorch.py", line 314, in predict
return self.model_predictor(model=self.model, X=X) # pyre-ignore [28]
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/torch/utils.py", line 454, in predict_from_model
posterior = model.posterior(X)
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/botorch/models/gpytorch.py", line 301, in posterior
mvn = self(X)
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/models/exact_gp.py", line 328, in __call__
predictive_mean, predictive_covar = self.prediction_strategy.exact_prediction(full_mean, full_covar)
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/models/exact_prediction_strategies.py", line 302, in exact_prediction
self.exact_predictive_mean(test_mean, test_train_covar),
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/models/exact_prediction_strategies.py", line 320, in exact_predictive_mean
res = (test_train_covar @ self.mean_cache.unsqueeze(-1)).squeeze(-1)
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/utils/memoize.py", line 34, in g
add_to_cache(self, cache_name, method(self, *args, **kwargs))
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/models/exact_prediction_strategies.py", line 269, in mean_cache
mean_cache = train_train_covar.inv_matmul(train_labels_offset).squeeze(-1)
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/lazy/lazy_tensor.py", line 934, in inv_matmul
return func.apply(self.representation_tree(), False, right_tensor, *self.representation())
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/functions/_inv_matmul.py", line 47, in forward
solves = _solve(lazy_tsr, right_tensor)
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/functions/_inv_matmul.py", line 11, in _solve
return lazy_tsr._cholesky()._cholesky_solve(rhs)
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/utils/memoize.py", line 34, in g
add_to_cache(self, cache_name, method(self, *args, **kwargs))
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/lazy/lazy_tensor.py", line 414, in _cholesky
cholesky = psd_safe_cholesky(evaluated_mat).contiguous()
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/utils/cholesky.py", line 48, in psd_safe_cholesky
raise e
File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/utils/cholesky.py", line 25, in psd_safe_cholesky
L = torch.cholesky(A, upper=upper, out=out)
RuntimeError: cholesky_cpu: For batch 2: U(99,99) is zero, singular U.
Please let me know what your thoughts are about my problem and how I should proceed. Thanks!
Top GitHub Comments
I have 3 suggestions to start, but I’ll need to follow up after more research:
Closing this and moving discussion to #228, since the ways to address the issue are mostly the same in the two cases.