
Debugging RuntimeError: cholesky_cpu: For batch 2: U(77,77) is zero, singular U.

See original GitHub issue

This is an issue related to https://github.com/facebook/Ax/issues/228, https://github.com/facebook/Ax/issues/99, https://github.com/facebook/Ax/issues/308, and maybe some others as well, but I’m more interested in two things:

  1. Is there a way to debug what is causing the failure? I think it is related to ill-conditioning of the underlying GP model, but I’m not sure how to confirm this (see the sketch after this list).
  2. When given a set of parameters for a trial generated from the model, is there a way to sample repeatedly in its neighborhood and return a mean and variance for the trial?
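
For question 1, one rough way to test the ill-conditioning hypothesis is to pull the fitted GPyTorch model out of the generation strategy and look at the condition number of the training covariance matrix. This is only a sketch: the attribute chain below is an assumption and may differ across Ax versions, and a multi-metric experiment usually gives a ModelListGP with one GP per metric.

import numpy as np

# Assumption: the current generation-strategy step is a BoTorch-backed model
# bridge; the exact attribute chain may differ between Ax versions.
model_bridge = ax_client.generation_strategy.model   # Ax ModelBridge
botorch_model = model_bridge.model.model             # fitted BoTorch/GPyTorch model

# ModelListGP exposes .models (one GP per metric); otherwise treat it as a single GP.
for i, gp in enumerate(getattr(botorch_model, "models", [botorch_model])):
    X_train = gp.train_inputs[0]
    # Dense kernel matrix over the training inputs (no likelihood noise term).
    # In newer GPyTorch versions .evaluate() is called .to_dense().
    K = gp.covar_module(X_train).evaluate()
    cond = np.linalg.cond(K.detach().numpy())
    print(f"GP {i}: covariance condition number {cond}")
# A very large condition number (say > 1e8) means the matrix is close to
# singular, which would be consistent with the Cholesky failure below.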

Using the service API, my experiment is set up as follows:

ax_client.create_experiment(
    name="test_dct_slpEnrg",
    parameters=[
        {"name": "w1", "type": "range", "value_type": "float", "bounds": [1.0e-1, 1.0e2]},
        {"name": "w2", "type": "range", "value_type": "float", "bounds": [1.0, 1.0e2]},
        {"name": "w3", "type": "range", "value_type": "float", "bounds": [1.0e-3, 1.0]},
        {"name": "w4", "type": "range", "value_type": "int", "bounds": [10, 20]},
        {"name": "w5", "type": "range", "value_type": "int", "bounds": [2, 20]},
    ],
    objective_name="Tc2_slpEnrg",
    minimize=True,
    parameter_constraints=["w4 >= w5", "w2 - w1 >= 0"],
    outcome_constraints=["slp_speed <= 3", "engn_trq >= 0.001"],
    choose_generation_strategy_kwargs={
        "num_initialization_trials": num_init,
        "winsorize_botorch_model": True,
        "winsorization_limits": (0.0, 0.3),
    },
)

The sampled parameters are fed into an evaluation function that internally runs an optimization routine, which either converges and outputs valid Tc2_slpEnrg, slp_speed, and engn_trq values, or fails to converge. A valid result is indicated by slp_speed <= 3, which I have also placed as an outcome_constraint. I was unsure how to deal with parameter values that were ‘invalid’ (non-convergent), as discussed in https://github.com/facebook/Ax/issues/372.

Currently, my approach is: for the initial Sobol steps, I use abandon_trial for runs that do not converge, and after the Sobol steps, in order to discourage the model from sampling nearby parameters that turned out to be invalid, I set the objective to a high value of 3000, which is not extreme but very unlikely to occur normally.
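
In code, this handling looks roughly like the following (evaluate_design, num_init, and num_trials are placeholders for my own evaluation function and loop bounds):

PENALTY = 3000.0  # dummy objective assigned to non-convergent runs after the Sobol phase

for i in range(num_trials):
    params, trial_index = ax_client.get_next_trial()
    results, converged = evaluate_design(params)  # placeholder evaluation function
    if not converged:
        if i < num_init:
            # During the initial Sobol trials, drop the failed run entirely.
            ax_client.abandon_trial(trial_index)
            continue
        # After the Sobol phase, report a large penalty value instead.
        results["Tc2_slpEnrg"] = (PENALTY, 0.0)
    ax_client.complete_trial(trial_index=trial_index, raw_data=results)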

I think this is the main reason the instability occurs: nearby values can be very noisy, and the objective can jump between roughly 1000 and 3000 despite very small changes in the parameters. This is why I’d like to sample from a small neighborhood around the generated trial parameters and return the mean as the value. I’m unsure whether Ax supports this feature or whether it’s something I would need to set up through BoTorch.
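
One way to get this behavior without any special Ax/BoTorch support is to do the averaging inside the evaluation function itself, by evaluating a few jittered copies of the suggested point and reporting the sample mean and standard error (the logs above show that Ax accepts metric values as (mean, SEM) tuples). A minimal sketch, where evaluate_design is again a placeholder and the jitter scale is arbitrary:

import numpy as np

def evaluate_with_neighborhood(params, n_samples=5, rel_jitter=0.01):
    # Perturb only the float parameters; integer parameters are left fixed.
    # NOTE: a real version should clip the perturbed values to the search-space
    # bounds and respect the w2 - w1 >= 0 constraint.
    values = []
    for _ in range(n_samples):
        perturbed = dict(params)
        for name in ("w1", "w2", "w3"):
            perturbed[name] = params[name] * (1.0 + rel_jitter * np.random.randn())
        results, converged = evaluate_design(perturbed)  # placeholder
        if converged:
            values.append(results["Tc2_slpEnrg"][0])
    if not values:
        return None  # every perturbed evaluation failed; handle as a failed trial
    mean = float(np.mean(values))
    sem = float(np.std(values, ddof=1) / np.sqrt(len(values))) if len(values) > 1 else 0.0
    return {"Tc2_slpEnrg": (mean, sem)}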

However, I have also tried abandoning these parameter values during the GPEI step and still ran into these errors, so I am unsure what the actual issue is and how to resolve it.

Here is a snippet of the trace when the error occurs. Note that I periodically print the best parameter values so far, since the run fails completely when this RuntimeError occurs:

[INFO 09-04 11:32:32] ax.service.ax_client: Generated new trial 630 with parameters {'w1': 19.57, 'w2': 72.48, 'w3': 0.77, 'w4': 18, 'w5': 8}.
[INFO 09-04 11:32:33] ax.service.ax_client: Completed trial 630 with data: {'Tc2_slpEnrg': (1087.14, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.51, 0.0), 'engn_trq': (12.7, 0.0)}.
Completed 125 of 500 trials
[INFO 09-04 11:32:36] ax.service.ax_client: Generated new trial 631 with parameters {'w1': 49.14, 'w2': 59.68, 'w3': 0.27, 'w4': 17, 'w5': 9}.
[INFO 09-04 11:32:37] ax.service.ax_client: Completed trial 631 with data: {'Tc2_slpEnrg': (1082.28, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.46, 0.0), 'engn_trq': (12.64, 0.0)}.
Completed 126 of 500 trials
[INFO 09-04 11:32:40] ax.service.ax_client: Generated new trial 632 with parameters {'w1': 37.7, 'w2': 59.59, 'w3': 0.09, 'w4': 19, 'w5': 9}.
[INFO 09-04 11:32:42] ax.service.ax_client: Completed trial 632 with data: {'Tc2_slpEnrg': (1084.8, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.39, 0.0), 'engn_trq': (12.47, 0.0)}.
Completed 127 of 500 trials
[INFO 09-04 11:32:45] ax.service.ax_client: Generated new trial 633 with parameters {'w1': 71.5, 'w2': 82.59, 'w3': 0.27, 'w4': 20, 'w5': 15}.
Did not converge: (3.869655369315524, 0.0). Setting value to 3000
[INFO 09-04 11:32:48] ax.service.ax_client: Completed trial 633 with data: {'Tc2_slpEnrg': (3000, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (3.87, 0.0), 'engn_trq': (12.63, 0.0)}.
Completed 128 of 500 trials
[INFO 09-04 11:32:51] ax.service.ax_client: Generated new trial 634 with parameters {'w1': 45.01, 'w2': 66.33, 'w3': 0.15, 'w4': 17, 'w5': 9}.
[INFO 09-04 11:32:52] ax.service.ax_client: Completed trial 634 with data: {'Tc2_slpEnrg': (1072.64, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.86, 0.0), 'engn_trq': (12.65, 0.0)}.
Completed 129 of 500 trials
[INFO 09-04 11:32:56] ax.service.ax_client: Generated new trial 635 with parameters {'w1': 53.86, 'w2': 58.84, 'w3': 0.06, 'w4': 18, 'w5': 8}.
[INFO 09-04 11:32:57] ax.service.ax_client: Completed trial 635 with data: {'Tc2_slpEnrg': (1087.49, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (1.98, 0.0), 'engn_trq': (12.64, 0.0)}.
Completed 130 of 500 trials
[INFO 09-04 11:33:00] ax.service.ax_client: Generated new trial 636 with parameters {'w1': 43.28, 'w2': 67.07, 'w3': 0.29, 'w4': 19, 'w5': 9}.
[INFO 09-04 11:33:01] ax.service.ax_client: Completed trial 636 with data: {'Tc2_slpEnrg': (1083.72, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.41, 0.0), 'engn_trq': (12.47, 0.0)}.
Completed 131 of 500 trials
Best params: {'w1': 71.1208035618998, 'w2': 82.38559271674603, 'w3': 0.23882054011337459, 'w4': 20, 'w5': 15}  {'slp_speed': 2.3737070108366964, 'engn_trq': 12.243898660289364, 'Tc2_slpEnrg': 1030.4849595920462, 'max_abs_Jerk': 4.059934646446776}
Completed 131 of 500 trials
[INFO 09-04 11:33:04] ax.service.ax_client: Generated new trial 637 with parameters {'w1': 27.6, 'w2': 64.84, 'w3': 0.2, 'w4': 16, 'w5': 9}.
[INFO 09-04 11:33:05] ax.service.ax_client: Completed trial 637 with data: {'Tc2_slpEnrg': (1075.64, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.97, 0.0), 'engn_trq': (12.57, 0.0)}.
Completed 132 of 500 trials
[INFO 09-04 11:33:09] ax.service.ax_client: Generated new trial 638 with parameters {'w1': 44.71, 'w2': 62.99, 'w3': 0.25, 'w4': 18, 'w5': 8}.
[INFO 09-04 11:33:10] ax.service.ax_client: Completed trial 638 with data: {'Tc2_slpEnrg': (1089.75, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.05, 0.0), 'engn_trq': (12.64, 0.0)}.
Completed 133 of 500 trials
[INFO 09-04 11:33:13] ax.service.ax_client: Generated new trial 639 with parameters {'w1': 20.52, 'w2': 70.79, 'w3': 0.8, 'w4': 17, 'w5': 9}.
Did not converge: (3.0800594621335904, 0.0). Setting value to 3000
[INFO 09-04 11:33:14] ax.service.ax_client: Completed trial 639 with data: {'Tc2_slpEnrg': (3000, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (3.08, 0.0), 'engn_trq': (12.74, 0.0)}.
Completed 134 of 500 trials
[INFO 09-04 11:33:18] ax.service.ax_client: Generated new trial 640 with parameters {'w1': 36.74, 'w2': 60.21, 'w3': 0.43, 'w4': 15, 'w5': 9}.
Did not converge: (86.95642479778826, 0.0). Setting value to 3000
[INFO 09-04 11:33:18] ax.service.ax_client: Completed trial 640 with data: {'Tc2_slpEnrg': (3000, 0.0), 'max_abs_Jerk': (2.26, 0.0), 'slp_speed': (86.96, 0.0), 'engn_trq': (70.0, 0.0)}.
Completed 135 of 500 trials
[INFO 09-04 11:33:22] ax.service.ax_client: Generated new trial 641 with parameters {'w1': 13.41, 'w2': 66.27, 'w3': 0.18, 'w4': 17, 'w5': 9}.
[INFO 09-04 11:33:23] ax.service.ax_client: Completed trial 641 with data: {'Tc2_slpEnrg': (1073.76, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.88, 0.0), 'engn_trq': (12.65, 0.0)}.
Completed 136 of 500 trials
[INFO 09-04 11:33:27] ax.service.ax_client: Generated new trial 642 with parameters {'w1': 28.77, 'w2': 66.15, 'w3': 0.22, 'w4': 16, 'w5': 9}.
[INFO 09-04 11:33:28] ax.service.ax_client: Completed trial 642 with data: {'Tc2_slpEnrg': (1074.72, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.98, 0.0), 'engn_trq': (12.57, 0.0)}.
Completed 137 of 500 trials
[INFO 09-04 11:33:32] ax.service.ax_client: Generated new trial 643 with parameters {'w1': 25.92, 'w2': 67.46, 'w3': 0.79, 'w4': 17, 'w5': 9}.
Did not converge: (3.105967542260089, 0.0). Setting value to 3000
[INFO 09-04 11:33:33] ax.service.ax_client: Completed trial 643 with data: {'Tc2_slpEnrg': (3000, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (3.11, 0.0), 'engn_trq': (12.74, 0.0)}.
Completed 138 of 500 trials
[INFO 09-04 11:33:37] ax.service.ax_client: Generated new trial 644 with parameters {'w1': 37.94, 'w2': 66.57, 'w3': 0.32, 'w4': 18, 'w5': 8}.
[INFO 09-04 11:33:38] ax.service.ax_client: Completed trial 644 with data: {'Tc2_slpEnrg': (1086.47, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.48, 0.0), 'engn_trq': (12.61, 0.0)}.
Completed 139 of 500 trials
[INFO 09-04 11:33:41] ax.service.ax_client: Generated new trial 645 with parameters {'w1': 38.19, 'w2': 65.75, 'w3': 0.19, 'w4': 18, 'w5': 9}.
[INFO 09-04 11:33:43] ax.service.ax_client: Completed trial 645 with data: {'Tc2_slpEnrg': (1072.85, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.8, 0.0), 'engn_trq': (12.71, 0.0)}.
Completed 140 of 500 trials
[INFO 09-04 11:33:47] ax.service.ax_client: Generated new trial 646 with parameters {'w1': 37.22, 'w2': 65.46, 'w3': 0.23, 'w4': 17, 'w5': 8}.
[INFO 09-04 11:33:47] ax.service.ax_client: Completed trial 646 with data: {'Tc2_slpEnrg': (1085.76, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.56, 0.0), 'engn_trq': (12.53, 0.0)}.
Completed 141 of 500 trials
Best params: {'w1': 71.1208035618998, 'w2': 82.38559271674603, 'w3': 0.23882054011337459, 'w4': 20, 'w5': 15}  {'slp_speed': 2.373707491338672, 'engn_trq': 12.243896324181325, 'Tc2_slpEnrg': 1030.4849537960213, 'max_abs_Jerk': 4.059934784356625}
Completed 141 of 500 trials


Traceback (most recent call last):
  File "/home/mlab/gitRepo/cvt_opt/cvt_bayes_opt/dct_service_debug.py", line 170, in <module>
    trial_params, trial_index = ax_client.get_next_trial()
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/service/ax_client.py", line 275, in get_next_trial
    trial = self.experiment.new_trial(generator_run=self._gen_new_generator_run())
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/service/ax_client.py", line 865, in _gen_new_generator_run
    experiment=self.experiment
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/modelbridge/generation_strategy.py", line 376, in gen
    keywords=get_function_argument_names(model.gen),
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/modelbridge/base.py", line 626, in gen
    model_gen_options=model_gen_options,
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/modelbridge/array.py", line 238, in _gen
    target_fidelities=target_fidelities,
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/modelbridge/torch.py", line 260, in _model_best_point
    target_fidelities=target_fidelities,
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/torch/botorch.py", line 458, in best_point
    target_fidelities=target_fidelities,
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/torch/botorch_defaults.py", line 353, in recommend_best_observed_point
    options=model_gen_options,
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/model_utils.py", line 296, in best_observed_point
    options=options,
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/model_utils.py", line 399, in best_in_sample_point
    f, cov = as_array(model.predict(X_obs))
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/torch/botorch.py", line 314, in predict
    return self.model_predictor(model=self.model, X=X)  # pyre-ignore [28]
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/torch/utils.py", line 454, in predict_from_model
    posterior = model.posterior(X)
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/botorch/models/gpytorch.py", line 301, in posterior
    mvn = self(X)
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/models/exact_gp.py", line 328, in __call__
    predictive_mean, predictive_covar = self.prediction_strategy.exact_prediction(full_mean, full_covar)
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/models/exact_prediction_strategies.py", line 302, in exact_prediction
    self.exact_predictive_mean(test_mean, test_train_covar),
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/models/exact_prediction_strategies.py", line 320, in exact_predictive_mean
    res = (test_train_covar @ self.mean_cache.unsqueeze(-1)).squeeze(-1)
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/utils/memoize.py", line 34, in g
    add_to_cache(self, cache_name, method(self, *args, **kwargs))
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/models/exact_prediction_strategies.py", line 269, in mean_cache
    mean_cache = train_train_covar.inv_matmul(train_labels_offset).squeeze(-1)
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/lazy/lazy_tensor.py", line 934, in inv_matmul
    return func.apply(self.representation_tree(), False, right_tensor, *self.representation())
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/functions/_inv_matmul.py", line 47, in forward
    solves = _solve(lazy_tsr, right_tensor)
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/functions/_inv_matmul.py", line 11, in _solve
    return lazy_tsr._cholesky()._cholesky_solve(rhs)
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/utils/memoize.py", line 34, in g
    add_to_cache(self, cache_name, method(self, *args, **kwargs))
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/lazy/lazy_tensor.py", line 414, in _cholesky
    cholesky = psd_safe_cholesky(evaluated_mat).contiguous()
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/utils/cholesky.py", line 48, in psd_safe_cholesky
    raise e
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/utils/cholesky.py", line 25, in psd_safe_cholesky
    L = torch.cholesky(A, upper=upper, out=out)
RuntimeError: cholesky_cpu: For batch 2: U(99,99) is zero, singular U.

Please let me know what your thoughts are about my problem and how I should proceed. Thanks!

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

2 reactions · 2timesjay commented, Sep 8, 2020

I have 3 suggestions to start, but I’ll need to follow up after more research:

  1. Definitely do not use this “3000” dummy value. The GP will have a very hard time fitting it. We don’t have a systematic way to deal with failed trials like this, unfortunately, except for…
  2. Add robustness, retries, repeated sampling, or other strategies to the evaluation function directly. If you followed the service API tutorial, https://ax.dev/tutorials/gpei_hartmann_service.html#3.-Define-how-to-evaluate-trials defines the “evaluate” function and can accommodate any process you want to retry evaluations and reduce variance (but we don’t offer any utilities to help with this directly; see the sketch after this list).
  3. If #1 and #2 reduce the number of trials you have to run and the number of non-convergent evaluations, you should be less likely to encounter Cholesky errors.
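
A minimal sketch of suggestion 2, assuming the evaluation function (evaluate_design, a placeholder) returns a convergence flag and that re-running it can succeed, e.g. because the inner optimizer uses random restarts; trials that still fail are abandoned rather than given a dummy objective:

def robust_evaluate(params, max_retries=3):
    # Retry a non-convergent evaluation a few times before declaring failure.
    # Retries only help if the inner optimization is stochastic in some way.
    for _ in range(max_retries):
        results, converged = evaluate_design(params)  # placeholder
        if converged:
            return results
    return None

params, trial_index = ax_client.get_next_trial()
results = robust_evaluate(params)
if results is None:
    # Don't feed a fake objective to the GP; drop the trial instead.
    ax_client.abandon_trial(trial_index)
else:
    ax_client.complete_trial(trial_index=trial_index, raw_data=results)
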
1 reaction · lena-kashtelyan commented, Sep 29, 2020

Closing this and moving discussion to #228, since the ways to address the issue are mostly the same in the two cases.


