Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

RuntimeError: cholesky_cpu: U(63,63) is zero, singular U.

See original GitHub issue

This is very similar to #99 but with newer versions of everything. Again, trying to tune the hyperparameters of a neural net, and got this runtime error.

ax=0.1.11
botorch=0.2.0
gpytorch=1.1.1
torch=1.5.0
python=3.8.1
os=Ubuntu 18.04

I have a lot of data that might be enough to reproduce the failure with some effort, but am likely missing a few key things. When I turn it back on I’ll make sure I capture the random seed, and turn up the auto-save frequency.

Here’s the parameter definition I was using:

ax_range = [
    {
        "name": "dropout",
        "type": "range",
        "bounds": [0.0, 1.0],
        "value_type": "float",  
        "log_scale": False, 
    },
    {
        "name": "num_layers",
        "type": "range",
        "bounds": [1, 6],
        "value_type": "int",
        "log_scale": False,
    },
    {
        "name": "fc_dim",
        "type": "range",
        "bounds": [10, 1000],
        "value_type": "int",
        "log_scale": True,
    },
    {
        "name": "lr",
        "type": "range",
        "bounds": [1e-5, 0.1],
        "value_type": "float",
        "log_scale": True,
    },
]

I recorded most of the configurations & objectives here: singular-cholesky.csv.log, but lost the last few because they didn’t auto-save before the crash. The final few objectives (rounded) were:

run=5 trial=104 score=2.57 best_score=1.69
run=5 trial=105 score=1.78 best_score=1.69
run=5 trial=106 score=2.35 best_score=1.69
run=5 trial=107 score=1.82 best_score=1.69
run=5 trial=108 score=3.96 best_score=1.69
run=5 trial=109 score=2.61 best_score=1.69
run=5 trial=110 score=2.14 best_score=1.69
run=5 trial=111 score=1.71 best_score=1.69
run=5 trial=112 score=2.39 best_score=1.69
run=5 trial=113 score=2.00 best_score=1.69

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-9-df78cf250713> in <module>
...

full stack trace:

~/anaconda3/envs/cvnlp/lib/python3.8/site-packages/ax/service/ax_client.py in get_next_trial(self)
    289             Tuple of trial parameterization, trial index
    290         """
--> 291         trial = self.experiment.new_trial(generator_run=self._gen_new_generator_run())
    292         logger.info(
    293             f"Generated new trial {trial.index} with parameters "

~/anaconda3/envs/cvnlp/lib/python3.8/site-packages/ax/service/ax_client.py in _gen_new_generator_run(self, n)
    928             # Filter out GPYTorch warnings to avoid confusing users.
    929             warnings.simplefilter("ignore")
--> 930             return not_none(self.generation_strategy).gen(
    931                 experiment=self.experiment,
    932                 n=n,

~/anaconda3/envs/cvnlp/lib/python3.8/site-packages/ax/modelbridge/generation_strategy.py in gen(self, experiment, data, n, **kwargs)
    339             )
    340         model = not_none(self.model)
--> 341         generator_run = model.gen(
    342             n=n,
    343             **consolidate_kwargs(

~/anaconda3/envs/cvnlp/lib/python3.8/site-packages/ax/modelbridge/base.py in gen(self, n, search_space, optimization_config, pending_observations, fixed_features, model_gen_options)
    608 
    609         # Apply terminal transform and gen
--> 610         observation_features, weights, best_obsf, gen_metadata = self._gen(
    611             n=n,
    612             search_space=search_space,

~/anaconda3/envs/cvnlp/lib/python3.8/site-packages/ax/modelbridge/array.py in _gen(self, n, search_space, pending_observations, fixed_features, model_gen_options, optimization_config)
    200         )
    201         # Generate the candidates
--> 202         X, w, gen_metadata = self._model_gen(
    203             n=n,
    204             bounds=bounds,

~/anaconda3/envs/cvnlp/lib/python3.8/site-packages/ax/modelbridge/torch.py in _model_gen(self, n, bounds, objective_weights, outcome_constraints, linear_constraints, fixed_features, pending_observations, model_gen_options, rounding_func, target_fidelities)
    202         tensor_rounding_func = self._array_callable_to_tensor_callable(rounding_func)
    203         # pyre-fixme[16]: `Optional` has no attribute `gen`.
--> 204         X, w, gen_metadata = self.model.gen(
    205             n=n,
    206             bounds=bounds,

~/anaconda3/envs/cvnlp/lib/python3.8/site-packages/ax/models/torch/botorch.py in gen(self, n, bounds, objective_weights, outcome_constraints, linear_constraints, fixed_features, pending_observations, model_gen_options, rounding_func, target_fidelities)
    355             inequality_constraints = None
    356 
--> 357         acquisition_function = self.acqf_constructor(  # pyre-ignore: [28]
    358             model=model,
    359             objective_weights=objective_weights,

~/anaconda3/envs/cvnlp/lib/python3.8/site-packages/ax/models/torch/botorch_defaults.py in get_NEI(model, objective_weights, outcome_constraints, X_observed, X_pending, **kwargs)
    200             objective=obj_tf, constraints=con_tfs or [], infeasible_cost=inf_cost
    201         )
--> 202     return get_acquisition_function(
    203         acquisition_function_name="qNEI",
    204         model=model,

~/anaconda3/envs/cvnlp/lib/python3.8/site-packages/botorch/acquisition/utils.py in get_acquisition_function(acquisition_function_name, model, objective, X_observed, X_pending, mc_samples, qmc, seed, **kwargs)
     86         )
     87     elif acquisition_function_name == "qNEI":
---> 88         return monte_carlo.qNoisyExpectedImprovement(
     89             model=model,
     90             X_baseline=X_observed,

~/anaconda3/envs/cvnlp/lib/python3.8/site-packages/botorch/acquisition/monte_carlo.py in __init__(self, model, X_baseline, sampler, objective, X_pending, prune_baseline)
    216         )
    217         if prune_baseline:
--> 218             X_baseline = prune_inferior_points(
    219                 model=model, X=X_baseline, objective=objective
    220             )

~/anaconda3/envs/cvnlp/lib/python3.8/site-packages/botorch/acquisition/utils.py in prune_inferior_points(model, X, objective, num_samples, max_frac)
    214     sampler = SobolQMCNormalSampler(num_samples=num_samples)
    215     with torch.no_grad():
--> 216         posterior = model.posterior(X=X)
    217         samples = sampler(posterior)
    218     if objective is None:

~/anaconda3/envs/cvnlp/lib/python3.8/site-packages/botorch/models/gpytorch.py in posterior(self, X, output_indices, observation_noise, **kwargs)
    300                     X=X, original_batch_shape=self._input_batch_shape
    301                 )
--> 302             mvn = self(X)
    303             if observation_noise is not False:
    304                 if torch.is_tensor(observation_noise):

~/anaconda3/envs/cvnlp/lib/python3.8/site-packages/gpytorch/models/exact_gp.py in __call__(self, *args, **kwargs)
    326             # Make the prediction
    327             with settings._use_eval_tolerance():
--> 328                 predictive_mean, predictive_covar = self.prediction_strategy.exact_prediction(full_mean, full_covar)
    329 
    330             # Reshape predictive mean to match the appropriate event shape

~/anaconda3/envs/cvnlp/lib/python3.8/site-packages/gpytorch/models/exact_prediction_strategies.py in exact_prediction(self, joint_mean, joint_covar)
    300 
    301         return (
--> 302             self.exact_predictive_mean(test_mean, test_train_covar),
    303             self.exact_predictive_covar(test_test_covar, test_train_covar),
    304         )

~/anaconda3/envs/cvnlp/lib/python3.8/site-packages/gpytorch/models/exact_prediction_strategies.py in exact_predictive_mean(self, test_mean, test_train_covar)
    318         # You **cannot* use addmv here, because test_train_covar may not actually be a non lazy tensor even for an exact
    319         # GP, and using addmv requires you to delazify test_train_covar, which is obviously a huge no-no!
--> 320         res = (test_train_covar @ self.mean_cache.unsqueeze(-1)).squeeze(-1)
    321         res = res + test_mean
    322 

~/anaconda3/envs/cvnlp/lib/python3.8/site-packages/gpytorch/utils/memoize.py in g(self, *args, **kwargs)
     32         cache_name = name if name is not None else method
     33         if not is_in_cache(self, cache_name):
---> 34             add_to_cache(self, cache_name, method(self, *args, **kwargs))
     35         return get_from_cache(self, cache_name)
     36 

~/anaconda3/envs/cvnlp/lib/python3.8/site-packages/gpytorch/models/exact_prediction_strategies.py in mean_cache(self)
    267 
    268         train_labels_offset = (self.train_labels - train_mean).unsqueeze(-1)
--> 269         mean_cache = train_train_covar.inv_matmul(train_labels_offset).squeeze(-1)
    270 
    271         if settings.detach_test_caches.on():

~/anaconda3/envs/cvnlp/lib/python3.8/site-packages/gpytorch/lazy/lazy_tensor.py in inv_matmul(self, right_tensor, left_tensor)
    932         func = InvMatmul
    933         if left_tensor is None:
--> 934             return func.apply(self.representation_tree(), False, right_tensor, *self.representation())
    935         else:
    936             return func.apply(self.representation_tree(), True, left_tensor, right_tensor, *self.representation())

~/anaconda3/envs/cvnlp/lib/python3.8/site-packages/gpytorch/functions/_inv_matmul.py in forward(ctx, representation_tree, has_left, *args)
     45             res = left_tensor @ res
     46         else:
---> 47             solves = _solve(lazy_tsr, right_tensor)
     48             res = solves
     49 

~/anaconda3/envs/cvnlp/lib/python3.8/site-packages/gpytorch/functions/_inv_matmul.py in _solve(lazy_tsr, rhs)
      9 def _solve(lazy_tsr, rhs):
     10     if settings.fast_computations.solves.off() or lazy_tsr.size(-1) <= settings.max_cholesky_size.value():
---> 11         return lazy_tsr._cholesky()._cholesky_solve(rhs)
     12     else:
     13         with torch.no_grad():

~/anaconda3/envs/cvnlp/lib/python3.8/site-packages/gpytorch/utils/memoize.py in g(self, *args, **kwargs)
     32         cache_name = name if name is not None else method
     33         if not is_in_cache(self, cache_name):
---> 34             add_to_cache(self, cache_name, method(self, *args, **kwargs))
     35         return get_from_cache(self, cache_name)
     36 

~/anaconda3/envs/cvnlp/lib/python3.8/site-packages/gpytorch/lazy/lazy_tensor.py in _cholesky(self)
    412 
    413         # contiguous call is necessary here
--> 414         cholesky = psd_safe_cholesky(evaluated_mat).contiguous()
    415         return NonLazyTensor(cholesky)
    416 

~/anaconda3/envs/cvnlp/lib/python3.8/site-packages/gpytorch/utils/cholesky.py in psd_safe_cholesky(A, upper, out, jitter)
     46             except RuntimeError:
     47                 continue
---> 48         raise e

~/anaconda3/envs/cvnlp/lib/python3.8/site-packages/gpytorch/utils/cholesky.py in psd_safe_cholesky(A, upper, out, jitter)
     23     """
     24     try:
---> 25         L = torch.cholesky(A, upper=upper, out=out)
     26         return L
     27     except RuntimeError as e:

RuntimeError: cholesky_cpu: U(63,63) is zero, singular U.

Issue Analytics

State:
Created 3 years ago
Comments:11 (8 by maintainers)

Top GitHub Comments

5reactions

leopdcommented, May 5, 2020

I’m of the opinion that numeric stability by itself isn’t worth a huge effort to fix when the underlying issue really comes down to operator error – or put another way, that moving forwards appropriately really requires operator attention. I think the most appropriate fix here might be a better error message saying something like “Perhaps one or more parameters has converged?” Automated diagnosis of this would be nice, but again seems like more effort than it’s worth. This github issue will help certainly help document it for future users, but a bit more of an explicit message would be a sufficient fix IMHO.

1reaction

leopdcommented, May 5, 2020

I’m not passing either for SEM - just passing a float:

  self.ax_client.complete_trial(trial_index=self.trial_index, raw_data=float(objective_val))

so whatever the default is. My understanding from the docs is that this should cause the noise to be treated as a free parameter to optimize for. This problem should not be treated as noiseless.