Steps to troubleshoot if `fit` fails

See original GitHub issue

👋 I’m really enjoying the ability to perform Bayesian mixed-effect models with bambi, but I had a frustrating first experience. I’ll detail my thought process as an end-user.

I followed the examples in the docs, but when I went to perform a mixed-effects model with my data (csv attached below), I hit the following bug:

from bambi import Model  # df is a pandas DataFrame loaded from the attached csv
model = Model("od ~ temp + (1|source) + 0", df)
results = model.fit(draws=2000, chains=2)
/Users/camerondavidson-pilon/venvs/data/lib/python3.9/site-packages/pymc3/step_methods/hmc/quadpotential.py:224: RuntimeWarning: divide by zero encountered in true_divide
  np.divide(1, self._stds, out=self._inv_stds)
/Users/camerondavidson-pilon/venvs/data/lib/python3.9/site-packages/pymc3/step_methods/hmc/quadpotential.py:203: RuntimeWarning: invalid value encountered in multiply
  return np.multiply(self._var, x, out=out)
---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/camerondavidson-pilon/venvs/data/lib/python3.9/site-packages/pymc3/parallel_sampling.py", line 137, in run
    self._start_loop()
  File "/Users/camerondavidson-pilon/venvs/data/lib/python3.9/site-packages/pymc3/parallel_sampling.py", line 191, in _start_loop
    point, stats = self._compute_point()
  File "/Users/camerondavidson-pilon/venvs/data/lib/python3.9/site-packages/pymc3/parallel_sampling.py", line 216, in _compute_point
    point, stats = self._step_method.step(self._point)
  File "/Users/camerondavidson-pilon/venvs/data/lib/python3.9/site-packages/pymc3/step_methods/arraystep.py", line 276, in step
    apoint, stats = self.astep(array)
  File "/Users/camerondavidson-pilon/venvs/data/lib/python3.9/site-packages/pymc3/step_methods/hmc/base_hmc.py", line 147, in astep
    self.potential.raise_ok(self._logp_dlogp_func._ordering.vmap)
  File "/Users/camerondavidson-pilon/venvs/data/lib/python3.9/site-packages/pymc3/step_methods/hmc/quadpotential.py", line 272, in raise_ok
    raise ValueError("\n".join(errmsg))
ValueError: Mass matrix contains zeros on the diagonal.
The derivative of RV `1|source_offset`.ravel()[0] is zero.
The derivative of RV `1|source_sigma_log__`.ravel()[0] is zero.
"""

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
ValueError: Mass matrix contains zeros on the diagonal.
The derivative of RV `1|source_offset`.ravel()[0] is zero.
The derivative of RV `1|source_sigma_log__`.ravel()[0] is zero.

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
<ipython-input-95-55d7fad1560b> in <module>
----> 1 results = model.fit(draws=2000, chains=2)

~/venvs/data/lib/python3.9/site-packages/bambi/models.py in fit(self, omit_offsets, backend, **kwargs)
    213             )
    214
--> 215         return self.backend.run(omit_offsets=omit_offsets, **kwargs)
    216
    217     def build(self, backend="pymc"):

~/venvs/data/lib/python3.9/site-packages/bambi/backends/pymc.py in run(self, start, method, init, n_init, omit_offsets, **kwargs)
    133             draws = kwargs.pop("draws", 1000)
    134             with model:
--> 135                 idata = pm.sample(
    136                     draws,
    137                     start=start,

~/venvs/data/lib/python3.9/site-packages/pymc3/sampling.py in sample(draws, step, init, n_init, start, trace, chain_idx, chains, cores, tune, progressbar, model, random_seed, discard_tuned_samples, compute_convergence_checks, callback, jitter_max_retries, return_inferencedata, idata_kwargs, mp_ctx, pickle_backend, **kwargs)
    557         _print_step_hierarchy(step)
    558         try:
--> 559             trace = _mp_sample(**sample_args, **parallel_args)
    560         except pickle.PickleError:
    561             _log.warning("Could not pickle model, sampling singlethreaded.")

~/venvs/data/lib/python3.9/site-packages/pymc3/sampling.py in _mp_sample(draws, tune, step, chains, cores, chain, random_seed, start, progressbar, trace, model, callback, discard_tuned_samples, mp_ctx, pickle_backend, **kwargs)
   1475         try:
   1476             with sampler:
-> 1477                 for draw in sampler:
   1478                     trace = traces[draw.chain - chain]
   1479                     if trace.supports_sampler_stats and draw.stats is not None:

~/venvs/data/lib/python3.9/site-packages/pymc3/parallel_sampling.py in __iter__(self)
    477
    478         while self._active:
--> 479             draw = ProcessAdapter.recv_draw(self._active)
    480             proc, is_last, draw, tuning, stats, warns = draw
    481             self._total_draws += 1

~/venvs/data/lib/python3.9/site-packages/pymc3/parallel_sampling.py in recv_draw(processes, timeout)
    357             else:
    358                 error = RuntimeError("Chain %s failed." % proc.chain)
--> 359             raise error from old_error
    360         elif msg[0] == "writing_done":
    361             proc._readable = True

RuntimeError: Chain 0 failed.

My next steps were to scale back the formula to something simpler:

model = Model("od ~ temp + 1", df)
results = model.fit(draws=2000, chains=2)

This failed, similarly. The error messages didn’t provide me with much information about what could have been wrong, and googling the error suggested a chain was falling into a bad region - okay.

I inspected my data, and nothing about it looked off. I inspected the priors by printing model, and they seemed tighter than I was expecting.

Formula: od ~ temp + 1
Family name: Gaussian
Link: identity
Observations: 39
Priors:
  Intercept ~ Normal(mu: 0.03708401, sigma: 0.01820133)
  temp ~ Normal(mu: 0, sigma: 0.00049071)
  sigma ~ HalfStudentT(nu: 4, sigma: 0.00307659)

For a while I thought the problem was here, so I tried widening the priors - no effect.

On a whim, I tried init="adapt_diag" in the fit as I saw this in other PyMC3 examples, and this worked. I was able to run both models now successfully. I guess the jitter was pushing my (tiny) priors into bad regions?
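To put the jitter guess in rough numbers, here is a small back-of-the-envelope sketch. It assumes that PyMC3's default initialization (`jitter+adapt_diag`) offsets each starting value by a uniform draw from [-1, 1], and it looks only at the worst case of a full +1 offset from the prior mean for the two Normal priors printed above. The function names are made up for illustration, and this does not prove the jitter was the cause; it only shows how far such a start sits from priors this narrow.

```python
import math

# Prior parameters copied from the printed model above.
PRIORS = {
    "Intercept": {"mu": 0.03708401, "sigma": 0.01820133},
    "temp": {"mu": 0.0, "sigma": 0.00049071},
}

JITTER = 1.0  # assumption: worst-case offset under uniform [-1, 1] jitter


def normal_logpdf(x, mu, sigma):
    """Log-density of Normal(mu, sigma) evaluated at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)


def jitter_report(priors, jitter=JITTER):
    """Return {name: (z_score, logp_at_mean, logp_at_jittered_start)}."""
    report = {}
    for name, p in priors.items():
        start = p["mu"] + jitter
        z = (start - p["mu"]) / p["sigma"]
        report[name] = (z, normal_logpdf(p["mu"], **p), normal_logpdf(start, **p))
    return report


report = jitter_report(PRIORS)
for name, (z, logp_mean, logp_start) in report.items():
    print(f"{name}: start is {z:.0f} prior sd from the mean, "
          f"log-density {logp_mean:.1f} -> {logp_start:.1f}")
```

Under these assumptions a jittered start for `temp` can land a couple of thousand prior standard deviations away from the mean, which is at least consistent with the sampler starting in a region where the gradients are numerically useless.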

I’m lucky that I’m familiar with how Bayesian inference works in the backend, but I imagine other users, who are attracted to the high-level API of bambi, have less experience, and probably would have churned off the package quickly. My suggestion would be some docs on troubleshooting if fit fails, or even better: having bambi detect these problems and correct them auto-magically before inference is done.
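As a user-side illustration of the "detect and correct" idea, a retry wrapper could look roughly like the sketch below. `fit_with_fallback` is a hypothetical helper, not part of bambi, and `fake_fit` is a stand-in for `model.fit` so the sketch runs without bambi installed. The only behaviour taken from the report above is that the failure surfaces as a `RuntimeError` caused by a `ValueError`, and that passing `init="adapt_diag"` to `fit` worked.

```python
def fit_with_fallback(fit, fallback_init="adapt_diag", **kwargs):
    """Call fit(**kwargs); if sampling fails, retry once with a jitter-free init.

    `fit` is any callable shaped like `model.fit`. Hypothetical helper,
    not a bambi API.
    """
    try:
        return fit(**kwargs)
    except (RuntimeError, ValueError) as first_error:
        # The traceback above ends in RuntimeError("Chain 0 failed."),
        # which was caused by a ValueError about the mass matrix.
        if kwargs.get("init") == fallback_init:
            raise  # already using the fallback; nothing else to try
        print(f"fit failed ({first_error}); retrying with init={fallback_init!r}")
        retry_kwargs = {**kwargs, "init": fallback_init}
        return fit(**retry_kwargs)


# Stand-in for model.fit: fails unless init="adapt_diag" is requested.
def fake_fit(draws=1000, chains=2, init="jitter+adapt_diag"):
    if init != "adapt_diag":
        raise RuntimeError("Chain 0 failed.")
    return {"draws": draws, "chains": chains, "init": init}


results = fit_with_fallback(fake_fit, draws=2000, chains=2)
print(results)
```

With a real model the call would presumably be `fit_with_fallback(model.fit, draws=2000, chains=2)`. Retrying silently can hide a genuinely misspecified model, so printing (or logging) the first error seems like the safer default.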

Dataset used: obs.csv

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

2 reactions
CamDavidsonPilon commented, Jul 28, 2021

Thanks team,

A solution like #383 is kinda what I was hoping for: a low-level, under-the-covers check for problems and provided solution so users don’t need to go digging around discourse or ipynb docs. This would have likely completely solved the problem I was having. I’m sure others will benefit, too.

0 reactions
tomicapretto commented, Aug 4, 2021

Feel free to open it again if you experience any other problems with the sampler initialization.
