
Cellbender V2 always hits NaN loss, crashes


Hi there, I was super pumped to try out version 2, so I pulled that branch. Unfortunately, when I run

    cellbender remove-background --input ./spliced/ --output s_cellbended_ambient_200_1000_1000e_V2/s_cellbended.h5 --cuda --expected-cells 1598 --total-droplets-included 11598 --epochs 1000 --z-dim 200 --z-layers 1000 --learning-rate .001 --model ambient

it always crashes after <100 epochs, reporting a NaN training loss. Any idea why? I thought it might be helpful to report this; anything to get v2 running sooner!

    cellbender:remove-background: [epoch 034] average training loss: 1529.0032
    /opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/ATen/native/cuda/Distributions.cu:290: lambda [](int, float &, float &, float &, float &, const float &, const float &, const float &, const float &)->auto::operator()(int, float &, float &, float &, float &, const float &, const float &, const float &, const float &)->auto: block: [0,0,0], thread: [96,0,0] Assertion 0 <= p4 && p4 <= 1 failed.

and

    /utils/newminiconda3/envs/cellbenderV2/lib/python3.7/site-packages/pyro/infer/traceenum_elbo.py:419: UserWarning: Encountered NaN: loss
    log_p(c=2000 | full) = -23.890047073364258
    log_p(c=2000 | empty) = -34.5873908996582
    cell log_sum.mean() is 8.426836013793945
    ~cell log_sum.mean() is 5.024422645568848
    cell log_nnz.mean() is 7.62549352645874
    ~cell log_nnz.mean() is 4.873218536376953
    cell cosine_overlap.mean() is 0.8181769251823425
    ~cell cosine_overlap.mean() is 0.4179271459579468
    x.mean() is 2.4015369490371086e-05
    x.std() is 0.0004832973063457757

    RuntimeError: CUDA error: device-side assert triggered
    Trace Shapes:
    Param Sites:
      encoder_z$$$linears.0.weight      1000 41640
      encoder_z$$$linears.0.bias        1000
      encoder_z$$$loc_out.weight        200 1000
      encoder_z$$$loc_out.bias          200
      encoder_z$$$sig_out.weight        200 1000
      encoder_z$$$sig_out.bias          200
      encoder_other$$$linears.0.weight  50 41643
      encoder_other$$$linears.0.bias    50
      encoder_other$$$linears.1.weight  10 50
      encoder_other$$$linears.1.bias    10
      encoder_other$$$output.weight     4 10
      encoder_other$$$output.bias       4
      d_cell_scale
      alpha0_scale
      d_empty_loc
      d_empty_scale
      chi_ambient                       41640
    Sample Sites:
      data dist |
          value 500 |
      d_empty dist 500 |
          value 500 |
      p_passback dist 500 |
          value 500 |
      y dist 500 |
          value 500 |
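
A note on reading the log above, offered as a likely interpretation rather than something the thread confirms: the assertion 0 <= p4 && p4 <= 1 looks like a range check on a probability passed to a CUDA sampling kernel. If that value has become NaN, the check fails, because every ordered comparison with NaN is false. That would be consistent with the "Encountered NaN: loss" warning. The plain-Python sketch below only illustrates this comparison behavior; `is_valid_probability` is a hypothetical helper, not part of CellBender, Pyro, or PyTorch.

```python
import math


def is_valid_probability(p: float) -> bool:
    """Mirror the shape of the failing check: 0 <= p and p <= 1.

    Hypothetical helper for illustration only.
    """
    return 0.0 <= p <= 1.0


nan = float("nan")

# An ordinary probability passes the range check.
print(is_valid_probability(0.3))

# NaN fails it: both `0.0 <= nan` and `nan <= 1.0` evaluate to False.
print(is_valid_probability(nan))

# math.isnan is the explicit way to detect the bad value beforehand.
print(math.isnan(nan))
```

So a crash at this assertion does not have to mean a probability drifted slightly outside [0, 1]; a NaN reaching the sampler is enough to trigger it.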

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 18 (7 by maintainers)

Top GitHub Comments

1 reaction
letaylor commented, Dec 3, 2020

To echo @sjfleming: I had a similar issue for a few samples, but lowering the learning rate and decreasing z-dim to ~50 solved it. Sharing in case this is useful to anyone else.
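
As a rough sketch of what that suggestion could look like when applied to the command from the original post: the snippet below rebuilds the same invocation with `--z-dim` set to 50 and a smaller `--learning-rate`. The comment does not say which learning rate was used, so the value `0.0001` is purely an assumption, and `build_command` is a hypothetical helper. Only flags that already appear in the original command are used.

```python
import shlex

# Flags and values copied from the command in the original post.
original_args = {
    "--input": "./spliced/",
    "--output": "s_cellbended_ambient_200_1000_1000e_V2/s_cellbended.h5",
    "--expected-cells": "1598",
    "--total-droplets-included": "11598",
    "--epochs": "1000",
    "--z-dim": "200",
    "--z-layers": "1000",
    "--learning-rate": ".001",
    "--model": "ambient",
}

# Changes suggested in the comment above. The z-dim value comes from the
# comment; the learning rate is an assumed example value, not from the thread.
overrides = {
    "--z-dim": "50",
    "--learning-rate": "0.0001",
}


def build_command(args: dict, changes: dict) -> str:
    """Merge the overrides into the original flags and quote for a shell."""
    merged = {**args, **changes}
    parts = ["cellbender", "remove-background", "--cuda"]
    for flag, value in merged.items():
        parts.extend([flag, value])
    return shlex.join(parts)


command = build_command(original_args, overrides)
print(command)
```

The printed line can be pasted into a shell. The output path still carries the old settings in its name, so it may be worth renaming before running.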

1 reaction
mtvector commented, Oct 26, 2020

Actually, the new 0.2 release solved all my issues (great job, @sjfleming). It used to crash on 30-50% of my samples; now I can run all 200 of them without the NaN crashes described previously. So I’m wondering what’s going on with @laijen000’s samples.
