Cellbender V2 always hits NaN loss, crashes
Hi there, I was super pumped to try out version 2, so I pulled that branch. Unfortunately, when I run

cellbender remove-background --input ./spliced/ --output s_cellbended_ambient_200_1000_1000e_V2/s_cellbended.h5 --cuda --expected-cells 1598 --total-droplets-included 11598 --epochs 1000 --z-dim 200 --z-layers 1000 --learning-rate .001 --model ambient

it always crashes after fewer than 100 epochs, reporting a NaN training loss. Any idea why? I thought it might be helpful to report this; anything to get v2 running sooner!
cellbender:remove-background: [epoch 034] average training loss: 1529.0032
/opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/ATen/native/cuda/Distributions.cu:290: lambda [](int, float &, float &, float &, float &, const float &, const float &, const float &, const float &)->auto::operator()(int, float &, float &, float &, float &, const float &, const float &, const float &, const float &)->auto: block: [0,0,0], thread: [96,0,0] Assertion `0 <= p4 && p4 <= 1` failed.
/utils/newminiconda3/envs/cellbenderV2/lib/python3.7/site-packages/pyro/infer/traceenum_elbo.py:419: UserWarning: Encountered NaN: loss
log_p(c=2000 | full) = -23.890047073364258
log_p(c=2000 | empty) = -34.5873908996582
cell log_sum.mean() is 8.426836013793945
~cell log_sum.mean() is 5.024422645568848
cell log_nnz.mean() is 7.62549352645874
~cell log_nnz.mean() is 4.873218536376953
cell cosine_overlap.mean() is 0.8181769251823425
~cell cosine_overlap.mean() is 0.4179271459579468
x.mean() is 2.4015369490371086e-05
x.std() is 0.0004832973063457757
RuntimeError: CUDA error: device-side assert triggered
Trace Shapes:
Param Sites:
  encoder_z$$$linears.0.weight 1000 41640
  encoder_z$$$linears.0.bias 1000
  encoder_z$$$loc_out.weight 200 1000
  encoder_z$$$loc_out.bias 200
  encoder_z$$$sig_out.weight 200 1000
  encoder_z$$$sig_out.bias 200
  encoder_other$$$linears.0.weight 50 41643
  encoder_other$$$linears.0.bias 50
  encoder_other$$$linears.1.weight 10 50
  encoder_other$$$linears.1.bias 10
  encoder_other$$$output.weight 4 10
  encoder_other$$$output.bias 4
  d_cell_scale
  alpha0_scale
  d_empty_loc
  d_empty_scale
  chi_ambient 41640
Sample Sites:
  data dist |
      value 500 |
  d_empty dist 500 |
      value 500 |
  p_passback dist 500 |
      value 500 |
  y dist 500 |
      value 500 |
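The CUDA assertion and the NaN warning in this log are plausibly two views of the same event: once a value turns NaN, any probability derived from it fails a range check of the form 0 <= p && p <= 1, because every ordered comparison with NaN is false. The snippet below is a minimal pure-Python sketch of that behaviour only. `in_unit_interval` is a hypothetical helper written for illustration; it is not part of CellBender, Pyro, or PyTorch, and the list of loss terms is made up apart from the first value, which is the last loss printed in the log.

```python
import math


def in_unit_interval(p: float) -> bool:
    """Mirror the range check from the log: 0 <= p && p <= 1."""
    return 0.0 <= p and p <= 1.0


# A valid probability passes the check.
valid = in_unit_interval(0.25)

# NaN fails it: every ordered comparison involving NaN is False,
# so the check rejects NaN just as it rejects an out-of-range value.
nan_result = in_unit_interval(math.nan)
too_large = in_unit_interval(1.5)

# NaN also propagates through arithmetic, so a single bad term is
# enough to make a summed loss NaN (illustrative values only).
loss_terms = [1529.0032, math.nan, 3.5]
total_is_nan = math.isnan(sum(loss_terms))

print(valid, nan_result, too_large, total_is_nan)
```

This does not identify where the NaN first appears in the model; it only shows why a NaN upstream would surface as a failed probability assertion rather than as a clearer error.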
Issue Analytics
- Created 4 years ago
- Comments: 18 (7 by maintainers)
Top GitHub Comments
To echo @sjfleming: I had a similar issue with a few samples, but lowering the learning rate and reducing z-dim to ~50 solved it. Sharing in case this is useful to anyone else.
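The workaround described in this comment could be applied to the command from the original report roughly as follows. This is a sketch under stated assumptions: `build_command` is a hypothetical helper, the flags are exactly those that appear in the original command, and the learning rate of 1e-4 is an assumed value, since the comment does not say how far the rate was lowered.

```python
import shlex
from typing import List


def build_command(learning_rate: float, z_dim: int) -> List[str]:
    """Rebuild the command from the original report with two overrides."""
    return [
        "cellbender", "remove-background",
        "--input", "./spliced/",
        "--output", "s_cellbended_ambient_200_1000_1000e_V2/s_cellbended.h5",
        "--cuda",
        "--expected-cells", "1598",
        "--total-droplets-included", "11598",
        "--epochs", "1000",
        "--z-dim", str(z_dim),
        "--z-layers", "1000",
        "--learning-rate", str(learning_rate),
        "--model", "ambient",
    ]


# Assumed values: z-dim of 50 follows the "~50" in the comment;
# 1e-4 is an illustrative lower learning rate, not one given in the thread.
cmd = build_command(learning_rate=1e-4, z_dim=50)

# Print the command as a single shell-quoted line for inspection.
print(shlex.join(cmd))
```

On a machine with CellBender installed, the resulting list could be passed to `subprocess.run(cmd, check=True)`. Whether these particular values avoid the NaN loss for a given sample is not something the thread establishes.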
Actually, the new 0.2 release solved all my issues (great job, @sjfleming). It used to crash on 30-50% of my samples; now all 200 of them run without the NaN crashes described above. So I'm wondering what's going on with @laijen000's samples.