CoxPH throws 'delta contains nan value(s). Convergence halted'
There was a closed issue for this at https://github.com/CamDavidsonPilon/lifelines/issues/242, but it was closed with the direction to ensure no columns had constant values. None of my columns have constant values (my data is attached compressed; GitHub wouldn't let me attach it as a 2,000-row CSV).
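A minimal sketch of one way to check that, assuming pandas (the file name and separator match the attached data used later in this thread):

```python
import pandas as pd

# Sketch: confirm no column is constant. The file name and separator
# are taken from the attached data used later in this thread.
lifelines_df = pd.read_csv("lifelines_df_colnames_masked.csv", sep="|")
constant_cols = [c for c in lifelines_df.columns
                 if lifelines_df[c].nunique(dropna=False) <= 1]
print(constant_cols)  # an empty list means no column is constant
```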
I’m able to fit an AalenAdditiveFitter to this data using the code below:
```python
from lifelines import AalenAdditiveFitter

model = AalenAdditiveFitter()
model.fit(lifelines_df, duration_col='duration', event_col='event_observed')
```
but I get “ValueError: delta contains nan value(s). Convergence halted.” when fitting a CoxPH model with similar code:
```python
from lifelines import CoxPHFitter

model = CoxPHFitter()
model.fit(lifelines_df, duration_col='duration', event_col='event_observed')
```
Thanks for this library!
Top GitHub Comments
Here’s what I did (this is using the latest lifelines, 0.12, released recently):
```python
import pandas as pd
from lifelines import CoxPHFitter

cp = CoxPHFitter()
data = pd.read_csv("lifelines_df_colnames_masked.csv", sep="|")
data = data.dropna()
cp.fit(data, 'duration', 'event_observed')
```
I got a lifelines warning about two columns, and the fit failed with a NaN calculation. However, the warning tells us what the problem is:
```
lifelines/utils/__init__.py:981: RuntimeWarning: Column(s) ['dummy1', 'dummy7'] have very low variance. This may harm convergence. Try dropping this redundant column before fitting if convergence fails.
  warnings.warn(warning_text, RuntimeWarning)
```
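As an aside, the same columns could be surfaced programmatically before calling fit. A sketch, where the 1e-4 cutoff is an arbitrary illustration rather than lifelines' internal threshold:

```python
# Sketch: flag low-variance covariates up front. The 1e-4 cutoff is an
# arbitrary illustration, not the threshold lifelines itself uses.
covariates = data.drop(['duration', 'event_observed'], axis=1)
variances = covariates.var()
print(variances[variances < 1e-4].index.tolist())  # e.g. ['dummy1', 'dummy7']
```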
So I dropped those:
```python
data = data.drop(['dummy1', 'dummy7'], axis=1)
cp.fit(data, 'duration', 'event_observed')
```
Again, the fit failed, this time with a singular-matrix error. This implies some linear dependence.
```python
data.corr()  # noticed num7 and num6 had perfect correlation with num5
data = data.drop(['num7', 'num6'], axis=1)
cp.fit(data, 'duration', 'event_observed')
cp.print_summary()
```
```
n=10186, number of events=3772

           coef  exp(coef)  se(coef)         z       p  lower 0.95  upper 0.95
num1     0.0592     1.0610    0.0213    2.7753  0.0055      0.0174      0.1010   **
num2     0.1001     1.1053    0.0207    4.8350  0.0000      0.0595      0.1407  ***
num3     0.0348     1.0354    0.0311    1.1163  0.2643     -0.0263      0.0958
dummy2  -1.0526     0.3490    1.0025   -1.0500  0.2937     -3.0179      0.9127
dummy3  -1.0857     0.3377    1.0025   -1.0830  0.2788     -3.0509      0.8796
dummy4  -1.0163     0.3619    1.0016   -1.0146  0.3103     -2.9799      0.9473
dummy5  -1.1313     0.3226    1.0420   -1.0857  0.2776     -3.1739      0.9114
dummy6  -1.0859     0.3376    1.0304   -1.0539  0.2919     -3.1058      0.9340
num4    -0.0052     0.9949    0.0004  -11.5358  0.0000     -0.0060     -0.0043  ***
num5    -0.3604     0.6974    0.0364   -9.9046  0.0000     -0.4317     -0.2890  ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Concordance = 0.580
```
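As another aside, scanning data.corr() by eye gets tedious on wider datasets. A minimal sketch for listing near-perfectly correlated covariate pairs, run before any columns are dropped (the 0.999 cutoff is arbitrary):

```python
import numpy as np

# Sketch: list covariate pairs with near-perfect absolute correlation.
# Run on the data before dropping columns; 0.999 is an arbitrary cutoff.
corr = data.drop(['duration', 'event_observed'], axis=1).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
print(pairs[pairs > 0.999])  # e.g. (num5, num6), (num5, num7)
```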
Yeah, the 100% correlation between some columns (‘num7’, ‘num6’) is caused by my doing development on a very small amount of data. In my last comment I had dropped the same columns you do above, but got stuck at:
```
RuntimeWarning: invalid value encountered in sqrt
  se = np.sqrt(inv(-self.hessian).diagonal()) / self._norm_std
```
After upgrading to lifelines 0.12 (I had the latest conda-forge version, 0.11.2) and dropping the low-variance variables you dropped (‘dummy1’, ‘dummy7’), I can now reproduce your results. Those “low-variance” dummies were dummies that were almost always 0 in this split of the data, but not across all of my data.
With cross-validation I don’t always know in advance which dummy variables might be ‘low variance’ for a given data split. I use a cross-validated pipeline and wanted to use a thinly wrapped lifelines model as the estimator at the end of the pipeline. But I can’t dynamically drop whichever dummy columns are low-variance at the end of the pipeline, because the final step (the model) needs to see the same columns in predict. One direction I’m considering is sketched below; otherwise I’ll give this some more thought and maybe post a SO question if I can’t figure it out. Thanks again for the help and the library.
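A minimal sketch of that idea: a scikit-learn-style transformer that records which columns were low-variance on the training split at fit time and drops the same columns at transform time, so fit and predict see identical column sets. LowVarianceDropper is a hypothetical name and the default threshold is an arbitrary illustration:

```python
from sklearn.base import BaseEstimator, TransformerMixin

class LowVarianceDropper(BaseEstimator, TransformerMixin):
    """Sketch: learn which columns are low-variance on the training split
    and drop the same columns at transform time, so the downstream model
    sees identical column sets in fit and predict. Hypothetical helper;
    the default threshold is an arbitrary illustration."""

    def __init__(self, threshold=1e-4):
        self.threshold = threshold

    def fit(self, X, y=None):
        variances = X.var()
        self.cols_to_drop_ = variances[variances < self.threshold].index.tolist()
        return self

    def transform(self, X):
        # errors='ignore' keeps transform safe if a column is already absent
        return X.drop(columns=self.cols_to_drop_, errors='ignore')
```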
Feel free to close this issue.