CoxPH throws 'delta contains nan value(s). Convergence halted'
There was a closed issue for this at https://github.com/CamDavidsonPilon/lifelines/issues/242, but it was closed with the direction to ensure no columns had constant values. None of my columns have constant values (my data is attached compressed; GitHub wouldn't let me attach it as a 2,000-row CSV).
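A minimal sketch of one way to check that, assuming pandas (the file name and separator match the attached data used later in this thread):

```python
import pandas as pd

# Sketch: confirm no column is constant. The file name and separator
# are taken from the attached data used later in this thread.
lifelines_df = pd.read_csv("lifelines_df_colnames_masked.csv", sep="|")
constant_cols = [c for c in lifelines_df.columns
                 if lifelines_df[c].nunique(dropna=False) <= 1]
print(constant_cols)  # an empty list means no column is constant
```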
I’m able to fit an AalenAdditiveFitter to this data using the code below:
```python
from lifelines import AalenAdditiveFitter

model = AalenAdditiveFitter()
model.fit(lifelines_df, duration_col='duration', event_col='event_observed')
```
but I get “ValueError: delta contains nan value(s). Convergence halted.” when fitting a CoxPH model with similar code:
```python
from lifelines import CoxPHFitter

model = CoxPHFitter()
model.fit(lifelines_df, duration_col='duration', event_col='event_observed')
```
Thanks for this library!
Top GitHub Comments
Here’s what I did (this is using the latest lifelines, 0.12, released recently):
```python
import pandas as pd
from lifelines import CoxPHFitter

cp = CoxPHFitter()
data = pd.read_csv("lifelines_df_colnames_masked.csv", sep="|")
data = data.dropna()
cp.fit(data, 'duration', 'event_observed')
```
I got a lifelines warning about two columns, and the fit failed with a NaN calculation. However, the warning tells us what the problem is:
```
lifelines/utils/__init__.py:981: RuntimeWarning: Column(s) ['dummy1', 'dummy7'] have very low variance. This may harm convergence. Try dropping this redundant column before fitting if convergence fails.
  warnings.warn(warning_text, RuntimeWarning)
```
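As an aside, the same columns could be surfaced programmatically before calling fit. A sketch, where the 1e-4 cutoff is an arbitrary illustration rather than lifelines' internal threshold:

```python
# Sketch: flag low-variance covariates up front. The 1e-4 cutoff is an
# arbitrary illustration, not the threshold lifelines itself uses.
covariates = data.drop(['duration', 'event_observed'], axis=1)
variances = covariates.var()
print(variances[variances < 1e-4].index.tolist())  # e.g. ['dummy1', 'dummy7']
```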
So I dropped those:
```python
data = data.drop(['dummy1', 'dummy7'], axis=1)
cp.fit(data, 'duration', 'event_observed')
```
Again, the fit failed, this time with a singular-matrix error. This implies some linear dependence.
```python
data.corr()  # noticed num7 and num6 had perfect correlation with num5
data = data.drop(['num7', 'num6'], axis=1)
cp.fit(data, 'duration', 'event_observed')
cp.print_summary()
```
```
n=10186, number of events=3772

           coef  exp(coef)  se(coef)         z       p  lower 0.95  upper 0.95
num1     0.0592     1.0610    0.0213    2.7753  0.0055      0.0174      0.1010   **
num2     0.1001     1.1053    0.0207    4.8350  0.0000      0.0595      0.1407  ***
num3     0.0348     1.0354    0.0311    1.1163  0.2643     -0.0263      0.0958
dummy2  -1.0526     0.3490    1.0025   -1.0500  0.2937     -3.0179      0.9127
dummy3  -1.0857     0.3377    1.0025   -1.0830  0.2788     -3.0509      0.8796
dummy4  -1.0163     0.3619    1.0016   -1.0146  0.3103     -2.9799      0.9473
dummy5  -1.1313     0.3226    1.0420   -1.0857  0.2776     -3.1739      0.9114
dummy6  -1.0859     0.3376    1.0304   -1.0539  0.2919     -3.1058      0.9340
num4    -0.0052     0.9949    0.0004  -11.5358  0.0000     -0.0060     -0.0043  ***
num5    -0.3604     0.6974    0.0364   -9.9046  0.0000     -0.4317     -0.2890  ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Concordance = 0.580
```
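As another aside, scanning data.corr() by eye gets tedious on wider datasets. A minimal sketch for listing near-perfectly correlated covariate pairs, run before any columns are dropped (the 0.999 cutoff is arbitrary):

```python
import numpy as np

# Sketch: list covariate pairs with near-perfect absolute correlation.
# Run on the data before dropping columns; 0.999 is an arbitrary cutoff.
corr = data.drop(['duration', 'event_observed'], axis=1).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
print(pairs[pairs > 0.999])  # e.g. (num5, num6), (num5, num7)
```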
Yeah, the 100% correlation between some columns (‘num7’, ‘num6’) is caused by my doing development on a very small amount of data. In my last comment I had dropped the same columns you do above, but got stuck at:
```
RuntimeWarning: invalid value encountered in sqrt
  se = np.sqrt(inv(-self.hessian).diagonal()) / self._norm_std
```
After upgrading to lifelines 0.12 (I had the latest conda-forge version, 0.11.2) and dropping the low-variance variables you dropped (‘dummy1’, ‘dummy7’), I can now reproduce your results. Those “low-variance” dummies were dummies that were almost always 0 in this split of the data, but not across all of my data.
With cross-validation I don’t always know in advance which dummy variables might be ‘low variance’ for a given data split. I use a cross-validated pipeline and wanted to use a thinly wrapped lifelines model as the estimator at the end of the pipeline. But I can’t dynamically drop whichever dummy columns are low-variance at the end of the pipeline, because the final step (the model) needs to see the same columns in predict. One direction I’m considering is sketched below; otherwise I’ll give this some more thought and maybe post a SO question if I can’t figure it out. Thanks again for the help and the library.
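A minimal sketch of that idea: a scikit-learn-style transformer that records which columns were low-variance on the training split at fit time and drops the same columns at transform time, so fit and predict see identical column sets. LowVarianceDropper is a hypothetical name and the default threshold is an arbitrary illustration:

```python
from sklearn.base import BaseEstimator, TransformerMixin

class LowVarianceDropper(BaseEstimator, TransformerMixin):
    """Sketch: learn which columns are low-variance on the training split
    and drop the same columns at transform time, so the downstream model
    sees identical column sets in fit and predict. Hypothetical helper;
    the default threshold is an arbitrary illustration."""

    def __init__(self, threshold=1e-4):
        self.threshold = threshold

    def fit(self, X, y=None):
        variances = X.var()
        self.cols_to_drop_ = variances[variances < self.threshold].index.tolist()
        return self

    def transform(self, X):
        # errors='ignore' keeps transform safe if a column is already absent
        return X.drop(columns=self.cols_to_drop_, errors='ignore')
```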
Feel free to close this issue.