baseline functions differ in R and Python

Hello! I am translating a Cox model from R to Python and found that the baseline functions in R differ from the baseline functions in Python. Based on the data in the file test.xlsx, the results are as shown in the attached screenshot (Compare_baseline_functions, copy 3).

This seems strange to me, given that the lifelines library is based on the code of the corresponding R functions (if I understand correctly).

At the same time, the model summary and the coefficient on the regressor "var_const" are the same in R and Python (coef[var_const] = 0.05295).

Could you please tell me why the baseline functions may differ between R and Python?

Code in R:

library(survival)
library(survminer)
library(readxl)  # provides read_excel()

#### Data download:
test_R <- read_excel("test.xlsx")

#### Data preparation:
test_R$start = test_R$months
test_R$stop = test_R$start + 1

#### Run the model:
res.cox1 <- coxph(Surv(start, stop, event)~ var_const, data = test_R, method = 'efron')
summary(res.cox1)

#### Calculate baseline cumulative hazard function:
bhest = basehaz(fit = res.cox1)
basehaz_predict <- data.frame(time = bhest$time,
                              hazard = bhest$hazard)

#### Calculate baseline survival function
cox_survival = survfit(res.cox1)
surv_predict <- data.frame(time = cox_survival$time,
                           survival = cox_survival$surv)

Code in Python:

import pandas as pd
import numpy as np
from lifelines.utils import to_long_format
from lifelines import CoxTimeVaryingFitter

#### Data download:
df = pd.read_excel('test.xlsx')

#### Data preparation:
df_model = df[['id', 'event', 'months', 'var_const']]
df_model = to_long_format(df_model, duration_col = "months")
df_model['start'] = df_model['stop']
df_model['stop'] = df_model['start']+1

#### Run the model:
ctv = CoxTimeVaryingFitter()
ctv.fit(df_model, id_col ='id', event_col='event', start_col = 'start', stop_col = 'stop')
ctv.print_summary()

#### Calculate baseline cumulative hazard function:
ctv.baseline_cumulative_hazard_

#### Calculate baseline survival function:
ctv.baseline_survival_
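
To put a number on the discrepancy described above, the two baseline curves can be aligned on a common time grid and compared point by point. The following is only a minimal sketch: the helper name max_abs_gap and the toy values are made up for illustration (they are not taken from test.xlsx), and it assumes each curve has been exported to a pandas Series indexed by time.

```python
import pandas as pd


def max_abs_gap(curve_a: pd.Series, curve_b: pd.Series) -> float:
    """Largest absolute gap between two step curves indexed by time.

    Hypothetical helper: both curves are treated as right-continuous step
    functions, so each one is forward-filled onto the union of the time points.
    """
    times = curve_a.index.union(curve_b.index)
    a = curve_a.reindex(times).ffill()
    b = curve_b.reindex(times).ffill()
    # Times before a curve's first point stay NaN and are skipped by max().
    return float((a - b).abs().max())


# Toy values for illustration only, not the numbers from the issue.
hazard_python = pd.Series([0.10, 0.25, 0.45], index=[1, 2, 3])
hazard_r = pd.Series([0.10, 0.27, 0.45], index=[1, 2, 3])
print(max_abs_gap(hazard_python, hazard_r))
```

With real output, the R table (basehaz_predict) could be written to a file and read back with pandas before being passed to the helper.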

Top GitHub Comments

Valery2511 commented, Sep 30, 2020

@CamDavidsonPilon, you may be interested: I am also attaching a file that compares the baseline functions not only from R and Python but also from STATA (the dataset is the same, test.xlsx): Compare_baseline_functions_R_Python_STATA.xlsx

The most interesting thing is that all three programs give three slightly different results :)

Code in STATA:

* Data download:
import excel "test.xlsx", sheet("Sheet1") firstrow
sort id months

* Run the model:
stset months, id(id) failure(event)
stcox var_const, efron

* Calculate baseline cumulative hazard function:
predict S0_hazard_cumulative, basechazard

* Calculate baseline survival function:
predict S0_baseline_survivor, basesurv
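
One thing that could be checked, although nothing in this thread confirms it as the cause, is whether each package uses the same tie convention when it turns the fitted coefficient into a baseline hazard. The R and STATA calls above request Efron tie handling explicitly. The sketch below uses made-up covariate values (only the coefficient 0.05295 comes from the post) and a hypothetical helper, baseline_increments, to show that the Breslow-style and Efron-style hazard increments at a single event time agree when one subject fails and differ when several subjects fail at the same time.

```python
import numpy as np


def baseline_increments(risk_scores, tied_scores):
    """Baseline hazard increment at one event time under two tie conventions.

    risk_scores: exp(x * beta) for every subject at risk at this time
    tied_scores: exp(x * beta) for the subjects that fail at this time
    Returns (breslow_increment, efron_increment).
    """
    risk_sum = float(np.sum(risk_scores))
    tied_sum = float(np.sum(tied_scores))
    d = len(tied_scores)
    # Breslow: every tied failure is charged against the full risk set.
    breslow = d / risk_sum
    # Efron: the tied subjects are removed from the risk set step by step.
    efron = sum(1.0 / (risk_sum - (k / d) * tied_sum) for k in range(d))
    return breslow, efron


# Made-up covariate values: five subjects at risk, two of them fail together.
beta = 0.05295
risk = np.exp(beta * np.array([1.0, 2.0, 3.0, 4.0, 5.0]))
breslow, efron = baseline_increments(risk, risk[:2])
print(breslow, efron)
```

If the test data contain tied event times, a gap of this kind would accumulate in the cumulative hazard even though the fitted coefficient is identical, so it may be worth comparing the curves at the first tied event time.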

PortlandMichelle commented, Jun 17, 2022

Hello - TBH I discovered the difference using a data set that I am not able to share (due to confidentiality concerns), so I replicated it with the data provided in this post by the original poster. I get the same results they did (in their original screenshot at the top of this post). That's why I was hoping there had been some analysis of that data set back in 2020 that helped illuminate the issue. But if you're able to do some investigation now, it would be great if we could both use the original data from when this question was first asked.
