baseline functions differ in R and Python
Hello! I am translating a Cox model from R to Python and found that the baseline functions in R differ from the baseline functions in Python. Based on the data in the file test.xlsx, the results are as follows:
This seemed strange to me, considering that the lifelines library is based on the code of the corresponding R functions (if I understand correctly).
At the same time, the summary statistics of the Cox model and the coefficient on the regressor “var_const” turned out to be the same in R and Python (coef[var_const] = 0.05295).
Could you please tell me why the baseline functions may differ between R and Python?
Code in R:
library("survival")
library("survminer")
library("readxl")  # provides read_excel(), used below
#### Data download:
test_R <- read_excel("test.xlsx")
#### Data preparation:
test_R$start = test_R$months
test_R$stop = test_R$start + 1
#### Run the model:
res.cox1 <- coxph(Surv(start, stop, event)~ var_const, data = test_R, method = 'efron')
summary(res.cox1)
#### Calculate baseline cumulative hazard function:
bhest = basehaz(fit = res.cox1)
basehaz_predict <- data.frame(time = bhest$time,
hazard = bhest$hazard)
#### Calculate baseline survival function
cox_survival = survfit(res.cox1)
surv_predict <- data.frame(time = cox_survival$time,
survival = cox_survival$surv)
Code in Python:
import pandas as pd
import numpy as np
from lifelines.utils import to_long_format
from lifelines import CoxTimeVaryingFitter
#### Data download:
df = pd.read_excel('test.xlsx')
#### Data preparation:
df_model = df[['id', 'event', 'months', 'var_const']]
df_model = to_long_format(df_model, duration_col = "months")
df_model['start'] = df_model['stop']
df_model['stop'] = df_model['start']+1
#### Run the model:
ctv = CoxTimeVaryingFitter()
ctv.fit(df_model, id_col ='id', event_col='event', start_col = 'start', stop_col = 'stop')
ctv.print_summary()
#### Calculate baseline cumulative hazard function:
ctv.baseline_cumulative_hazard_
#### Calculate baseline survival function:
ctv.baseline_survival_
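One thing that may be worth checking when comparing the two outputs is how tied event times enter the baseline hazard. The R call above requests method = 'efron'; if the other implementation uses a Breslow-type increment for the baseline, the two cumulative hazards would differ whenever several events share a time, even with identical coefficients. Whether this explains the gap reported here is not confirmed in the thread. The sketch below is only an illustration of the two textbook formulas at a single event time; the function names and the toy numbers are hypothetical and do not come from test.xlsx.

```python
import numpy as np

def breslow_increment(risk_scores, is_event):
    """Breslow-type increment: d / (sum of exp(x*beta) over the risk set)."""
    d = int(is_event.sum())
    return d / risk_scores.sum()

def efron_increment(risk_scores, is_event):
    """Efron-type increment: each of the d tied events removes a fraction k/d
    of the tied subjects' total risk score from the denominator."""
    d = int(is_event.sum())
    total = risk_scores.sum()
    tied = risk_scores[is_event].sum()
    return sum(1.0 / (total - (k / d) * tied) for k in range(d))

# Toy risk set at one event time: four subjects with exp(x*beta) = 1,
# two of whom have the event at this time (hypothetical values).
risk = np.array([1.0, 1.0, 1.0, 1.0])
event = np.array([True, True, False, False])

breslow = breslow_increment(risk, event)  # 2 / 4
efron = efron_increment(risk, event)      # 1/4 + 1/3

# With a single event at a time point the two increments coincide.
single = np.array([True, False, False, False])
same = np.isclose(breslow_increment(risk, single), efron_increment(risk, single))
```

If the data contain no tied event times, this cannot be the cause, and other conventions (for example the reference covariate value used for the baseline) would need to be examined instead.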
Issue Analytics
- Created 3 years ago
- Comments:6 (2 by maintainers)
Top GitHub Comments
@CamDavidsonPilon, you may be interested in this: I am also attaching a file comparing the baseline functions not only from R and Python but also from STATA (the dataset is the same, test.xlsx): Compare_baseline_functions_R_Python_STATA.xlsx
The most interesting thing is that all three programs give three slightly different results:)
Code in STATA:
Hello - TBH I discovered the difference using a data set that I am not able to share (due to confidentiality concerns), so I replicated it with the data provided in this post by the original poster. I get results identical to theirs (in the original screenshot at the top of this post). That's why I was hoping some analysis had been done with that data set back in 2020 that helped illuminate the issue. But if you're able to investigate now, it would be great if we could both use the original data from when this question was first asked.
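For anyone repeating this comparison across packages, another convention worth ruling out is the covariate value at which each package reports its "baseline". Under the Cox model the cumulative hazard at a reference value x_ref equals the cumulative hazard at x = 0 multiplied by exp(beta * x_ref), so a baseline reported at the covariate mean and one reported at zero differ by a constant factor. The sketch below only illustrates that rescaling; which convention each package uses here is an assumption to verify against its documentation, and the covariate mean and hazard values are hypothetical (only the coefficient 0.05295 comes from the original post).

```python
import numpy as np

def rescale_cumulative_hazard(cum_hazard_at_zero, beta, x_ref):
    """Cox model identity: H(t | x_ref) = H0(t) * exp(beta . x_ref)."""
    factor = np.exp(np.dot(beta, x_ref))
    return np.asarray(cum_hazard_at_zero, dtype=float) * factor

beta = np.array([0.05295])           # coefficient reported in the original post
x_mean = np.array([2.0])             # hypothetical mean of var_const
h0 = np.array([0.01, 0.03, 0.06])    # hypothetical cumulative hazard at x = 0

h_at_mean = rescale_cumulative_hazard(h0, beta, x_mean)

# Survival via the commonly used relationship S(t) = exp(-H(t)).
s_at_mean = np.exp(-h_at_mean)

# If two packages differ only in this convention, their ratio is constant in t.
ratio = h_at_mean / h0
```

A ratio between two packages' cumulative hazards that is constant over time would point to this convention; a ratio that changes over time would point elsewhere.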