`censored()` proposal
We have #541 and #543, where models for censored data are discussed. Here I sketch the function we could use to handle censoring. It's inspired by Surv() and cens() from the survival and brms packages in R, respectively.
import numpy as np
import pandas as pd
from formulae import design_matrices
Proposal 1
The function takes three arguments: the value of the variable, the censoring status (“left”, “none”, “right”, or “interval”), and an optional second value, the upper bound, which is passed only for interval censoring.
def censored(values, status, right=None):
    # Integer codes stored in the last column of the response matrix
    STATUS_MAPPING = {"left": -1, "none": 0, "right": 1, "interval": 2}

    values = np.asarray(values)
    status = np.asarray(status)
    assert len(values) == len(status)

    # `right` is the upper bound, passed only for interval censoring
    if right is not None:
        right = np.asarray(right)
        assert len(values) == len(right)

    left_values = values
    right_values = right
    status = np.asarray([STATUS_MAPPING[s] for s in status])

    if right_values is not None:
        result = np.column_stack([left_values, right_values, status])
    else:
        result = np.column_stack([left_values, status])
    return result

# Will allow us to do our stuff within Bambi
censored.__metadata__ = {"kind": "censored"}
Dataset 1 Right censoring
rng = np.random.default_rng(1234)
size = 100
p = rng.beta(2, 20, size=size)
lifetime_true = rng.geometric(p)
censored_bool = lifetime_true > 35
observed_lifetime = [value if value <= 35 else 35 for value in lifetime_true]
status = ["right" if value else "none" for value in censored_bool]
data = pd.DataFrame({"lifetime": observed_lifetime, "status": status})
print(data.head())
   lifetime status
0         4   none
1         3   none
2        35  right
3         3   none
4        35  right
Then we can use it as
dm = design_matrices("censored(lifetime, status) ~ 1", data)
print(dm.response)
print(np.asarray(dm.response)[:10])
ResponseMatrix
name: censored(lifetime, status)
kind: numeric
shape: (100, 2)
To access the actual design matrix do 'np.array(this_obj)'
[[ 4  0]
 [ 3  0]
 [35  1]
 [ 3  0]
 [35  1]
 [15  0]
 [21  0]
 [28  0]
 [ 1  0]
 [ 3  0]]
One “drawback” of this approach appears when we consider interval censoring.
Dataset 1 Interval censoring
We know the value lies within an interval, but we don’t know its exact value.
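For reference, the likelihood contribution of a single observation under each status can be written as below. This is a sketch that assumes a continuous outcome with density $f$ and CDF $F$; the symbols $v_i$ (recorded value), $l_i$ and $u_i$ (interval limits) are notation introduced here, not part of the original proposal. For a discrete outcome such as the geometric lifetimes simulated in this issue, $f$ becomes the probability mass function and the endpoints need care about whether they are inclusive.

```latex
L_i =
\begin{cases}
  f(v_i)            & \text{status = none (value observed exactly)} \\
  F(v_i)            & \text{status = left (true value below } v_i\text{)} \\
  1 - F(v_i)        & \text{status = right (true value above } v_i\text{)} \\
  F(u_i) - F(l_i)   & \text{status = interval (true value between } l_i \text{ and } u_i\text{)}
\end{cases}
```

Only the interval case needs two values per observation, which is why the response matrix gains a third column there.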
rng = np.random.default_rng(1234)
size = 100
p = rng.beta(2, 20, size=size)
lifetime_true = rng.geometric(p)
censored_bool = np.logical_and(lifetime_true >= 10, lifetime_true <= 20)
observed_lifetime = [10 if value >= 10 and value <= 20 else value for value in lifetime_true]
status = ["interval" if value else "none" for value in censored_bool]
data2 = pd.DataFrame({"lower": observed_lifetime, "upper": 20, "status": status})
print(data2.head())
   lower  upper status
0      4     20   none
1      3     20   none
2     90     20   none
3      3     20   none
4    103     20   none
print(data2[data2["status"] == "interval"][:5])
    lower  upper    status
5      10     20  interval
20     10     20  interval
26     10     20  interval
35     10     20  interval
37     10     20  interval
Here the call would look like
dm = design_matrices("censored(lower, status, upper) ~ 1", data2)
print(dm.response)
print(np.array(dm.response)[:5])
ResponseMatrix
name: censored(lower, status, upper)
kind: numeric
shape: (100, 3)
To access the actual design matrix do 'np.array(this_obj)'
[[  4  20   0]
 [  3  20   0]
 [ 90  20   0]
 [  3  20   0]
 [103  20   0]]
It works well, but what I find unappealing is that the signature reads “value”, “status”, “value”.
Proposal 2
I have another proposal, which I’m calling censored2() for now. If it is chosen, it will of course be renamed censored().
def censored2(*args):
    STATUS_MAPPING = {"left": -1, "none": 0, "right": 1, "interval": 2}

    # Two arguments: (value, status). Three arguments: (lower, upper, status).
    if len(args) == 2:
        left, status = args
        right = None
    elif len(args) == 3:
        left, right, status = args
    else:
        raise ValueError(f"censored2() takes 2 or 3 arguments, got {len(args)}")

    assert len(left) == len(status)
    if right is not None:
        right = np.asarray(right)
        assert len(left) == len(right)

    status = np.asarray([STATUS_MAPPING[s] for s in status])

    if right is not None:
        result = np.column_stack([left, right, status])
    else:
        result = np.column_stack([left, status])
    return result

Notice that the only parameter is a variadic positional argument (*args). Internally, we check that it has length 2 or 3.
This allows us to do
design_matrices("censored2(lower, upper, status) ~ 1", data2).response
ResponseMatrix
name: censored2(lower, upper, status)
kind: numeric
shape: (100, 3)
and
design_matrices("censored2(lifetime, status) ~ 1", data).response
ResponseMatrix
name: censored2(lifetime, status)
kind: numeric
shape: (100, 2)
which reads much better to me.
In summary, we have two candidate implementations for censored(). Both do the same thing but differ in their signature. The first has three well-defined arguments: a value, a status, and an optional second value. The second takes a single variadic argument and handles it differently depending on how many values it receives. This lets the call read as (value, status) and (value, value2, status) instead of (value, status, value2).
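To make the comparison concrete, here is a minimal, self-contained sketch of both signatures applied to the same toy data. The names `censored_v1`, `censored_v2`, and `encode_status` are hypothetical stand-ins for the two proposals, and the length assertions are left out for brevity. The point is only that the two calls build the same matrix and that the variadic version has to validate the argument count itself.

```python
import numpy as np

STATUS_MAPPING = {"left": -1, "none": 0, "right": 1, "interval": 2}


def encode_status(status):
    """Map status labels to the integer codes used in the proposals."""
    return np.asarray([STATUS_MAPPING[s] for s in status])


def censored_v1(values, status, right=None):
    """Proposal 1 (sketch): value, status, optional upper value."""
    columns = [np.asarray(values)]
    if right is not None:
        columns.append(np.asarray(right))
    columns.append(encode_status(status))
    return np.column_stack(columns)


def censored_v2(*args):
    """Proposal 2 (sketch): (value, status) or (lower, upper, status)."""
    if len(args) == 2:
        left, status = args
        right = None
    elif len(args) == 3:
        left, right, status = args
    else:
        raise ValueError(f"expected 2 or 3 arguments, got {len(args)}")
    return censored_v1(left, status, right)


lower = [4, 10, 3]
upper = [20, 20, 20]
status = ["none", "interval", "none"]

same = np.array_equal(
    censored_v1(lower, status, upper),  # value, status, value
    censored_v2(lower, upper, status),  # value, value, status
)
print(same)  # True
```

The trade-off is that the first signature can be documented and checked argument by argument, while the second reads more naturally in a formula but moves the dispatch into the function body.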
Note: This could be simplified a lot if we decide to support only left and right censoring. At the moment, PyMC supports only those out of the box. Interval censoring still requires more work on our end. I think it’s still worth considering interval censoring from the very beginning, because it may be supported at some point.
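As an illustration of the simpler left/right-only case, the sketch below turns a two-column response matrix into per-observation lower and upper censoring limits. The helper name `censoring_bounds` is hypothetical, and the status codes are the ones assumed in this proposal. Arrays of this form are, as far as I know, what a censored likelihood with `lower` and `upper` limits (such as PyMC's `Censored`) expects, but how Bambi would wire this up is not decided here.

```python
import numpy as np


def censoring_bounds(response):
    """Split an (n, 2) [value, status] matrix into (lower, upper) limits.

    Assumes the status codes from the proposal: -1 left, 0 none, 1 right.
    """
    response = np.asarray(response, dtype=float)
    values, status = response[:, 0], response[:, 1]
    if not np.isin(status, (-1, 0, 1)).all():
        raise ValueError("only left (-1), none (0) and right (1) are handled")
    # A left-censored row records the lower censoring limit itself.
    lower = np.where(status == -1, values, -np.inf)
    # A right-censored row records the upper censoring limit itself.
    upper = np.where(status == 1, values, np.inf)
    return lower, upper


response = np.array([[4, 0], [35, 1], [2, -1]])
lower, upper = censoring_bounds(response)
print(lower)
print(upper)
```

Uncensored rows get infinite limits on both sides, so they are never treated as censored. Interval censoring does not fit this scheme, which is why it needs the extra column and the extra work mentioned above.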
That seems intuitive as well. 😃
To me the proposal for censored2 also looks cleanest. I agree with @tomicapretto about the issue with the | operator and would try to avoid that.
Cox proportional hazards models in R (which most people are very familiar with) use Surv(lifetime, status) ~ 1, or Surv(lower, upper, status) ~ 1 for interval censoring, which would be equivalent to the censored2 proposal.