Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

`censored()` proposal

See original GitHub issue

We have #541 and #543 where models for censored data are discussed. Here I sketch the function we could use to handle censoring. It’s inspired by Surv() and cens() from the survival and brms packages in R respectively.

import numpy as np
import pandas as pd

from formulae import design_matrices

Proposal 1

The function has three arguments. The first is the value of the variable, the second is the status (“left”, “none”, “right”, and “interval”), and the third is an optional value that is passed when we use interval censoring.

def censored(values, status, right=None):
    STATUS_MAPPING = {"left": -1, "none": 0, "right": 1, "interval": 2}

    values = np.asarray(values)
    status = np.asarray(status)
    assert len(values) == len(status)
    if right is not None:
        right = np.asarray(right)
        assert len(values) == len(right)

    left_values = values
    right_values = right
    status = np.asarray([STATUS_MAPPING[s] for s in status])
    
    if right_values is not None:
        result = np.column_stack([left_values, right_values, status])
    else:
        result = np.column_stack([left_values, status])
    return result

# Will allow us to do our stuff within Bambi
censored.__metadata__ = {"kind": "censored"}

Dataset 1 Right censoring

rng = np.random.default_rng(1234)
size = 100
p = rng.beta(2, 20, size=size)
lifetime_true = rng.geometric(p)
censored_bool = lifetime_true > 35
observed_lifetime = [value if value <= 35 else 35 for value in lifetime_true]
status = ["right" if value else "none" for value in censored_bool]
data = pd.DataFrame({"lifetime": observed_lifetime, "status": status})
print(data.head())

   lifetime status
0         4   none
1         3   none
2        35  right
3         3   none
4        35  right

Then we can use it as

dm = design_matrices("censored(lifetime, status) ~ 1", data)
print(dm.response)
print(np.asarray(dm.response)[:10])

ResponseMatrix  
  name: censored(lifetime, status)
  kind: numeric
  shape: (100, 2)

To access the actual design matrix do 'np.array(this_obj)'
[[ 4  0]
 [ 3  0]
 [35  1]
 [ 3  0]
 [35  1]
 [15  0]
 [21  0]
 [28  0]
 [ 1  0]
 [ 3  0]]

One “drawback” of this approach appears when we consider interval censoring.

Dataset 1 Interval censoring

We know the value is within an interval, but we don’t know the exact value

rng = np.random.default_rng(1234)
size = 100
p = rng.beta(2, 20, size=size)
lifetime_true = rng.geometric(p)
censored_bool = np.logical_and(lifetime_true >= 10, lifetime_true <= 20)
observed_lifetime = [10 if value >= 10 and value <= 20 else value for value in lifetime_true]
status = ["interval" if value else "none" for value in censored_bool]
data2 = pd.DataFrame({"lower": observed_lifetime, "upper": 20, "status": status})
print(data2.head())

   lower  upper status
0      4     20   none
1      3     20   none
2     90     20   none
3      3     20   none
4    103     20   none

print(data2[data2["status"] == "interval"][:5])

    lower  upper    status
5      10     20  interval
20     10     20  interval
26     10     20  interval
35     10     20  interval
37     10     20  interval

Here the call would look like

dm = design_matrices("censored(lower, status, upper) ~ 1", data2)
print(dm.response)
print(np.array(dm.response))

ResponseMatrix  
  name: censored(lower, status, upper)
  kind: numeric
  shape: (100, 3)

To access the actual design matrix do 'np.array(this_obj)'
[[  4  20   0]
 [  3  20   0]
 [ 90  20   0]
 [  3  20   0]
 [103  20   0]]

It works well, but what makes it not very appealing to me is that we have “value”, “status”, “value” in the signature.

But I have another proposal, that I’m calling censored2() for now. If this becomes the chosen one, of course it will be named censored().

def censored2(*args):
    STATUS_MAPPING = {"left": -1, "none": 0, "right": 1, "interval": 2}

    if len(args) == 2:
        left, status = args
        right = None
    elif len(args) == 3:
        left, right, status = args
    else:
        raise
    
    assert len(left) == len(status)

    if right is not None:
        right = np.asarray(right)
        assert len(left) == len(right)

    status = np.asarray([STATUS_MAPPING[s] for s in status])
    
    if right is not None:
        result = np.column_stack([left, right, status])
    else:
        result = np.column_stack([left, status])
    
    return result

Notice the only argument is an unnamed argument of variable length. Internally, we check it’s of length 2 or 3.

This allows us to do

design_matrices("censored2(lower, upper, status) ~ 1", data2).response

ResponseMatrix  
  name: censored2(lower, upper, status)
  kind: numeric
  shape: (100, 3)

and

design_matrices("censored2(lifetime, status) ~ 1", data).response

ResponseMatrix  
  name: censored2(lifetime, status)
  kind: numeric
  shape: (100, 2)

which reads much better to me.

In summary, we have two candidate implementations for censored(). They both do the same, but they differ in the signature. One signature has 3 arguments well defined. The first is a value, the second a status, and the third is an optional value. The other signature has a single argument, which is an unnamed argument of variable length. Internally, we handle it differently depending on how many arguments we got. This allows the code to look like (value, status) and (value, value2, status) instead of (value, status, value2).

Note: This could be simplified a lot if we decide to support only left and right censoring. At the moment, PyMC supports only those out of the box. Interval censoring still requires more work on our end. I think it’s still worth considering interval censoring from the very beginning, because it may be supported at some point.

Issue Analytics

State:
Created a year ago
Comments:8

Top GitHub Comments

1reaction

don-jilcommented, Oct 19, 2022

That seems intuitive as well. 😃

1reaction

ipacommented, Oct 19, 2022

To me the proposal for censored2 also looks cleanest. I agree with @tomicapretto with the issue on the | operator and would try to avoid that.

Cox proportional hazard models in R (which most are very familiar with) use Surv(lifetime, status) ~ 1 or Surv(lower, upper, status) ~ 1 for intervals or interval censoring, which would be equivalent to the censored2 proposal.

Top Results From Across the Web

Lecture 9 Models for Censored and Truncated Data

(2) Incidental Truncation: Wage offer married women. Only those who are working has wage information. It is the people's decision, not the survey's...

Censorship | The First Amendment Encyclopedia

Censors seek to limit freedom of thought and expression by restricting spoken words, printed matter, symbolic messages, freedom of association, books, art, ...

A Bold Proposal for Fighting Censorship

A Bold Proposal for Fighting Censorship: Increase the Collateral Damage ... or call for collective action the term will be censored.

Interval censoring - PMC - NCBI - NIH

In the paper, they proposed a self-consistency algorithm for estimating the distribution of the survival variable of interest. Following their ...

Using icenReg for interval censored data in R v2.0.9

This manual is meant to provide an introduction to using icenReg to analyze interval censored data. It is written with expectation that the ......