Design how we're going to extend Bambi
The following is a list of features we’re missing (or covering only partially) in Bambi:
- Distributional models (we model more than the mean parameter of the response)
- Multivariate models (i.e., the response is a multivariate distribution)
- Non-linear models.
- Survival models/Models with censored data.
- Ordinal models.
- Zero and Zero-One inflated models.
The last three points (survival/censored, ordinal, and zero/zero-one inflated) are covered by the first points (distributional and multivariate) if we implement them appropriately. The third point, non-linear models, is a separate problem. I’ll try to add a couple of things I’ve been thinking about lately.
Distributional models
Some API proposals
# Proposed API: one formula part per modeled parameter
formula = bmb.formula(
    "y ~ a + b",
    "sigma ~ a",
)
# Terms of auxiliary parameters are named {param}_{term}
priors = {
    "a": bmb.Prior("Normal", mu=0, sigma=1),
    "b": bmb.Prior("Normal", mu=0, sigma=1),
    "sigma_Intercept": bmb.Prior("Normal", mu=0, sigma=1),
    "sigma_a": bmb.Prior("Normal", mu=0, sigma=1),
}
# One link per modeled parameter
link = {"mu": "identity", "sigma": "log"}
model = bmb.Model(formula, data, priors=priors, link=link)
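To make concrete what a model like the one above would estimate, here is a minimal sketch in plain Python, with no Bambi or PyMC calls, of a Gaussian likelihood where both the mean and the standard deviation have their own linear predictor. The helper names (linear_predictor, gaussian_loglik), the data, and the coefficient values are all made up for illustration; only the link dictionary mirrors the proposal.

```python
import math

# Inverse link functions: identity for mu, exp (the inverse of log) for sigma.
INVERSE_LINKS = {"identity": lambda eta: eta, "log": math.exp}


def linear_predictor(coefs, row):
    """Intercept plus the sum of coefficient * predictor value."""
    return coefs["Intercept"] + sum(
        value * row[name] for name, value in coefs.items() if name != "Intercept"
    )


def gaussian_loglik(y, rows, mu_coefs, sigma_coefs, link):
    """Gaussian log-likelihood when both mu and sigma depend on predictors."""
    total = 0.0
    for y_i, row in zip(y, rows):
        mu = INVERSE_LINKS[link["mu"]](linear_predictor(mu_coefs, row))
        sigma = INVERSE_LINKS[link["sigma"]](linear_predictor(sigma_coefs, row))
        total += (
            -0.5 * math.log(2 * math.pi)
            - math.log(sigma)
            - 0.5 * ((y_i - mu) / sigma) ** 2
        )
    return total


# Made-up data and coefficients, only to exercise the functions.
rows = [{"a": 0.0, "b": 1.0}, {"a": 1.0, "b": 0.5}]
y = [1.2, 2.9]
mu_coefs = {"Intercept": 1.0, "a": 2.0, "b": 0.0}  # y ~ a + b
sigma_coefs = {"Intercept": 0.0, "a": 0.5}         # sigma ~ a
link = {"mu": "identity", "sigma": "log"}

print(gaussian_loglik(y, rows, mu_coefs, sigma_coefs, link))
```

The log link keeps sigma positive whatever the value of its linear predictor, which is why an unconstrained Normal prior on the sigma coefficients is workable.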
- We need a formula object where we can have multiple formula parts. I propose to call it bmb.formula(). There’s an open discussion in #423.
- We need a name for the terms associated with the auxiliary parameters. I propose to use {param}_{term}, such as sigma_a for the term a in the formula for sigma.
- We need a transformation of the linear predictor of the auxiliary parameters into something that makes sense. I propose we have defaults for the built-in families that can be overridden with a dictionary. Note that the link argument of Model does not currently support a dictionary.
I haven’t thought much more about the implementation details, where other concerns may appear. For the moment, I think it’s good to discuss the API we want. Any objections, suggestions, or drawbacks I’m not seeing?
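As one way to picture the naming and link proposals, the sketch below derives coefficient names of the form {param}_{term} from a list of formula strings and merges user-supplied links over per-family defaults. Everything in it is hypothetical: DEFAULT_LINKS, resolve_links, and coefficient_names are illustrative names, not part of Bambi, and the string splitting only handles purely additive formulas (real parsing would go through formulae).

```python
# Hypothetical per-family defaults; not Bambi's actual internals.
DEFAULT_LINKS = {"gaussian": {"mu": "identity", "sigma": "log"}}


def resolve_links(family, overrides=None):
    """Start from the family defaults and apply user overrides on top."""
    links = dict(DEFAULT_LINKS[family])
    for param, link in (overrides or {}).items():
        if param not in links:
            raise ValueError(f"Unknown parameter {param!r} for family {family!r}")
        links[param] = link
    return links


def coefficient_names(formulas):
    """Name the coefficients implied by each formula part.

    The first part describes the response, so its terms keep bare names.
    Later parts describe auxiliary parameters and get a '{param}_' prefix.
    """
    names = []
    for index, part in enumerate(formulas):
        lhs, rhs = (side.strip() for side in part.split("~"))
        terms = ["Intercept"] + [term.strip() for term in rhs.split("+")]
        if index == 0:
            names.extend(terms)
        else:
            names.extend(f"{lhs}_{term}" for term in terms)
    return names


print(coefficient_names(["y ~ a + b", "sigma ~ a"]))
print(resolve_links("gaussian", {"sigma": "softplus"}))
```

Rejecting unknown parameter names in the override dictionary is a design choice in this sketch: a misspelled key fails loudly instead of being silently ignored.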
Multivariate models
We currently support some multivariate families, such as "categorical" and "multinomial". I feel we should think more about the implementation. I think we could make it more general so we don’t need to handle all cases as special cases. With that said, I think there are other things to discuss.
- What do we use to indicate a multivariate response?
  - "c(y1, y2, ..., yn) ~ ..."
  - "mvbind(y1, y2, ..., yn) ~ ..."
  - bmb.formula("y1 ~ ...", "y2 ~ ...", "y3 ~ ...")
  Note that the last alternative allows different predictors to be included for each response.
- How much do we want to support multivariate families?
I’m not an expert in this area but I have the feeling that things can get very complex very quickly. And I’m not sure if this is a highly required feature.
For now, I tend to think we should have minimum support that allows people and us to explore the possibilities available as well as refine the API.
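One way to see how the three notations above relate: the c(...) and mvbind(...) forms can be expanded into one formula per response, which is exactly what the multi-part form spells out by hand. The sketch below does that expansion with a regular expression. The function name expand_multivariate is hypothetical, and the pattern only covers the simple case of a flat list of response names.

```python
import re

# Matches "c(y1, y2) ~ rhs" or "mvbind(y1, y2) ~ rhs".
_MULTI_RESPONSE = re.compile(
    r"^\s*(?:c|mvbind)\((?P<names>[^)]*)\)\s*~\s*(?P<rhs>.+)$"
)


def expand_multivariate(formula):
    """Turn 'c(y1, y2) ~ x' into one formula string per response."""
    match = _MULTI_RESPONSE.match(formula)
    if match is None:
        return [formula.strip()]  # already a single-response formula
    responses = [
        name.strip() for name in match.group("names").split(",") if name.strip()
    ]
    rhs = match.group("rhs").strip()
    return [f"{response} ~ {rhs}" for response in responses]


print(expand_multivariate("c(y1, y2, y3) ~ x + z"))
print(expand_multivariate("mvbind(y1, y2) ~ x"))
```

The expansion always yields the same right-hand side for every response, which is the limitation the multi-part form removes.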
Non-linear models
This has been discussed a little in #448. I think it’s a very nice-to-have feature, but I don’t have it solved in my mind yet. All I have are some API proposals, and I don’t see how to implement them without a huge effort.
First:
formula = bmb.formula(
    "y ~ b1 * np.exp(b2 * x)",
    nlpars=("b1", "b2"),
)
But this comes with a major problem: how do we override the meaning of the * operator in the formula syntax? If we pass something like that to formulae, it won’t multiply things by b1 or b2; it will try to construct full interaction terms between the operands. I like how this approach looks, but it would require a huge amount of effort to parse terms and parameters.
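The clash comes from Wilkinson-style formula notation, where a * b is shorthand for a + b + a:b rather than arithmetic multiplication. The sketch below mimics that expansion for the top-level operands of an expression. It is only an illustration of the rule: split_top_level and expand_star are hypothetical helpers, not formulae's parser, and the term ordering here is just by interaction order.

```python
from itertools import combinations


def split_top_level(expression, operator="*"):
    """Split on the operator, ignoring occurrences inside parentheses."""
    parts, current, depth = [], [], 0
    for char in expression:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        if char == operator and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(char)
    parts.append("".join(current).strip())
    return parts


def expand_star(expression):
    """Expand 'a * b' into main effects plus all interactions."""
    operands = split_top_level(expression)
    terms = []
    for size in range(1, len(operands) + 1):
        for combo in combinations(operands, size):
            terms.append(":".join(combo))
    return terms


print(expand_star("b1 * np.exp(b2 * x)"))
```

Under formula semantics the right-hand side above therefore describes three model terms (b1, the function call, and their interaction), each with its own coefficient, instead of the single non-linear expression the proposal intends.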
Another alternative would be to use a function.
import numpy as np
# Non-linear mean function: x is a predictor, b1 and b2 are parameters
def f(x, b1, b2):
    return b1 * np.exp(b2 * x)
formula = bmb.formula(
    "y ~ f(x, b1, b2)",
    nlpars=("b1", "b2"),
)
This would work on the formulae side, but again we would need to do some parsing to grab the non-linear relationship between the parameters (b1 and b2) and the predictor x. How do we handle arbitrarily complex functions? I’m not sure.
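Part of the bookkeeping for the function-based proposal can be done without parsing the function body: the signature already tells us which arguments are declared parameters and which must be predictors. The sketch below uses the standard inspect module for that split and then evaluates the function on plain floats. The helpers split_arguments and evaluate are hypothetical, and math.exp stands in for np.exp to keep the example dependency-free.

```python
import inspect
import math


def f(x, b1, b2):
    return b1 * math.exp(b2 * x)


def split_arguments(func, nlpars):
    """Separate a function's arguments into predictors and non-linear parameters."""
    names = list(inspect.signature(func).parameters)
    missing = [name for name in nlpars if name not in names]
    if missing:
        raise ValueError(f"nlpars not found in signature: {missing}")
    predictors = [name for name in names if name not in nlpars]
    return predictors, list(nlpars)


def evaluate(func, nlpars, data, values):
    """Evaluate func row by row from predictor columns and parameter values."""
    predictors, _ = split_arguments(func, nlpars)
    n_rows = len(data[predictors[0]])
    return [
        func(**{name: data[name][i] for name in predictors}, **values)
        for i in range(n_rows)
    ]


print(split_arguments(f, ("b1", "b2")))
print(evaluate(f, ("b1", "b2"), {"x": [0.0, 1.0]}, {"b1": 2.0, "b2": 0.5}))
```

This leaves the hard part open. In an actual model the parameters would presumably be symbolic random variables rather than floats, so the function body would have to be written in terms of operations the backend can trace.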
Survival models/Models with censored data.
#543 adds support for survival analysis with right-censored data. One drawback of the proposal is that family="exponential" always implies right-censored data. I think we should have something more general.
I imagine all of the following cases working:
bmb.Model("y ~ ...", data, family="exponential")
bmb.Model("censored(y, status) ~ ...", data, family="exponential")
bmb.Model("censored(y, status, 'left') ~ ...", data, family="exponential")
The challenge is that censored() should be a function that returns an array-like structure (so formulae knows how to handle it) with some attribute that enables Bambi to figure out the characteristics of the censoring. I’m not sure how to implement this, but I know it’s feasible.
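A minimal sketch of the idea, under stated assumptions: a list subclass stands in for the array-like structure, the censoring information travels as the attributes status and side, and status uses the convention 1 = event observed, 0 = censored (the issue does not fix a convention). The class Censored and the function exponential_loglik are illustrative names only. The likelihood shows how the same family can serve all three cases once the metadata is available.

```python
import math


class Censored(list):
    """Observed times that carry censoring metadata as attributes."""

    def __init__(self, values, status, side="right"):
        if side not in ("left", "right"):
            raise ValueError("side must be 'left' or 'right'")
        if len(values) != len(status):
            raise ValueError("values and status must have the same length")
        super().__init__(values)
        self.status = list(status)  # assumed: 1 = observed, 0 = censored
        self.side = side


def censored(values, status, side="right"):
    return Censored(values, status, side)


def exponential_loglik(y, rate):
    """Exponential log-likelihood that uses censoring metadata if present."""
    status = getattr(y, "status", [1] * len(y))  # plain data: all observed
    side = getattr(y, "side", None)
    total = 0.0
    for t, observed in zip(y, status):
        if observed:
            total += math.log(rate) - rate * t         # log density
        elif side == "right":
            total += -rate * t                         # log P(T > t)
        else:
            total += math.log1p(-math.exp(-rate * t))  # log P(T <= t)
    return total


times = [1.0, 2.0]
print(exponential_loglik(times, 0.5))                            # no censoring
print(exponential_loglik(censored(times, [1, 0]), 0.5))          # right-censored
print(exponential_loglik(censored(times, [1, 0], "left"), 0.5))  # left-censored
```

Because the metadata lives on the response object, the family argument stays the same in all three calls, which is the generality asked for above.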
Ordinal models and Zero and Zero-One inflated models.
I think these ones come almost for free if we do a good job with the tasks above.
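To illustrate why zero-inflated families fit the distributional framework, the sketch below writes a zero-inflated Poisson probability mass function with two parameters, each driven by its own linear predictor and link. The parametrization is an assumption made for this example: p_zero is the probability of a structural zero, which may differ from how a given library defines its mixing parameter. The coefficients and the predictor value are made up.

```python
import math


def zero_inflated_poisson_pmf(k, mu, p_zero):
    """P(Y = k) for a Poisson(mu) mixed with a point mass at zero."""
    poisson = math.exp(-mu) * mu ** k / math.factorial(k)
    if k == 0:
        return p_zero + (1.0 - p_zero) * poisson
    return (1.0 - p_zero) * poisson


def inverse_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))


# One linear predictor per parameter, as in a distributional model.
x = 1.5
mu = math.exp(0.2 + 0.4 * x)            # log link keeps mu positive
p_zero = inverse_logit(-1.0 + 0.3 * x)  # logit link keeps p_zero in (0, 1)

print(zero_inflated_poisson_pmf(0, mu, p_zero))
print(zero_inflated_poisson_pmf(3, mu, p_zero))
```

With multi-part formulas in place, the second linear predictor would simply be another formula part for the inflation parameter.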
Comments
Other than the technical implementation, frankly I don’t think it’ll be all that useful, and there’s not a huge userbase for it. If people want non-linear models they can just use PyMC to code those up.
The other use cases imo are much easier to implement in Bambi and will have a wider userbase.
Nice. I like that structure - very clear. Distributional models are a very cool addition!
On Sat, Oct 22, 2022 at 15:08 Tomás Capretto wrote: