RFC Sample weight invariance properties
This can wait until after the release.
A discussion happened in the GLM PR https://github.com/scikit-learn/scikit-learn/pull/14300 about what properties we would like sample_weight
to have.
First, a short side comment about 3 ways sample weights (`s_i`) are currently used in loss functions with regularized generalized linear models in scikit-learn (as far as I understand):

- `L_1a = sum_i s_i * loss_i + alpha * penalty`
  For instance: `Ridge` (also `LogisticRegression`, where `C = 1/alpha`)
- `L_2a = (1/n) * sum_i s_i * loss_i + alpha * penalty`
  For instance: `SGDClassifier`? (maybe `Lasso`, `ElasticNet` once they are added?)
- `L_2b = (1/sum_i s_i) * sum_i s_i * loss_i + alpha * penalty`
  For instance: currently proposed in the GLM PR for `PoissonRegressor` etc.
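To make the three weightings concrete, here is a toy transcription (my own sketch, not scikit-learn code), assuming a squared loss and an L2 penalty on a single scalar parameter `w`; `L_1a` sums the weighted losses, `L_2a` additionally divides by `n`, and `L_2b` divides by the total weight:

```python
# Toy transcription of the three candidate objectives (illustration only):
# squared loss, L2 penalty on a single scalar parameter w.
def L_1a(w, y, s, alpha):
    return sum(si * (yi - w) ** 2 for si, yi in zip(s, y)) + alpha * w ** 2

def L_2a(w, y, s, alpha):
    # same weighted loss, normalized by the number of samples
    return L_1a(w, y, s, 0.0) / len(y) + alpha * w ** 2

def L_2b(w, y, s, alpha):
    # same weighted loss, normalized by the total weight
    return L_1a(w, y, s, 0.0) / sum(s) + alpha * w ** 2

y, s = [1.0, 2.0, 4.0], [1.0, 0.5, 2.0]
# Doubling every weight changes L_1a and L_2a but leaves L_2b untouched:
print(abs(L_2b(1.5, y, [2 * si for si in s], 0.1)
          - L_2b(1.5, y, s, 0.1)) < 1e-12)  # True
```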
For sample weights it's useful to think in terms of invariance properties, as they can be directly expressed in common tests. For instance:

- checking that zero sample weight is equivalent to ignoring the corresponding samples in https://github.com/scikit-learn/scikit-learn/pull/15015 (replaced by #17176) helped discover a number of issues. All of the above formulations should verify this, but it is actually verified only by `L_1a` and `L_2b` (in `L_2a`, the `1/n` factor still counts the zero-weighted samples).
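This zero-weight check can be reproduced with a hypothetical 1-D ridge model (my own sketch: squared loss plus `alpha * w**2` on a single scalar `w`, whose minimizer is available in closed form):

```python
# Hypothetical 1-D ridge check (illustration only). Minimizing
#   (1/denom) * sum_i s_i * (y_i - w)**2 + alpha * w**2
# gives w = sum(s*y) / (sum(s) + denom * alpha), where denom is
# 1 for L_1a, n for L_2a, and sum(s) for L_2b.
def fit(y, s, alpha, kind):
    S = sum(s)
    denom = {"L_1a": 1.0, "L_2a": float(len(y)), "L_2b": S}[kind]
    return sum(si * yi for si, yi in zip(s, y)) / (S + denom * alpha)

y, alpha = [1.0, 2.0, 10.0], 0.5
w_zero = fit(y, [1.0, 1.0, 0.0], alpha, "L_2a")  # zero weight on last sample
w_drop = fit(y[:2], [1.0, 1.0], alpha, "L_2a")   # last sample removed entirely
print(w_zero == w_drop)  # False: the 1/n factor still counts the sample

for kind in ("L_1a", "L_2b"):  # these two satisfy the property
    assert fit(y, [1.0, 1.0, 0.0], alpha, kind) == fit(y[:2], [1.0, 1.0], alpha, kind)
```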
Similarly, paraphrasing https://github.com/scikit-learn/scikit-learn/pull/14300#issuecomment-543177937, other properties we might want to enforce are:

- multiplying some sample weight by `N` is equivalent to repeating the corresponding samples `N` times. This is verified only by `L_1a` and `L_2b`. Example: for `L_2a`, setting all weights to 2 is equivalent to having 2x more samples only if one also sets `alpha = alpha / 2`.
- finally, that scaling all sample weights has no effect. This is only verified by `L_2b`; for both `L_1a` and `L_2a`, multiplying all sample weights by `k` is equivalent to setting `alpha = alpha / k`. This one is more controversial. Against enforcing it:
  - there are arguments for keeping a meaning for business metrics (e.g. https://github.com/scikit-learn/scikit-learn/issues/15651)

  In favor:
  - we don't want a coupling between using sample weights and regularization. Example: say one has a model without sample weights, and one wants to see whether applying sample weights (imbalanced dataset, sample uncertainty, etc.) improves it. Without this property it's difficult to conclude: is the evaluation metric better with sample weights because of the weights themselves, or simply because we now have a better regularized model? One has to consider these two factors simultaneously.
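Both properties can be checked numerically with a toy 1-D ridge model (my own sketch: squared loss plus `alpha * w**2`, closed-form minimizer `sum(s*y) / (sum(s) + denom * alpha)` with `denom` equal to 1, `n`, or `sum(s)` for `L_1a`, `L_2a`, `L_2b` respectively):

```python
# Toy numeric check of the repetition and scaling properties (illustration only).
def wmin(y, s, alpha, kind):
    S = sum(s)
    denom = {"L_1a": 1.0, "L_2a": float(len(y)), "L_2b": S}[kind]
    # closed-form minimizer of (1/denom)*sum s_i*(y_i-w)^2 + alpha*w^2
    return sum(si * yi for si, yi in zip(s, y)) / (S + denom * alpha)

y, alpha = [1.0, 2.0, 4.0], 0.3
times3 = ([1.0, 2.0, 4.0, 4.0, 4.0], [1.0] * 5)  # last sample repeated 3x
weight3 = (y, [1.0, 1.0, 3.0])                   # last weight multiplied by 3

# Repetition property: weight N == repeat N times, holds for L_1a and L_2b only.
for kind in ("L_1a", "L_2b"):
    assert abs(wmin(*times3, alpha, kind) - wmin(*weight3, alpha, kind)) < 1e-12
print(wmin(*times3, alpha, "L_2a") == wmin(*weight3, alpha, "L_2a"))  # False

# Scaling property: multiplying all weights by k -- only L_2b is invariant;
# for L_1a (and L_2a) it matches the original fit with alpha / k instead.
k, s1, sk = 2.0, [1.0] * 3, [2.0] * 3
assert abs(wmin(y, sk, alpha, "L_2b") - wmin(y, s1, alpha, "L_2b")) < 1e-12
assert abs(wmin(y, sk, alpha, "L_1a") - wmin(y, s1, alpha / k, "L_1a")) < 1e-12
```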
Whether we want/need consistency between the use of sample weights in metrics and in estimators is another question. I'm not convinced we do, since in most cases estimators don't care about the global scaling of the loss function, and these formulations are equivalent up to a scaling of the regularization parameter. So maybe using the `L_1a`-equivalent expression in metrics could be fine.
In any case, we need to decide the behavior we want. This is a blocker for:

- Poisson, Gamma and Tweedie regression: https://github.com/scikit-learn/scikit-learn/pull/14300
- adding sample weights in `ElasticNet` and `Lasso`: https://github.com/scikit-learn/scikit-learn/pull/15436
- other tests for sample weight consistency in linear models by @lorentzenchr in https://github.com/scikit-learn/scikit-learn/pull/15554
Note: Ridge actually seems to have a different sample weight behavior for dense and sparse input, as reported in https://github.com/scikit-learn/scikit-learn/issues/15438
@agramfort's opinion on this can be found in https://github.com/scikit-learn/scikit-learn/issues/15651#issuecomment-555210612 (if I understood correctly).
Please correct if I missed something (this could also use a more in depth review of how it is done in other libraries).
Issue Analytics

- State:
- Created: 4 years ago
- Reactions: 3
- Comments: 17 (16 by maintainers)
I think we have a general sense here that ordinarily the three invariances should hold.
However, I think we can find a bit more clarity about the exceptions to that rule, if we flip the question on its head and say: which parameters should be invariant to the scale of the weights (and which not)?
And then we have three classes:
- `DecisionTree*.min_samples_leaf`, and currently `Ridge.alpha`, `MLP*.batch_size`, `*ShuffleSplit.test_size`
- `DBSCAN.min_samples`
- `DecisionTree*.min_weight_fraction_leaf`, `PoissonRegressor.alpha`

I think we can see valid use cases for each approach for some of these parameters. I think there is scope to argue that we have made the wrong choices (or indeed that the current definition of the loss in `Ridge` wrt `alpha` and weights is a bug), or that we have been inflexible to relevant use cases, and that we can redefine or recreate some parameters.

To me, property 2 is what people generally think sample weights are, so we should ensure it. Property 1 is property 2 with `N = 0`.
I can't say much about property 3 from a use-case point of view. Since I come from theoretical physics I like invariants, but that's not a proper argument 😃 As I understand it, some estimators use `L_1a` (e.g. Ridge) and others `L_2a` (e.g. Lasso) to have a good default for the regularization parameter, independent of the number of samples for instance. Enforcing property 3 would require defining the default value as `alpha * n_samples`.
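The renormalization this implies can be sketched as follows (my assumption, not an existing API: multiplying the sum-normalized `L_2b` objective by `sum(s)` gives the unnormalized `L_1a` objective with the penalty scaled by the total weight, which equals `n_samples` for unit weights):

```python
# Sketch (hypothetical helper, not scikit-learn code): the alpha that makes
# an L_1a-style objective match an L_2b-style fit, since
#   L_2b * sum(s) == sum_i s_i * loss_i + (alpha * sum(s)) * penalty.
def equivalent_l1a_alpha(alpha, sample_weight):
    return alpha * sum(sample_weight)

# With unit weights, sum(sample_weight) == n_samples:
print(equivalent_l1a_alpha(0.5, [1.0] * 10))  # 5.0
```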