RFC Sample weight invariance properties
This can wait until after the release.
A discussion happened in the GLM PR https://github.com/scikit-learn/scikit-learn/pull/14300 about what properties we would like sample_weight
to have.
First, a short side comment about 3 ways sample weights (`s_i`) are currently used in loss functions with regularized generalized linear models in scikit-learn (as far as I understand):

- `L_1a = sum_i s_i * loss_i + alpha * penalty`
  For instance: `Ridge` (also `LogisticRegression`, where `C = 1/alpha`)
- `L_2a = (1/n) * sum_i s_i * loss_i + alpha * penalty`
  For instance: `SGDClassifier`? (maybe `Lasso`, `ElasticNet` once they are added?)
- `L_2b = (1/sum_i s_i) * sum_i s_i * loss_i + alpha * penalty`
  For instance: currently proposed in the GLM PR for `PoissonRegressor` etc.
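To make the three weightings concrete, here is a toy transcription (my own sketch, not scikit-learn code), assuming a squared loss and an L2 penalty on a single scalar parameter `w`; `L_1a` sums the weighted losses, `L_2a` additionally divides by `n`, and `L_2b` divides by the total weight:

```python
# Toy transcription of the three candidate objectives (illustration only):
# squared loss, L2 penalty on a single scalar parameter w.
def L_1a(w, y, s, alpha):
    return sum(si * (yi - w) ** 2 for si, yi in zip(s, y)) + alpha * w ** 2

def L_2a(w, y, s, alpha):
    # same weighted loss, normalized by the number of samples
    return L_1a(w, y, s, 0.0) / len(y) + alpha * w ** 2

def L_2b(w, y, s, alpha):
    # same weighted loss, normalized by the total weight
    return L_1a(w, y, s, 0.0) / sum(s) + alpha * w ** 2

y, s = [1.0, 2.0, 4.0], [1.0, 0.5, 2.0]
# Doubling every weight changes L_1a and L_2a but leaves L_2b untouched:
print(abs(L_2b(1.5, y, [2 * si for si in s], 0.1)
          - L_2b(1.5, y, s, 0.1)) < 1e-12)  # True
```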
For sample weights it's useful to think in terms of invariance properties, as they can be directly expressed in common tests. For instance:

- checking that zero sample weight is equivalent to ignoring the corresponding samples in https://github.com/scikit-learn/scikit-learn/pull/15015 (replaced by #17176) helped discover a number of issues. All of the above formulations should verify this, but it is actually verified only by `L_1a` and `L_2b` (in `L_2a`, the `1/n` factor still counts the zero-weighted samples).
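This zero-weight check can be reproduced with a hypothetical 1-D ridge model (my own sketch: squared loss plus `alpha * w**2` on a single scalar `w`, whose minimizer is available in closed form):

```python
# Hypothetical 1-D ridge check (illustration only). Minimizing
#   (1/denom) * sum_i s_i * (y_i - w)**2 + alpha * w**2
# gives w = sum(s*y) / (sum(s) + denom * alpha), where denom is
# 1 for L_1a, n for L_2a, and sum(s) for L_2b.
def fit(y, s, alpha, kind):
    S = sum(s)
    denom = {"L_1a": 1.0, "L_2a": float(len(y)), "L_2b": S}[kind]
    return sum(si * yi for si, yi in zip(s, y)) / (S + denom * alpha)

y, alpha = [1.0, 2.0, 10.0], 0.5
w_zero = fit(y, [1.0, 1.0, 0.0], alpha, "L_2a")  # zero weight on last sample
w_drop = fit(y[:2], [1.0, 1.0], alpha, "L_2a")   # last sample removed entirely
print(w_zero == w_drop)  # False: the 1/n factor still counts the sample

for kind in ("L_1a", "L_2b"):  # these two satisfy the property
    assert fit(y, [1.0, 1.0, 0.0], alpha, kind) == fit(y[:2], [1.0, 1.0], alpha, kind)
```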
Similarly, paraphrasing https://github.com/scikit-learn/scikit-learn/pull/14300#issuecomment-543177937, other properties we might want to enforce are:

- multiplying some sample weight by `N` is equivalent to repeating the corresponding samples `N` times. This is verified only by `L_1a` and `L_2b`. Example: for `L_2a`, setting all weights to 2 is equivalent to having 2x more samples only if one also sets `alpha = alpha / 2`.
- finally, that scaling all sample weights has no effect. This is only verified by `L_2b`; for both `L_1a` and `L_2a`, multiplying all sample weights by `k` is equivalent to setting `alpha = alpha / k`. This one is more controversial. Against enforcing it:
  - there are arguments for keeping a meaning for business metrics (e.g. https://github.com/scikit-learn/scikit-learn/issues/15651)

  In favor:
  - we don't want a coupling between using sample weights and regularization. Example: say one has a model without sample weights, and one wants to see whether applying sample weights (imbalanced dataset, sample uncertainty, etc.) improves it. Without this property it's difficult to conclude: is the evaluation metric better with sample weights because of the weights themselves, or simply because we now have a better regularized model? One has to consider these two factors simultaneously.
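Both properties can be checked numerically with a toy 1-D ridge model (my own sketch: squared loss plus `alpha * w**2`, closed-form minimizer `sum(s*y) / (sum(s) + denom * alpha)` with `denom` equal to 1, `n`, or `sum(s)` for `L_1a`, `L_2a`, `L_2b` respectively):

```python
# Toy numeric check of the repetition and scaling properties (illustration only).
def wmin(y, s, alpha, kind):
    S = sum(s)
    denom = {"L_1a": 1.0, "L_2a": float(len(y)), "L_2b": S}[kind]
    # closed-form minimizer of (1/denom)*sum s_i*(y_i-w)^2 + alpha*w^2
    return sum(si * yi for si, yi in zip(s, y)) / (S + denom * alpha)

y, alpha = [1.0, 2.0, 4.0], 0.3
times3 = ([1.0, 2.0, 4.0, 4.0, 4.0], [1.0] * 5)  # last sample repeated 3x
weight3 = (y, [1.0, 1.0, 3.0])                   # last weight multiplied by 3

# Repetition property: weight N == repeat N times, holds for L_1a and L_2b only.
for kind in ("L_1a", "L_2b"):
    assert abs(wmin(*times3, alpha, kind) - wmin(*weight3, alpha, kind)) < 1e-12
print(wmin(*times3, alpha, "L_2a") == wmin(*weight3, alpha, "L_2a"))  # False

# Scaling property: multiplying all weights by k -- only L_2b is invariant;
# for L_1a (and L_2a) it matches the original fit with alpha / k instead.
k, s1, sk = 2.0, [1.0] * 3, [2.0] * 3
assert abs(wmin(y, sk, alpha, "L_2b") - wmin(y, s1, alpha, "L_2b")) < 1e-12
assert abs(wmin(y, sk, alpha, "L_1a") - wmin(y, s1, alpha / k, "L_1a")) < 1e-12
```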
Whether we want/need consistency between the use of sample weights in metrics and in estimators is another question. I'm not convinced we do, since in most cases estimators don't care about the global scaling of the loss function, and these formulations are equivalent up to a scaling of the regularization parameter. So maybe using the `L_1a`-equivalent expression in metrics could be fine.
In any case, we need to decide the behavior we want. This is a blocker for:

- Poisson, Gamma and Tweedie regression: https://github.com/scikit-learn/scikit-learn/pull/14300
- adding sample weights in `ElasticNet` and `Lasso`: https://github.com/scikit-learn/scikit-learn/pull/15436
- other tests for sample weight consistency in linear models by @lorentzenchr in https://github.com/scikit-learn/scikit-learn/pull/15554
Note: Ridge actually seems to have a different sample weight behavior for dense and sparse input, as reported in https://github.com/scikit-learn/scikit-learn/issues/15438
@agramfort's opinion on this can be found in https://github.com/scikit-learn/scikit-learn/issues/15651#issuecomment-555210612 (if I understood correctly).
Please correct if I missed something (this could also use a more in depth review of how it is done in other libraries).
Issue Analytics

- State:
- Created: 4 years ago
- Reactions: 3
- Comments: 17 (16 by maintainers)
I think we have a general sense here that ordinarily the three invariances should hold.
However, I think we can find a bit more clarity about the exceptions to that rule, if we flip the question on its head and say: which parameters should be invariant to the scale of the weights (and which not)?
And then we have three classes:
- `DecisionTree*.min_samples_leaf`, and currently `Ridge.alpha`, `MLP*.batch_size`, `*ShuffleSplit.test_size`
- `DBSCAN.min_samples`
- `DecisionTree*.min_weight_fraction_leaf`, `PoissonRegressor.alpha`

I think we can see valid use cases for each approach for some of these parameters. I think there is scope to argue that we have made the wrong choices (or indeed that the current definition of the loss in `Ridge` wrt `alpha` and weights is a bug), or that we have been inflexible to relevant use cases, and that we can redefine or recreate some parameters.

To me, property 2 is what people generally think sample weights are, so we should ensure it. Property 1 is property 2 with `N = 0`.
I can't say much about property 3 from a use-case point of view. Since I come from theoretical physics I like invariants, but that's not a proper argument 😃 As I understand it, some estimators use `L_1a` (e.g. Ridge) and others `L_2a` (e.g. Lasso) to have a good default for the regularization parameter, independent of the number of samples for instance. Enforcing property 3 would require defining the default value as `alpha * n_samples`.
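The renormalization this implies can be sketched as follows (my assumption, not an existing API: multiplying the sum-normalized `L_2b` objective by `sum(s)` gives the unnormalized `L_1a` objective with the penalty scaled by the total weight, which equals `n_samples` for unit weights):

```python
# Sketch (hypothetical helper, not scikit-learn code): the alpha that makes
# an L_1a-style objective match an L_2b-style fit, since
#   L_2b * sum(s) == sum_i s_i * loss_i + (alpha * sum(s)) * penalty.
def equivalent_l1a_alpha(alpha, sample_weight):
    return alpha * sum(sample_weight)

# With unit weights, sum(sample_weight) == n_samples:
print(equivalent_l1a_alpha(0.5, [1.0] * 10))  # 5.0
```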