
RFC Sample weight invariance properties


This can wait until after the release.

A discussion happened in the GLM PR https://github.com/scikit-learn/scikit-learn/pull/14300 about what properties we would like sample_weight to have.

First, a short side comment about three ways sample weights (s_i) are currently used in loss functions with regularized generalized linear models in scikit-learn (as far as I understand),

  • L_1a = Σ_i s_i · loss(y_i, ŷ_i) + α · penalty(w). For instance: Ridge (also LogisticRegression, where C = 1/α).

  • L_2a = (1/n_samples) · Σ_i s_i · loss(y_i, ŷ_i) + α · penalty(w). For instance: SGDClassifier? (maybe Lasso, ElasticNet once they are added?)

  • L_2b = (1/Σ_i s_i) · Σ_i s_i · loss(y_i, ŷ_i) + α · penalty(w). For instance: currently proposed in the GLM PR for PoissonRegressor etc.

A small numerical sketch of the three objectives follows below.
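To make the difference concrete, here is a minimal sketch (plain NumPy, a toy squared-error loss with an L2 penalty; the data and names are made up purely for illustration) evaluating the three objectives at the same coefficient vector:

```python
import numpy as np

# Toy data: everything below is illustrative only.
rng = np.random.RandomState(0)
X = rng.randn(8, 3)            # design matrix
y = rng.randn(8)               # targets
s = rng.uniform(0.5, 2.0, 8)   # sample weights s_i
w = rng.randn(3)               # some coefficient vector
alpha = 1.0

loss = (y - X @ w) ** 2        # per-sample squared error loss(y_i, yhat_i)
penalty = np.sum(w ** 2)       # L2 penalty(w)

L_1a = np.sum(s * loss) + alpha * penalty              # unnormalized weighted sum
L_2a = np.sum(s * loss) / len(y) + alpha * penalty     # normalized by n_samples
L_2b = np.sum(s * loss) / np.sum(s) + alpha * penalty  # normalized by the total weight

print(L_1a, L_2a, L_2b)
# The three data terms differ only by a constant factor, so for a fixed alpha
# the objectives generally have different minimizers, i.e. different effective
# regularization strengths.
```

This constant-factor difference in the data term is exactly why the invariance properties below come out differently for each formulation.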

For sample weights it’s useful to think in terms of invariance properties, as they can be directly expressed in common tests. For instance,

  1. checking that zero sample weight is equivalent to ignoring the corresponding samples in https://github.com/scikit-learn/scikit-learn/pull/15015 (replaced by #17176) helped discover a number of issues. All of the above formulations should verify this (a minimal sketch of such a check is shown below).
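For illustration, a check of this kind could look like the following. This is not one of the actual common tests; it assumes a dense Ridge, which as far as I understand implements the L_1a-style objective, and the data are made up:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(42)
X = rng.randn(20, 3)
y = rng.randn(20)

# Zero out the last 5 samples via weights vs. dropping them entirely.
sw = np.r_[np.ones(15), np.zeros(5)]
est_zeroed = Ridge(alpha=1.0).fit(X, y, sample_weight=sw)
est_dropped = Ridge(alpha=1.0).fit(X[:15], y[:15])

# Property 1: both fits should give the same model.
np.testing.assert_allclose(est_zeroed.coef_, est_dropped.coef_, atol=1e-8)
np.testing.assert_allclose(est_zeroed.intercept_, est_dropped.intercept_, atol=1e-8)
```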

Similarly, paraphrasing https://github.com/scikit-learn/scikit-learn/pull/14300#issuecomment-543177937, other properties we might want to enforce are:

  2. multiplying some sample weight by N is equivalent to repeating the corresponding samples N times. This is verified only by L_1a and L_2b. Example: for L_2a, setting all weights to 2 is equivalent to having 2x more samples only if α is also divided by 2.

  3. Finally, that scaling all sample weights by a constant has no effect. This is only verified by L_2b. For both L_1a and L_2a, multiplying all sample weights by k is equivalent to setting α → α/k (a small numerical check of this equivalence is sketched after this list).

    This one is more controversial. There are arguments both against enforcing it and in favor of it. In favor,

    • we don’t want a coupling between using sample weights and regularization. Example: say one has a model without sample weights, and one wants to see whether applying sample weights (imbalanced dataset, sample uncertainty, etc.) improves it. Without this property it is difficult to conclude: is the evaluation metric better with sample weights because of the weights, or simply because we now have a better regularized model? One has to consider these two factors simultaneously.
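Here is the small numerical check of the α → α/k equivalence mentioned in point 3, using a closed-form weighted ridge solution for the L_1a formulation (a toy illustration in plain NumPy, not scikit-learn code):

```python
import numpy as np

def ridge_l1a(X, y, s, alpha):
    """Minimizer of L_1a = sum_i s_i (y_i - x_i . w)^2 + alpha * ||w||^2 (no intercept)."""
    S = np.diag(s)
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ S @ X + alpha * np.eye(n_features), X.T @ S @ y)

rng = np.random.RandomState(0)
X, y = rng.randn(30, 4), rng.randn(30)
s = rng.uniform(0.5, 2.0, 30)
k, alpha = 7.0, 1.5

# Property 3 does NOT hold for L_1a: rescaling the weights changes the solution...
assert not np.allclose(ridge_l1a(X, y, k * s, alpha), ridge_l1a(X, y, s, alpha))

# ...but it is exactly equivalent to dividing alpha by the same factor.
np.testing.assert_allclose(
    ridge_l1a(X, y, k * s, alpha),
    ridge_l1a(X, y, s, alpha / k),
    rtol=1e-10,
)
```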

Whether we want/need consistency between the use of sample_weight in metrics and in estimators is another question. I’m not convinced we do, since in most cases estimators don’t care about the global scaling of the loss function, and these formulations are equivalent up to a scaling of the regularization parameter. So maybe using the L_1a-equivalent expression in metrics could be fine.
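As a concrete (if rough) illustration of that last point, and assuming Ridge really does implement the L_1a-style objective on dense input as described above, rescaling all the weights should give the same fit as rescaling α (a sketch, not a reference test):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X, y = rng.randn(50, 5), rng.randn(50)
sw = rng.uniform(0.5, 2.0, 50)

# Doubling all weights vs. halving alpha: same minimizer for an L_1a-style objective.
a = Ridge(alpha=1.0).fit(X, y, sample_weight=2.0 * sw)
b = Ridge(alpha=0.5).fit(X, y, sample_weight=sw)
np.testing.assert_allclose(a.coef_, b.coef_, rtol=1e-6)
```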

In any case, we need to decide the behavior we want. This is a blocker for:

Note: Ridge actually seems to have different sample weight behavior for dense and sparse input, as reported in https://github.com/scikit-learn/scikit-learn/issues/15438

@agramfort’s opinion on this can be found in https://github.com/scikit-learn/scikit-learn/issues/15651#issuecomment-555210612 (if I understood correctly).

Please correct me if I missed something (this could also use a more in-depth review of how sample weights are handled in other libraries).


Top GitHub Comments

jnothman commented, Nov 19, 2019

I think we have a general sense here that ordinarily the three invariances should hold.

However, I think we can find a bit more clarity about the exceptions to that rule, if we flip the question on its head and say: which parameters should be invariant to the scale of the weights (and which not)?

And then we have three classes:

  1. n_samples-sensitive: parameters which (should) disregard the weights and are affected by the number of samples (e.g. DecisionTree*.min_samples_leaf, and currently Ridge.alpha, MLP*.batch_size, *ShuffleSplit.test_size).
  2. scale-sensitive: parameters which (should) be affected by the scale of the weights but not the number of samples, such as DBSCAN.min_samples.
  3. scale-invariant: parameters which (should) be invariant to the scale of the weights (e.g. DecisionTree*.min_weight_fraction_leaf, PoissonRegressor.alpha).

I think we can see valid use cases for each approach for some of these parameters. I think there is scope to argue that we have made the wrong choices (or indeed that the current definition of loss in Ridge wrt alpha and weights is a bug), or that we have been inflexible to relevant use cases, and that we can redefine or recreate some parameters.
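As a rough illustration of classes 2 and 3 above (toy data; this reflects my understanding of the current behavior of these parameters, so treat it as a sketch rather than a reference):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
w = rng.uniform(0.5, 1.5, 100)

# Class 3 (scale-invariant): min_weight_fraction_leaf only looks at the fraction
# of the total weight that falls into a leaf, so rescaling all weights is a no-op.
t1 = DecisionTreeClassifier(min_weight_fraction_leaf=0.1, random_state=0)
t2 = DecisionTreeClassifier(min_weight_fraction_leaf=0.1, random_state=0)
t1.fit(X, y, sample_weight=w)
t2.fit(X, y, sample_weight=10 * w)
assert np.array_equal(t1.predict(X), t2.predict(X))

# Class 2 (scale-sensitive): in DBSCAN a point is a core point when the total
# weight in its eps-neighborhood reaches min_samples, so rescaling the weights
# changes which points are core points and hence the clustering.
d1 = DBSCAN(eps=0.5, min_samples=5).fit(X, sample_weight=w)
d2 = DBSCAN(eps=0.5, min_samples=5).fit(X, sample_weight=10 * w)
print(np.array_equal(d1.labels_, d2.labels_))  # typically False
```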

jeremiedbb commented, Nov 19, 2019

To me, property 2 is what people generally think sample weights are, so we should ensure that. Property 1 is property 2 with N=0.

I can’t say much about property 3 from a use case point of view. Since I come from theoretical physics I like invariants, but that’s not a proper argument 😃 As I understand it, some estimators use L_1a (e.g. lasso) and others L_2a (e.g. ridge) to have a good default for the regularization parameter, independent of the number of samples for instance. Enforcing property 3 would require defining the default value as α * n_samples.

