Use the function check_scalar for parameters validation

Background / Objective

Use the function check_scalar for parameter validation. For a given parameter, the validation function checks that the value has an acceptable data type and that it lies within the allowed range of values (interval).

References Issue #20724: “Use check_scalar for parameters validation” (with notes by @glemaitre, @jjerphan, @genvalen)
References PR #20723: “MNT use check_scalar to validate scalar in AffinityPropagation”. This is an example PR by @glemaitre.

A helper function exists in scikit-learn which validates a scalar value: sklearn.utils.check_scalar (see its documentation). It is used to validate parameters of classes (and possibly functions). Most of the current classes in scikit-learn do not use this helper function. We want to refactor the code so that it does use this standard helper function. Using it will help to get consistent error types and messages.

If there is a scalar argument that isn’t being checked, we want to validate it using the check_scalar function. In some cases the argument is already being checked, but not with check_scalar; in that case, we refactor the code to use it. (Refactoring means making changes to the code that result in the same output as before.)

The function check_scalar is defined in scikit-learn/sklearn/utils/validation.py.
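To make the two checks concrete, here is a minimal sketch of a validator in the same spirit. It is a simplified stand-in, not scikit-learn's implementation: the name check_scalar_sketch, the exact error wording, and the damping interval used in the call are assumptions chosen for illustration.

```python
import numbers


def check_scalar_sketch(x, name, target_type, *, min_val=None, max_val=None,
                        include_boundaries="both"):
    """Simplified stand-in for a scalar validator (not scikit-learn's code)."""
    # 1. Type check: a value of the wrong type raises TypeError.
    if not isinstance(x, target_type):
        raise TypeError(
            f"{name} must be an instance of {target_type}, not {type(x)}."
        )

    # 2. Interval check: decide which bounds are inclusive.
    if include_boundaries not in ("both", "left", "right", "neither"):
        raise ValueError(
            f"Unknown value for include_boundaries: {include_boundaries!r}"
        )
    left_closed = include_boundaries in ("both", "left")
    right_closed = include_boundaries in ("both", "right")

    if min_val is not None:
        too_small = x < min_val if left_closed else x <= min_val
        if too_small:
            op = ">=" if left_closed else ">"
            raise ValueError(f"{name} == {x}, must be {op} {min_val}.")
    if max_val is not None:
        too_large = x > max_val if right_closed else x >= max_val
        if too_large:
            op = "<=" if right_closed else "<"
            raise ValueError(f"{name} == {x}, must be {op} {max_val}.")
    return x


# Illustrative call: a parameter restricted to the half-open interval [0.5, 1.0).
check_scalar_sketch(0.5, name="damping", target_type=numbers.Real,
                    min_val=0.5, max_val=1.0, include_boundaries="left")
```

A wrong type surfaces as TypeError and an out-of-range value as ValueError, which is what makes error types and messages consistent across estimators once they all go through one helper.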

Prerequisites

This is an intermediate-level issue for second-time contributors. It requires the following experience:

You have already set up your working virtual environment.
You have submitted at least one other pull request to this library. (You are familiar with using git and submitting pull requests.)
You are familiar with the scikit-learn code base.
You have experience using pytest.
To find the range of possible values for an estimator's parameters, check whether validation code has already been written in the scikit-learn library; that information might be available there.
Sometimes validation code is not available in the scikit-learn library. It is helpful to be familiar with the acceptable range of values (minimum and maximum) for the arguments for the estimator you are working on. If you are not familiar with an estimator, you can reference other sources outside of scikit-learn documentation to get that information.

Steps

Make sure you have activated your virtual environment.
Make sure you have created a separate branch from main before editing files for your new contribution. Refer to our contributing guidelines for more information.
Find a class whose constructor takes scalar numeric parameters. There are some listed below in the “Classes to Update” section.
Work on one estimator at a time and submit each in a separate pull request.
Identify the scalar numeric parameters (those of type int, float) for that class.
- Examples of scalar parameters are: alpha, damping, max_iter, convergence_iter, tol, and verbose.
- You can infer whether a parameter is a scalar by looking at the documentation.
- Example PR: AffinityPropagation scalar parameters
For each of the scalar numeric parameters, determine the acceptable range of values. Look at minimum and maximum values. Sometimes that information is included in the parameter definition in the documentation. Sometimes you may need to reference other sources. If minimum and maximum values are missing, we should add them.
Add tests. Note: the tests must fail before adding validation. Example PR by @glemaitre added a parametrised test for parameters.
If any associated class attributes are scalar numeric but are not yet checked with check_scalar, validate them as well where possible.
Validation belongs inside the fit method: add check_scalar calls where needed. Generally, this is not done in the constructor but rather just before calling the core of the method. For instance, in #20723, @glemaitre added check_scalar calls just before the call to affinity_propagation, which is the core of the method.
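The placement described in the last step can be sketched as follows. ToyEstimator and _check_positive_int are hypothetical names; the helper stands in for a check_scalar call so that the sketch runs without scikit-learn installed, and the "core" computation is a placeholder.

```python
import numbers


def _check_positive_int(value, name):
    # Hypothetical stand-in for a check_scalar call.
    if isinstance(value, bool) or not isinstance(value, numbers.Integral):
        raise TypeError(
            f"{name} must be an instance of int, not {type(value).__name__}."
        )
    if value < 1:
        raise ValueError(f"{name} == {value}, must be >= 1.")


class ToyEstimator:
    """Hypothetical estimator used only to show where validation lives."""

    def __init__(self, max_iter=200):
        # The constructor only stores the parameter; no validation here.
        self.max_iter = max_iter

    def fit(self, X, y=None):
        # Validate just before the core computation.
        _check_positive_int(self.max_iter, name="max_iter")
        self.n_iter_ = min(self.max_iter, len(X))  # placeholder for the core
        return self


est = ToyEstimator(max_iter=0)  # constructing with a bad value does not raise
try:
    est.fit([[0.0], [1.0]])
except ValueError as exc:
    print(exc)  # max_iter == 0, must be >= 1.
```

Keeping the check in fit means that a value changed after construction (for example through set_params) is still validated before it is used.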

Notes

The pull request can be named: “MAINT Use check_scalar to validate scalar in: [EstimatorName]”
Within an estimator there may be multiple scalar arguments. (For one estimator, validation for all of its arguments should be submitted in one pull request.)
Include explicit parameter names (even if they are not required), as a best practice. In check_scalar, name can be passed positionally, so the keyword is not required; include it in the function call anyway for readability.

import numbers

# Called inside the estimator's fit method.
check_scalar(
    self.learning_rate,
    name="learning_rate",
    target_type=numbers.Real,
    min_val=0,
    max_val=None,  # default
    include_boundaries="both",  # default
)

Tests

Suggestion: You may want to write the test before writing the validation code. Writing the test first shows you where any existing validation is; if validation already exists, it also tells you the range of possible values.

Generally speaking, this is how to connect the .py file with its associated test. Check to see if the test exists in the test_*.py file. If it does not, we will need to create a test.

Where the class is: sklearn/cluster/_affinity_propagation.py
Where the related class test file is: sklearn/cluster/tests/test_affinity_propagation.py
The name of the test: def test_affinity_propagation_params_validation(....)

The point of a test is that if an incorrect parameter value is given, the program gives an error message. We want to test for values that are outside of the acceptable range. We want to make sure the program is catching that. To run an individual validation test, here are examples of the code to run at the terminal:

pytest sklearn/cluster/tests/test_affinity_propagation.py::test_affinity_propagation_params_validation
pytest sklearn/linear_model/_glm/tests/test_glm.py::test_glm_max_iter_argument
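A validation test of this kind usually pairs each invalid value with the error type and message it should produce. In the scikit-learn test suite this is typically written with pytest.mark.parametrize and pytest.raises(..., match=...); the sketch below is a dependency-free analogue using a plain loop and re.search. The function fit_toy and its messages are hypothetical stand-ins for constructing an estimator and calling fit.

```python
import re


def fit_toy(max_iter):
    """Hypothetical stand-in for Estimator(max_iter=...).fit(X)."""
    if isinstance(max_iter, bool) or not isinstance(max_iter, int):
        raise TypeError(
            f"max_iter must be an instance of int, not {type(max_iter).__name__}."
        )
    if max_iter < 1:
        raise ValueError(f"max_iter == {max_iter}, must be >= 1.")
    return max_iter


# One row per invalid input: (value, expected exception type, message pattern).
INVALID_CASES = [
    (0, ValueError, r"max_iter == 0, must be >= 1"),
    (-5, ValueError, r"max_iter == -5, must be >= 1"),
    (1.5, TypeError, r"max_iter must be an instance of int"),
]


def test_toy_params_validation():
    for value, err_type, pattern in INVALID_CASES:
        try:
            fit_toy(value)
        except err_type as exc:
            # Check the message too, not only the exception type.
            assert re.search(pattern, str(exc)), f"unexpected message: {exc}"
        else:
            raise AssertionError(f"{value!r} did not raise {err_type.__name__}")


test_toy_params_validation()
```

Matching on the message as well as the exception type is what lets a reviewer confirm that the wording is consistent across estimators.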

Consistency Checks for Reviewers

PR prefix should be MAINT (not MNT)
The check_scalar call should explicitly include name (e.g. name="n_estimators", not just "n_estimators")
Interval ranges should use the text “must be” (not “should be”)
Ensure error messages in tests are present

Examples for Reference

sklearn/cluster/_affinity_propagation.py (@glemaitre) #20723
sklearn/linear_model/_ridge.py (@ArturoAmorQ) #21341

Classes Updated

sklearn/neighbors/_nca.py
sklearn/decomposition/_pca.py
sklearn/feature_extraction/text.py (@AlekLefebvre) #20752
sklearn/preprocessing/_discretization.py
sklearn/cluster/_affinity_propagation.py (@glemaitre) #20723
sklearn/cluster/_birch.py (@SanjayMarreddi) #20816
sklearn/cluster/_dbscan.py (@SanjayMarreddi) #20816
sklearn/ensemble/_weight_boosting.py (AdaBoostClassifier) (@genvalen) #21442
sklearn/linear_model/_ridge.py (Ridge) (@ArturoAmorQ) #21341
sklearn/linear_model/_ridge.py (RidgeCV) (@ArturoAmorQ) #21606
sklearn/ensemble/_weight_boosting.py (AdaBoostRegressor) (@genvalen) #21605
sklearn/ensemble/_voting.py (VotingClassifier, VotingRegressor) (@genvalen) #22204
sklearn/linear_model/_glm/glm.py (GeneralizedLinearRegressor) (@reshamas) #21946
sklearn/linear_model/_glm/glm.py (PoissonRegressor) (@reshamas)
sklearn/linear_model/_glm/glm.py (GammaRegressor) (@reshamas)
sklearn/linear_model/_glm/glm.py (TweedieRegressor) (@reshamas)
sklearn/tree/_classes.py (BaseDecisionTree) (@genvalen) #21990
sklearn/cluster/_bicluster.py (SpectralBiClustering) (@creatornadiran) #20817
sklearn/cluster/_bicluster.py (SpectralCoClustering) (@creatornadiran) #20817
sklearn/cluster/_bicluster.py (SpectralClustering) (@hvassard) #21881
sklearn/ensemble/_gb.py (BaseGradientBoosting) (@genvalen) #21632
sklearn/linear_model/_coordinate_descent.py (LassoCV) (@ArturoAmorQ) #22305

Classes to Update

sklearn/linear_model/_coordinate_descent.py (Lasso) (@ArturoAmorQ)
sklearn/linear_model/_stochastic_gradient.py (SGDClassifier) (@reshamas)
- add valid intervals: #22115
sklearn/linear_model/_bayes (BayesianRidge) (@matiasrvazquez)
sklearn/linear_model/_bayes (ARDRegression) (@matiasrvazquez)
sklearn/ensemble/_stacking.py (StackingClassifier) (@genvalen)
sklearn/ensemble/_stacking.py (StackingRegressor) (@genvalen)

Issue Analytics

State:
Created 2 years ago
Reactions: 2
Comments: 7 (7 by maintainers)

Top GitHub Comments

3 reactions

jeremiedbb commented, May 25, 2022

I think this issue should be closed in favor of https://github.com/scikit-learn/scikit-learn/issues/23462, which is the new way to do param validation.

1 reaction

matiasrvazquez commented, Feb 14, 2022

Hi all! I’d like to contribute to this issue. It seems all classes in the list “classes to update” have someone working on them, so I looked for some other estimators to work on.

I will be working on sklearn.linear_model._bayes, which includes the classes BayesianRidge and ARDRegression.
