Use the function check_scalar for parameters validation
Background / Objective
Use the function check_scalar for parameter validation. The function checks two things for a parameter: that it has an acceptable data type, and that its value falls within the allowed range of values (interval).
- References Issue #20724: “Use check_scalar for parameters validation” (with notes by @glemaitre, @jjerphan, @genvalen)
- References PR #20723. “MNT use check_scalar to validate scalar in AffinityPropagation”. This is an example PR by @glemaitre.
A helper function exists in scikit-learn that validates a scalar value: sklearn.utils.check_scalar (see its documentation).
It is used to validate parameters of classes (and possibly functions). Most of the current classes in scikit-learn do not use this helper function. We want to refactor the code so that it does use this standard helper function. Using it will help to get consistent error types and messages.
If there is a scalar argument that isn’t being checked, we want to validate it using the check_scalar function. In some cases the argument is already being checked, but not with the check_scalar function. In that case, we refactor the code. (Refactoring means making changes to the code that result in the same output as before.)
The function check_scalar is defined in scikit-learn/sklearn/utils/validation.py.
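As a rough illustration of the two checks described above (data type and interval), the sketch below calls check_scalar on a few values. The helper `describe`, the parameter name "damping", and the bounds 0.5 and 1.0 are assumptions chosen for the example, not values taken from any particular estimator.

```python
import numbers

from sklearn.utils import check_scalar


def describe(value):
    """Return a short label for how check_scalar treats `value`.

    The name "damping" and the interval [0.5, 1.0] are illustrative only.
    """
    try:
        check_scalar(
            value,
            name="damping",
            target_type=numbers.Real,
            min_val=0.5,
            max_val=1.0,
        )
    except TypeError:
        return "TypeError"  # the value has the wrong data type
    except ValueError:
        return "ValueError"  # right type, but outside the interval
    return "ok"


print(describe(0.7))    # a valid value
print(describe("0.7"))  # a string instead of a number
print(describe(2.0))    # a number above max_val
```

The point of routing every estimator through the same helper is that a wrong type always surfaces as a TypeError and an out-of-range value as a ValueError, with messages that name the parameter.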
Prerequisites
This is an intermediate-level issue for second-time contributors. It requires the following experience:
- You have already set up your working virtual environment.
- You have submitted at least one other pull request to this library. (You are familiar with using git and submitting pull requests.)
- You are familiar with the scikit-learn code base.
- You have experience using pytest.
- To find the range of possible values for an estimator's parameters, check whether some validation code has already been written in the scikit-learn library; that information might be available there.
- Sometimes validation code is not available in the scikit-learn library. It is helpful to be familiar with the acceptable range of values (minimum and maximum) for the arguments for the estimator you are working on. If you are not familiar with an estimator, you can reference other sources outside of scikit-learn documentation to get that information.
Steps
- Make sure you have activated your virtual environment.
- Make sure you have created a separate branch from main before editing files for your new contribution. Refer to our contributing guidelines for more information.
- Find a class whose constructor has scalar numeric parameters. There are some listed below in the “Classes to Update” section.
- Work on one estimator at a time and submit each in a separate pull request.
- Identify the scalar numeric parameters (those of type int or float) for that class.
  - Examples of scalar parameters are: alpha, damping, max_iter, convergence_iter, tol, and verbose.
  - You can infer whether a parameter is a scalar by looking at the documentation.
  - Example PR: AffinityPropagation scalar parameters
- For each of the scalar numeric parameters, determine the acceptable range of values. Look at minimum and maximum values. Sometimes that information is included in the parameter definition in the documentation. Sometimes you may need to reference other sources. If minimum and maximum values are missing, we should add them.
- Add tests. Note: the tests must fail before adding validation. Example PR by @glemaitre added a parametrised test for parameters.
- If any of the class's scalar numeric attributes are not being checked with check_scalar, those are the ones to work on.
- Validation should happen within the fit method (def fit). Here, validation means adding check_scalar calls to the class, so add check_scalar calls where needed. Generally, this is not done in the constructor but rather just before calling the core of the method. For instance, in the case of #20723, @glemaitre added check_scalar calls just before the call to affinity_propagation, which is the core of the method.
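The placement described in the last step can be sketched as follows. `ToyEstimator`, its parameters, and the chosen bounds are hypothetical and exist only to show validation happening in fit rather than in the constructor; this is not the code from #20723.

```python
import numbers

from sklearn.utils import check_scalar


class ToyEstimator:
    """Hypothetical estimator used only to show where validation goes."""

    def __init__(self, max_iter=100, tol=1e-4):
        # The constructor only stores the parameters; nothing is validated here.
        self.max_iter = max_iter
        self.tol = tol

    def fit(self, X, y=None):
        # Validate the scalar parameters just before the core computation.
        check_scalar(
            self.max_iter, name="max_iter", target_type=numbers.Integral, min_val=1
        )
        check_scalar(self.tol, name="tol", target_type=numbers.Real, min_val=0.0)
        # Stand-in for the core of the method.
        self.n_samples_seen_ = len(X)
        return self


est = ToyEstimator(max_iter=0)  # constructing with a bad value does not raise
try:
    est.fit([[0.0], [1.0]])
except ValueError as exc:
    print(f"fit rejected the parameter: {exc}")
```

Keeping the constructor free of checks means an invalid value is reported when fit is called, which is also where a test for the new validation would trigger it.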
Notes
- The pull request can be named: “MAINT Use check_scalar to validate scalar in: [EstimatorName]”
- Work on one estimator at a time and submit each in a separate pull request.
- Within an estimator there may be multiple scalar arguments. (For one estimator, validation for multiple arguments should be submitted in one pull request.)
- Include explicit parameter names (even if they are not required), as a best practice. In this function, the parameter name does not have to be passed as a keyword, since it can also be given positionally. You should still write it as a keyword in the function call for readability.
    check_scalar(
        self.learning_rate,
        name="learning_rate",
        target_type=numbers.Real,
        min_val=0,
        max_val=None,  # default
        include_boundaries="both",  # default
    )
Tests
Suggestion: You may want to write the test before writing the validation code. Writing the test first shows you where the existing validation is. If validation already exists, the test will also reveal the range of possible values.
Generally speaking, this is how to connect the .py file with its associated test. Check to see if the test exists in the test_*.py file. If it does not, we will need to create a test.
- Where the class is: sklearn/cluster/_affinity_propagation.py
- Where the related class test file is: sklearn/cluster/tests/test_affinity_propagation.py
- The name of the test: def test_affinity_propagation_params_validation(....)
The point of a test is that if an incorrect parameter value is given, the program gives an error message. We want to test for values that are outside of the acceptable range. We want to make sure the program is catching that. To run an individual validation test, here are examples of the code to run at the terminal:
pytest sklearn/cluster/tests/test_affinity_propagation.py::test_affinity_propagation_params_validation
pytest sklearn/linear_model/_glm/tests/test_glm.py::test_glm_max_iter_argument
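A parametrized validation test along these lines might look like the sketch below. `ToyEstimator`, its parameters, and the bounds are hypothetical stand-ins for the class under test; the match strings only check that the parameter name appears in the message, since the exact wording of check_scalar's messages is not assumed here.

```python
import numbers

import pytest
from sklearn.utils import check_scalar


class ToyEstimator:
    """Hypothetical estimator; stands in for the class under test."""

    def __init__(self, damping=0.5, max_iter=200):
        self.damping = damping
        self.max_iter = max_iter

    def fit(self, X, y=None):
        # Illustrative bounds; a real estimator defines its own intervals.
        check_scalar(
            self.damping,
            name="damping",
            target_type=numbers.Real,
            min_val=0.5,
            max_val=1.0,
        )
        check_scalar(
            self.max_iter, name="max_iter", target_type=numbers.Integral, min_val=1
        )
        return self


@pytest.mark.parametrize(
    "params, err_type, err_match",
    [
        ({"damping": 0.0}, ValueError, "damping"),
        ({"damping": "half"}, TypeError, "damping"),
        ({"max_iter": 0}, ValueError, "max_iter"),
        ({"max_iter": 1.5}, TypeError, "max_iter"),
    ],
)
def test_toy_estimator_params_validation(params, err_type, err_match):
    """Each invalid value must raise the expected error type at fit time."""
    with pytest.raises(err_type, match=err_match):
        ToyEstimator(**params).fit([[0.0], [1.0]])
```

Written before the validation exists, a test like this fails because no error is raised, which is the behaviour the steps above ask for; once the check_scalar calls are in place, it passes.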
Consistency Checks for Reviewers
- PR prefix should be MAINT (not MNT)
- The check_scalar call should explicitly include name (Ex: name="n_estimators", not "n_estimators",)
- Interval ranges should use the text "must be" (not "should be")
- Ensure error messages in tests are present
Examples for Reference
- sklearn/cluster/_affinity_propagation.py (@glemaitre) #20723
- sklearn/linear_model/_ridge.py (@ArturoAmorQ) #21341
Classes Updated
- sklearn/neighbors/_nca.py
- sklearn/decomposition/_pca.py
- sklearn/feature_extraction/text.py (@AlekLefebvre) #20752
- sklearn/preprocessing/_discretization.py
- sklearn/cluster/_affinity_propagation.py (@glemaitre) #20723
- sklearn/cluster/_birch.py (@SanjayMarreddi) #20816
- sklearn/cluster/_dbscan.py (@SanjayMarreddi) #20816
- sklearn/ensemble/_weight_boosting.py (AdaBoostClassifier) (@genvalen) #21442
- sklearn/linear_model/_ridge.py (Ridge) (@ArturoAmorQ) #21341
- sklearn/linear_model/_ridge.py (RidgeCV) (@ArturoAmorQ) #21606
- sklearn/ensemble/_weight_boosting.py (AdaBoostRegressor) (@genvalen) #21605
- sklearn/ensemble/_voting.py (VotingClassifier, VotingRegressor) (@genvalen) #22204
- sklearn/linear_model/_glm/glm.py (GeneralizedLinearRegressor) (@reshamas) #21946
  - [x] sklearn/linear_model/_glm/glm.py (PoissonRegressor) (@reshamas)
  - [x] sklearn/linear_model/_glm/glm.py (GammaRegressor) (@reshamas)
  - [x] sklearn/linear_model/_glm/glm.py (TweedieRegressor) (@reshamas)
- sklearn/tree/_classes.py (BaseDecisionTree) (@genvalen) #21990
- sklearn/cluster/_bicluster.py (SpectralBiClustering) (@creatornadiran) #20817
- sklearn/cluster/_bicluster.py (SpectralCoClustering) (@creatornadiran) #20817
- sklearn/cluster/_bicluster.py (SpectralClustering) (@hvassard) #21881
- sklearn/ensemble/_gb.py (BaseGradientBoosting) (@genvalen) #21632
- sklearn/linear_model/_coordinate_descent.py (LassoCV) (@ArturoAmorQ) #22305
- sklearn/linear_model/_ridge.py (RidgeCV) (@ArturoAmorQ) #21606
Classes to Update
- sklearn/linear_model/_coordinate_descent.py (Lasso) (@ArturoAmorQ)
- sklearn/linear_model/_stochastic_gradient.py (SGDClassifier) (@reshamas)
  - add valid intervals: #22115
- sklearn/linear_model/_bayes (BayesianRidge) (@matiasrvazquez)
- sklearn/linear_model/_bayes (ARDRegression) (@matiasrvazquez)
- sklearn/ensemble/_stacking.py (StackingClassifier) (@genvalen)
- sklearn/ensemble/_stacking.py (StackingRegressor) (@genvalen)
Top GitHub Comments
I think this issue should be closed in favor of https://github.com/scikit-learn/scikit-learn/issues/23462, which is the new way to do param validation.
Hi all! I’d like to contribute to this issue. It seems all classes in the list “classes to update” have someone working on them, so I looked for some other estimators to work on.
I will be working on sklearn.linear_model._bayes, which includes the classes BayesianRidge and ARDRegression.