AutoMLSearch uses slightly different splits for each pipeline
Repro (you need to check out the random-split-seeds branch because we don’t store the state of the random seed of the data split):
from evalml.demos import load_breast_cancer
from evalml.automl import AutoMLSearch
from evalml.utils.gen_utils import check_random_state_equality
import numpy as np
import itertools


def make_seed_from_state(state):
    # Rebuild a RandomState object from a stored state tuple.
    rs = np.random.RandomState()
    rs.set_state(state)
    return rs


def check_random_state(state_1, state_2):
    rs_1 = make_seed_from_state(state_1)
    rs_2 = make_seed_from_state(state_2)
    return check_random_state_equality(rs_1, rs_2)


X, y = load_breast_cancer()
automl = AutoMLSearch(max_batches=2, problem_type="binary")
automl.search(X, y)

# Compare the random state recorded before each data split; they are not all equal.
seeds_equal = []
for i, j in itertools.combinations(range(14), 2):
    are_equal = check_random_state(automl.data_split_seeds[i], automl.data_split_seeds[j])
    seeds_equal.append(are_equal)

assert not all(seeds_equal)
The issue with having a different random state every time data_split.split is called is that the split will be slightly different each time:
from sklearn.model_selection import StratifiedKFold

seed_1 = make_seed_from_state(automl.data_split_seeds[0])
seed_2 = make_seed_from_state(automl.data_split_seeds[1])
split_1 = StratifiedKFold(n_splits=3, random_state=seed_1, shuffle=True)
split_2 = StratifiedKFold(n_splits=3, random_state=seed_2, shuffle=True)

# The folds produced for the first two pipelines do not line up.
for (train_index_1, test_index_1), (train_index_2, test_index_2) in zip(split_1.split(X, y), split_2.split(X, y)):
    assert not set(train_index_1) == set(train_index_2)
    assert not set(test_index_1) == set(test_index_2)
I think we should change this because it introduces more variability into the automl results than necessary and prevents a true apples-to-apples comparison between pipelines. That said, I don’t think fixing it would substantially change the results of automl search (the pipeline rankings would probably stay the same).
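To make “slightly different” concrete, here is a small sketch that measures how much of each test fold the two splitters above have in common (fold_overlap is a hypothetical helper written for this issue, not an EvalML function):

def fold_overlap(splitter_a, splitter_b, X, y):
    # Fraction of test indices shared between corresponding folds of two splitters.
    overlaps = []
    for (_, test_a), (_, test_b) in zip(splitter_a.split(X, y), splitter_b.split(X, y)):
        overlaps.append(len(set(test_a) & set(test_b)) / len(test_a))
    return overlaps

print(fold_overlap(split_1, split_2, X, y))  # each fold shares only part of its rows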
One possible solution is to create the split class with an integer random seed rather than the np.random.RandomState that is stored in the automl state. With an integer seed, I believe the indices will be the same across repeated calls:
from sklearn.model_selection import StratifiedKFold

split_1 = StratifiedKFold(n_splits=3, random_state=10, shuffle=True)

# With an integer seed, two passes over the same splitter yield identical folds.
first_train_set = []
first_test_set = []
for (train_index_1, test_index_1) in split_1.split(X, y):
    first_train_set.append(set(train_index_1))
    first_test_set.append(set(test_index_1))

second_train_set = []
second_test_set = []
for (train_index_2, test_index_2) in split_1.split(X, y):
    second_train_set.append(set(train_index_2))
    second_test_set.append(set(test_index_2))

assert first_train_set == second_train_set
assert first_test_set == second_test_set
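For reference, a hedged sketch of what the change could look like (stored_state and data_splitter are illustrative names, not EvalML attributes): draw a single integer seed from the stored random state once, then pass that int to the splitter so every pipeline evaluation sees identical folds.

import numpy as np
from sklearn.model_selection import StratifiedKFold

stored_state = np.random.RandomState(0)      # stands in for the RandomState automl keeps
int_seed = stored_state.randint(2**31 - 1)   # drawn once, reused for every pipeline

data_splitter = StratifiedKFold(n_splits=3, shuffle=True, random_state=int_seed)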
Top GitHub Comments
@bchen1116 get_random_state returns a np.random.RandomState. I think since we create the splitter with a reference to this mutable random state, calling split for one pipeline will change the state of the random state for the next pipeline when we call split again. For that reason, the splits are slightly different for each pipeline.

@dsherry I think we may have to pass in an int as the random_state in the default splits. Using np.random.RandomState can lead to different splits in subsequent calls. For example, this will fail; however, passing in random_state=10 will pass. I think an alternative would be to create a new split instance for each pipeline, but I don’t like that better.
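The example referenced in that comment is not reproduced above; as a stand-in, here is a minimal self-contained sketch of the failing versus passing behavior it describes (illustrative code, not the original snippet):

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# Sharing one mutable RandomState: each call to split() advances the state,
# so consecutive calls shuffle differently and produce different folds.
shared_state = np.random.RandomState(0)
splitter = StratifiedKFold(n_splits=3, shuffle=True, random_state=shared_state)
folds_a = [set(test) for _, test in splitter.split(X, y)]
folds_b = [set(test) for _, test in splitter.split(X, y)]
print(folds_a == folds_b)  # almost certainly False

# With an integer seed, scikit-learn builds a fresh RandomState from the int
# on every call, so repeated calls produce identical folds.
splitter_int = StratifiedKFold(n_splits=3, shuffle=True, random_state=10)
folds_c = [set(test) for _, test in splitter_int.split(X, y)]
folds_d = [set(test) for _, test in splitter_int.split(X, y)]
print(folds_c == folds_d)  # True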