AutoMLSearch uses slightly different splits for each pipeline
Repro (you need to check out the random-split-seeds branch because we don’t store the state of the random seed of the data split):
from evalml.demos import load_breast_cancer
from evalml.automl import AutoMLSearch
from evalml.utils.gen_utils import check_random_state_equality
import numpy as np
import itertools


def make_seed_from_state(state):
    # Rebuild a RandomState object from a stored state tuple.
    rs = np.random.RandomState()
    rs.set_state(state)
    return rs


def check_random_state(state_1, state_2):
    rs_1 = make_seed_from_state(state_1)
    rs_2 = make_seed_from_state(state_2)
    return check_random_state_equality(rs_1, rs_2)


X, y = load_breast_cancer()
automl = AutoMLSearch(max_batches=2, problem_type="binary")
automl.search(X, y)

# Compare the random state recorded before each data split; they are not all equal.
seeds_equal = []
for i, j in itertools.combinations(range(14), 2):
    are_equal = check_random_state(automl.data_split_seeds[i], automl.data_split_seeds[j])
    seeds_equal.append(are_equal)

assert not all(seeds_equal)
The issue with having a different random state every time data_split.split is called is that the split will be slightly different each time:
from sklearn.model_selection import StratifiedKFold

seed_1 = make_seed_from_state(automl.data_split_seeds[0])
seed_2 = make_seed_from_state(automl.data_split_seeds[1])
split_1 = StratifiedKFold(n_splits=3, random_state=seed_1, shuffle=True)
split_2 = StratifiedKFold(n_splits=3, random_state=seed_2, shuffle=True)

# The folds produced for the first two pipelines do not line up.
for (train_index_1, test_index_1), (train_index_2, test_index_2) in zip(split_1.split(X, y), split_2.split(X, y)):
    assert not set(train_index_1) == set(train_index_2)
    assert not set(test_index_1) == set(test_index_2)
I think we should change this because it introduces more variability into the automl results than necessary and prevents a true apples-to-apples comparison between pipelines. That said, I don’t think fixing it would substantially change the results of automl search (the pipeline rankings would probably stay the same).
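To make “slightly different” concrete, here is a small sketch that measures how much of each test fold the two splitters above have in common (fold_overlap is a hypothetical helper written for this issue, not an EvalML function):

def fold_overlap(splitter_a, splitter_b, X, y):
    # Fraction of test indices shared between corresponding folds of two splitters.
    overlaps = []
    for (_, test_a), (_, test_b) in zip(splitter_a.split(X, y), splitter_b.split(X, y)):
        overlaps.append(len(set(test_a) & set(test_b)) / len(test_a))
    return overlaps

print(fold_overlap(split_1, split_2, X, y))  # each fold shares only part of its rows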
One possible solution is to create the split class with an integer random seed rather than the np.random.RandomState that is stored in the automl state. With an integer seed, I believe the indices will be the same across repeated calls:
from sklearn.model_selection import StratifiedKFold

split_1 = StratifiedKFold(n_splits=3, random_state=10, shuffle=True)

# With an integer seed, two passes over the same splitter yield identical folds.
first_train_set = []
first_test_set = []
for (train_index_1, test_index_1) in split_1.split(X, y):
    first_train_set.append(set(train_index_1))
    first_test_set.append(set(test_index_1))

second_train_set = []
second_test_set = []
for (train_index_2, test_index_2) in split_1.split(X, y):
    second_train_set.append(set(train_index_2))
    second_test_set.append(set(test_index_2))

assert first_train_set == second_train_set
assert first_test_set == second_test_set
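For reference, a hedged sketch of what the change could look like (stored_state and data_splitter are illustrative names, not EvalML attributes): draw a single integer seed from the stored random state once, then pass that int to the splitter so every pipeline evaluation sees identical folds.

import numpy as np
from sklearn.model_selection import StratifiedKFold

stored_state = np.random.RandomState(0)      # stands in for the RandomState automl keeps
int_seed = stored_state.randint(2**31 - 1)   # drawn once, reused for every pipeline

data_splitter = StratifiedKFold(n_splits=3, shuffle=True, random_state=int_seed)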
Top GitHub Comments
@bchen1116 get_random_state returns a np.random.RandomState. I think since we create the splitter with a reference to this mutable random state, calling split for one pipeline will change the state of the random state for the next pipeline when we call split again. For that reason, the splits are slightly different for each pipeline.

@dsherry I think we may have to pass in an int as the random_state in the default splits. Using np.random.RandomState can lead to different splits in subsequent calls. For example, this will fail; however, passing in random_state=10 will pass. I think an alternative would be to create a new split instance for each pipeline, but I don’t like that better.
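The example referenced in that comment is not reproduced above; as a stand-in, here is a minimal self-contained sketch of the failing versus passing behavior it describes (illustrative code, not the original snippet):

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# Sharing one mutable RandomState: each call to split() advances the state,
# so consecutive calls shuffle differently and produce different folds.
shared_state = np.random.RandomState(0)
splitter = StratifiedKFold(n_splits=3, shuffle=True, random_state=shared_state)
folds_a = [set(test) for _, test in splitter.split(X, y)]
folds_b = [set(test) for _, test in splitter.split(X, y)]
print(folds_a == folds_b)  # almost certainly False

# With an integer seed, scikit-learn builds a fresh RandomState from the int
# on every call, so repeated calls produce identical folds.
splitter_int = StratifiedKFold(n_splits=3, shuffle=True, random_state=10)
folds_c = [set(test) for _, test in splitter_int.split(X, y)]
folds_d = [set(test) for _, test in splitter_int.split(X, y)]
print(folds_c == folds_d)  # True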