Error of random seed when using train_test_split()

I get this error when executing the following code:

import dask
import dask.array as da
import dask_ml
from dask_ml.datasets import make_regression
from dask_ml.model_selection import train_test_split

# Compare the local library versions with the ones seen by the workers.
print('Local dask.__version__: {}'.format(dask.__version__))
print('Local dask_ml.__version__: {}'.format(dask_ml.__version__))
print('Client dask.__version__: {}'.format(dask.delayed(dask.__version__).compute()))
print('Client dask_ml.__version__: {}'.format(dask.delayed(dask_ml.__version__).compute()))

Local dask.__version__: 0.16.1
Local dask_ml.__version__: 0.6.0
Client dask.__version__: 0.16.1
Client dask_ml.__version__: 0.6.0

X, y = make_regression(n_samples=10000, n_features=4, random_state=0, chunks=4)
X

dask.array<da.random.normal, shape=(10000, 4), dtype=float64, chunksize=(4, 4)>

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-51-bd652d57a653> in <module>()
----> 1 X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

~\AppData\Local\Continuum\anaconda3\envs\py36_all\lib\site-packages\dask_ml\model_selection\_split.py in train_test_split(*arrays, **options)
    276                             train_size=train_size, blockwise=blockwise,
    277                             random_state=random_state)
--> 278     train_idx, test_idx = next(splitter.split(*arrays))
    279 
    280     train_test_pairs = ((_blockwise_slice(arr, train_idx),

~\AppData\Local\Continuum\anaconda3\envs\py36_all\lib\site-packages\dask_ml\model_selection\_split.py in split(self, X, y, groups)
    137         for i in range(self.n_splits):
    138             if self.blockwise:
--> 139                 yield self._split_blockwise(X)
    140             else:
    141                 yield self._split(X)

~\AppData\Local\Continuum\anaconda3\envs\py36_all\lib\site-packages\dask_ml\model_selection\_split.py in _split_blockwise(self, X)
    144         chunks = X.chunks[0]
    145         rng = check_random_state(self.random_state)
--> 146         seeds = rng.randint(0, 2**32 - 1, size=len(chunks))
    147 
    148         train_pct, test_pct = _maybe_normalize_split_sizes(self.train_size,

mtrand.pyx in mtrand.RandomState.randint()

ValueError: high is out of bounds for int32

Issue Analytics

Created 5 years ago
Comments: 16 (8 by maintainers)

Top GitHub Comments

TomAugspurger commented, Jun 27, 2018

Thanks for the report. I assume your Python is 32 bit? We don’t do any testing with 32-bit builds.

Anyway, the bug is that

seeds = rng.randint(0, 2**32 - 1, size=len(chunks))

should be

seeds = rng.randint(0, 2**32 - 1, size=len(chunks), dtype='u8')

Any interest in making a PR to fix it? Otherwise I’ll get to it later today or tomorrow.
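The failure can be reproduced with plain NumPy, independent of dask-ml. The sketch below assumes the default integer type on the reporter's setup was 32-bit, as the traceback suggests, and forces that case with dtype='int32' so it behaves the same on any platform. The variable names are illustrative only and are not taken from the dask-ml source.

```python
import numpy as np

rng = np.random.RandomState(0)
n_chunks = 4  # stand-in for len(chunks) in _split_blockwise

# With a 32-bit integer dtype, the upper bound 2**32 - 1 cannot be represented,
# so randint refuses to draw and raises ValueError.
try:
    rng.randint(0, 2**32 - 1, size=n_chunks, dtype='int32')
    int32_error = None
except ValueError as exc:
    int32_error = str(exc)

# Requesting an unsigned 64-bit dtype ('u8' is shorthand for uint64) makes the
# same bounds valid, which is the fix suggested in the comment above.
seeds = rng.randint(0, 2**32 - 1, size=n_chunks, dtype='u8')

print(int32_error)
print(seeds.dtype, len(seeds))
```

Passing the dtype explicitly removes the dependence on the platform's default integer width, at the cost of returning unsigned 64-bit values where callers may have expected the default integer type.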

TomAugspurger commented, Mar 11, 2019

Ah, sorry I misread your earlier comment.

Dask.array’s RandomState.randint doesn’t support dtype yet. Opened https://github.com/dask/dask/issues/4579 to track that.

On Mon, Mar 11, 2019 at 10:55 AM André Christoffer Andersen wrote:

I just ran

import dask_ml
dask_ml.__version__

It reports ‘0.12.0’

The issue isn’t that dtype isn’t passed into draw_seed(…), it is that draw_seed(…) doesn’t pass a dtype into random_state.randint(…): https://github.com/dask/dask-ml/blob/d68c7fa9dbfac2b037a0caf5b4feb4940efa6a2c/dask_ml/_utils.py#L17
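A helper that forwards the dtype might look like the sketch below. This is not the dask-ml implementation: the name draw_seed_sketch and its signature are hypothetical, and it only handles NumPy's RandomState, since Dask.array's RandomState.randint is reported above not to accept dtype yet.

```python
import numpy as np

def draw_seed_sketch(random_state, low, high=None, size=None, dtype='uint64'):
    """Hypothetical helper: draw seeds and forward dtype to randint."""
    # Accept either an existing RandomState or anything RandomState can be seeded with.
    if not isinstance(random_state, np.random.RandomState):
        random_state = np.random.RandomState(random_state)
    # The dtype is passed through explicitly, so the valid bounds do not depend
    # on the platform's default integer width.
    return random_state.randint(low, high, size=size, dtype=dtype)

first = draw_seed_sketch(0, 0, 2**32 - 1, size=3)
second = draw_seed_sketch(0, 0, 2**32 - 1, size=3)
print(first.dtype, bool((first == second).all()))
```

Seeding from the same integer gives the same draws, so the reproducibility that random_state=0 is meant to provide is kept.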

