Error of random seed when using train_test_split()

I get this error when executing the following code:

import dask
import dask_ml
import dask.array as da
from dask_ml.datasets import make_regression
from dask_ml.model_selection import train_test_split

print('Local dask.__version__: {}'.format(dask.__version__))
print('Local dask_ml.__version__: {}'.format(dask_ml.__version__))
print('Client dask.__version__: {}'.format(dask.delayed(dask.__version__).compute()))
print('Client dask_ml.__version__: {}'.format(dask.delayed(dask_ml.__version__).compute()))

Local dask.__version__: 0.16.1
Local dask_ml.__version__: 0.6.0
Client dask.__version__: 0.16.1
Client dask_ml.__version__: 0.6.0

X, y = make_regression(n_samples=10000, n_features=4, random_state=0, chunks=4)
X

dask.array<da.random.normal, shape=(10000, 4), dtype=float64, chunksize=(4, 4)>

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-51-bd652d57a653> in <module>()
----> 1 X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

~\AppData\Local\Continuum\anaconda3\envs\py36_all\lib\site-packages\dask_ml\model_selection\_split.py in train_test_split(*arrays, **options)
    276                             train_size=train_size, blockwise=blockwise,
    277                             random_state=random_state)
--> 278     train_idx, test_idx = next(splitter.split(*arrays))
    279 
    280     train_test_pairs = ((_blockwise_slice(arr, train_idx),

~\AppData\Local\Continuum\anaconda3\envs\py36_all\lib\site-packages\dask_ml\model_selection\_split.py in split(self, X, y, groups)
    137         for i in range(self.n_splits):
    138             if self.blockwise:
--> 139                 yield self._split_blockwise(X)
    140             else:
    141                 yield self._split(X)

~\AppData\Local\Continuum\anaconda3\envs\py36_all\lib\site-packages\dask_ml\model_selection\_split.py in _split_blockwise(self, X)
    144         chunks = X.chunks[0]
    145         rng = check_random_state(self.random_state)
--> 146         seeds = rng.randint(0, 2**32 - 1, size=len(chunks))
    147 
    148         train_pct, test_pct = _maybe_normalize_split_sizes(self.train_size,

mtrand.pyx in mtrand.RandomState.randint()

ValueError: high is out of bounds for int32
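
The failing call can be reproduced without Dask. The following is a minimal sketch, assuming only NumPy. Requesting int32 explicitly mimics a platform where NumPy's default integer is 32 bits wide (for example a 32-bit Python build, or Windows before NumPy 2.0). The bounds check then fails because the upper bound 2**32 - 1 exceeds the int32 maximum of 2**31 - 1. The helper name try_draw is made up for this illustration.

```python
import numpy as np

rng = np.random.RandomState(0)


def try_draw(dtype):
    """Return the error message from randint, or None if the draw succeeds."""
    try:
        # Same bounds as the call in _split_blockwise from the traceback.
        rng.randint(0, 2**32 - 1, size=4, dtype=dtype)
    except ValueError as exc:
        return str(exc)
    return None


int32_error = try_draw("int32")  # 32-bit integers cannot hold the upper bound
int64_error = try_draw("int64")  # 64-bit integers are wide enough
print("int32:", int32_error)
print("int64:", int64_error)
```

If the int32 draw reports an out-of-bounds error while the int64 draw succeeds, the problem is the integer width used for the seeds, not the data passed to train_test_split().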

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 16 (8 by maintainers)

Top GitHub Comments

TomAugspurger commented, Jun 27, 2018

Thanks for the report. I assume your Python is 32 bit? We don’t do any testing with 32-bit builds.

Anyway, the bug is that

seeds = rng.randint(0, 2**32 - 1, size=len(chunks))

should be

seeds = rng.randint(0, 2**32 - 1, size=len(chunks), dtype='u8')

Any interest in making a PR to fix it? Otherwise I’ll get to it later today or tomorrow.
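
As a rough check of the suggested change, the sketch below runs the corrected call against a plain NumPy RandomState. The chunks tuple is a made-up stand-in for X.chunks[0], and this is not the dask-ml source itself.

```python
import numpy as np

# Hypothetical row-chunk sizes standing in for X.chunks[0].
chunks = (4, 4, 2)
rng = np.random.RandomState(0)

# An explicit unsigned 64-bit dtype removes the dependence on the
# platform's default integer width.
seeds = rng.randint(0, 2**32 - 1, size=len(chunks), dtype="u8")

print(seeds.dtype)  # uint64
print(len(seeds))   # one seed per chunk
```

Each seed still falls in [0, 2**32 - 1), so it remains usable wherever a 32-bit seed is expected.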

TomAugspurger commented, Mar 11, 2019

Ah, sorry I misread your earlier comment.

Dask.array’s RandomState.randint doesn’t support dtype yet. Opened https://github.com/dask/dask/issues/4579 to track that.

On Mon, Mar 11, 2019 at 10:55 AM André Christoffer Andersen < notifications@github.com> wrote:

I just ran

import dask_ml
dask_ml.__version__

It reports ‘0.12.0’

The issue isn’t that dtype isn’t passed into draw_seed(…), it is that draw_seed(…) doesn’t pass a dtype into random_state.randint(…): https://github.com/dask/dask-ml/blob/d68c7fa9dbfac2b037a0caf5b4feb4940efa6a2c/dask_ml/_utils.py#L17
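
The point about forwarding dtype can be illustrated with a small helper. This is a hypothetical sketch, not dask-ml's actual draw_seed. The name draw_seed_sketch and its signature are assumptions, and it only handles NumPy's RandomState, because the comment above notes that dask.array's RandomState.randint did not accept dtype at the time.

```python
import numpy as np


def draw_seed_sketch(random_state, low, high=None, size=None, dtype="u8"):
    """Hypothetical seed-drawing helper (not the dask-ml implementation).

    Forwards dtype to RandomState.randint so the result does not depend
    on the platform's default integer width.
    """
    if not isinstance(random_state, np.random.RandomState):
        # Assumption: other generator types are out of scope for this sketch.
        raise TypeError("this sketch only handles numpy.random.RandomState")
    return random_state.randint(low, high, size=size, dtype=dtype)


seeds = draw_seed_sketch(np.random.RandomState(0), 0, 2**32 - 1, size=5)
print(seeds.dtype, seeds.shape)
```

Defaulting dtype inside the helper means callers that never pass it still get 64-bit seeds, which is the gap the comment describes.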

