Error of random seed when using train_test_split()
I get this error when executing the following code:
import dask
import dask.array as da
import dask_ml
from dask_ml.datasets import make_regression
from dask_ml.model_selection import train_test_split
print('Local dask.__version__: {}'.format(dask.__version__))
print('Local dask_ml.__version__: {}'.format(dask_ml.__version__))
print('Client dask.__version__: {}'.format(dask.delayed(dask.__version__).compute()))
print('Client dask_ml.__version__: {}'.format(dask.delayed(dask_ml.__version__).compute()))
Local dask.__version__: 0.16.1
Local dask_ml.__version__: 0.6.0
Client dask.__version__: 0.16.1
Client dask_ml.__version__: 0.6.0
X, y = make_regression(n_samples=10000, n_features=4, random_state=0, chunks=4)
X
dask.array<da.random.normal, shape=(10000, 4), dtype=float64, chunksize=(4, 4)>
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-51-bd652d57a653> in <module>()
----> 1 X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
~\AppData\Local\Continuum\anaconda3\envs\py36_all\lib\site-packages\dask_ml\model_selection\_split.py in train_test_split(*arrays, **options)
276 train_size=train_size, blockwise=blockwise,
277 random_state=random_state)
--> 278 train_idx, test_idx = next(splitter.split(*arrays))
279
280 train_test_pairs = ((_blockwise_slice(arr, train_idx),
~\AppData\Local\Continuum\anaconda3\envs\py36_all\lib\site-packages\dask_ml\model_selection\_split.py in split(self, X, y, groups)
137 for i in range(self.n_splits):
138 if self.blockwise:
--> 139 yield self._split_blockwise(X)
140 else:
141 yield self._split(X)
~\AppData\Local\Continuum\anaconda3\envs\py36_all\lib\site-packages\dask_ml\model_selection\_split.py in _split_blockwise(self, X)
144 chunks = X.chunks[0]
145 rng = check_random_state(self.random_state)
--> 146 seeds = rng.randint(0, 2**32 - 1, size=len(chunks))
147
148 train_pct, test_pct = _maybe_normalize_split_sizes(self.train_size,
mtrand.pyx in mtrand.RandomState.randint()
ValueError: high is out of bounds for int32
Top GitHub Comments
Thanks for the report. I assume your Python is 32 bit? We don’t do any testing with 32-bit builds.
Anyway, the bug is that
seeds = rng.randint(0, 2**32 - 1, size=len(chunks))
should be
Any interest in making a PR to fix it? Otherwise I’ll get to it later today or tomorrow.
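For illustration, the failure and one possible fix can be sketched with plain NumPy. This is a minimal sketch, not necessarily the exact line the maintainer proposed: the `dtype="uint64"` argument and the `n_chunks` value are assumptions. It relies on `RandomState.randint` drawing from the platform's default integer type when no `dtype` is given, which is 32 bits wide on some builds (32-bit Python, and possibly Windows with older NumPy releases), so an upper bound of `2**32 - 1` does not fit.

```python
import numpy as np

rng = np.random.RandomState(0)
n_chunks = 4  # hypothetical stand-in for len(chunks) in _split_blockwise

# Forcing a 32-bit dtype reproduces the reported error on any platform,
# because 2**32 - 1 is larger than the biggest int32 value.
try:
    rng.randint(0, 2**32 - 1, size=n_chunks, dtype="int32")
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# Assumed fix: request a 64-bit unsigned dtype explicitly so the bound
# fits regardless of the platform's default integer width.
seeds = rng.randint(0, 2**32 - 1, size=n_chunks, dtype="uint64")
print(seeds.dtype, seeds.shape)
```

Passing the dtype explicitly keeps the seed range the same on every platform, which matters if the resulting splits are meant to be reproducible across machines.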
Ah, sorry I misread your earlier comment.
Dask.array’s RandomState.randint doesn’t support dtype yet. Opened https://github.com/dask/dask/issues/4579 to track that.