train_test_split fails for too many values (32-bit only)
Consider the following code:
import numpy as np
from sklearn.model_selection import train_test_split
n = 10000
y = np.random.randint(0, 2, size=n)
y_train, y_test = train_test_split(y, train_size=int(n/2),
                                   test_size=int(n/2), stratify=y,
                                   random_state=123)
print('num train: {}'.format(len(y_train)))
print('train mean: {}'.format(y_train.mean()))
print('num test: {}'.format(len(y_test)))
print('test mean: {}'.format(y_test.mean()))
When n=10,000, I correctly obtain:
num train: 5000
train mean: 0.4958
num test: 5000
test mean: 0.4958
But for larger n, such as n=100,000, I get the following error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
/home/scott/Development/scratch/sklearn/stratify.py in <module>()
6 y = np.random.randint(0, 2, size=n)
7
----> 8 y_train, y_test = train_test_split(y, train_size=n/2, test_size=n/2, stratify=y, random_state=123)
9
10 print 'num train: {}'.format(len(y_train))
/home/scott/anaconda/lib/python2.7/site-packages/sklearn/model_selection/_split.pyc in train_test_split(*arrays, **options)
1700 train, test = next(cv.split(X=arrays[0], y=stratify))
1701 return list(chain.from_iterable((safe_indexing(a, train),
-> 1702 safe_indexing(a, test)) for a in arrays))
1703
1704
/home/scott/anaconda/lib/python2.7/site-packages/sklearn/model_selection/_split.pyc in <genexpr>((a,))
1700 train, test = next(cv.split(X=arrays[0], y=stratify))
1701 return list(chain.from_iterable((safe_indexing(a, train),
-> 1702 safe_indexing(a, test)) for a in arrays))
1703
1704
/home/scott/anaconda/lib/python2.7/site-packages/sklearn/utils/__init__.pyc in safe_indexing(X, indices)
110 return X.take(indices, axis=0)
111 else:
--> 112 return X[indices]
113 else:
114 return [X[idx] for idx in indices]
IndexError: arrays used as indices must be of integer (or boolean) type
And for n=1,000,000, I don't get an exception; instead I get these strange results:
num train: 1785
train mean: 0.414565826331
num test: 894
test mean: 0.414988814318
Why is this? Is this a bug? Does train_test_split fail with too many values?
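Until the cause is tracked down, one stopgap is to do the stratified split by hand so that only integer index arrays are ever constructed, sidestepping whatever float arithmetic goes wrong inside the library on 32-bit builds. This is a sketch, not scikit-learn's implementation; `stratified_split` is a hypothetical helper:

```python
import numpy as np

def stratified_split(y, train_frac=0.5, seed=123):
    # Manual stratified split: shuffle the indices of each class separately,
    # then take an explicit integer count from each for the training set.
    rng = np.random.RandomState(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)      # integer positions of this class
        rng.shuffle(idx)
        n_train = int(train_frac * len(idx))  # force an int, never a float
        train_idx.append(idx[:n_train])
        test_idx.append(idx[n_train:])
    return np.concatenate(train_idx), np.concatenate(test_idx)

n = 100000
y = np.random.randint(0, 2, size=n)
tr, te = stratified_split(y)
print('num train: {}'.format(len(tr)))
print('train mean: {}'.format(y[tr].mean()))
print('num test: {}'.format(len(te)))
print('test mean: {}'.format(y[te].mean()))
```

Because each class is split at the same fraction, the train and test means stay close to the overall class balance regardless of n.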
Issue Analytics
- Created: 6 years ago
- Comments: 10 (6 by maintainers)
Hello! Your code returns correct results without any errors on my system. I'm using: python 2.7.12, sklearn 0.18.1, numpy 1.11.3.
Could it be an issue in
Maybe converting class_counts to float will fix it? (This code is no fun.)
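For what it's worth, the IndexError in the traceback above is NumPy's generic complaint about non-integer fancy indexing, which is consistent with a float creeping into the index computation somewhere (e.g. class counts or split sizes ending up float-typed on 32-bit). A minimal sketch of the symptom, independent of scikit-learn:

```python
import numpy as np

a = np.arange(10)
idx = np.array([2.0, 5.0])  # float dtype, as if produced by float division

# Fancy indexing with a float array raises the same IndexError
# seen in the traceback above.
try:
    a[idx]
except IndexError as e:
    print(e)

# Casting the indices back to an integer dtype makes the lookup valid.
print(a[idx.astype(np.intp)])
```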