
train_test_split fails for too many values (32bit only)

See original GitHub issue

Consider the following code:

import numpy as np
from sklearn.model_selection import train_test_split

n = 10000
y = np.random.randint(0, 2, size=n)

y_train, y_test = train_test_split(y, train_size=int(n/2),
                                   test_size=int(n/2), stratify=y, random_state=123)

print('num train: {}'.format(len(y_train)))
print('train mean: {}'.format(y_train.mean()))
print('num test: {}'.format(len(y_test)))
print('test mean: {}'.format(y_test.mean()))

When n=10,000, I obtain the expected result:

num train: 5000
train mean: 0.4958
num test: 5000
test mean: 0.4958

But for larger n, such as n=100,000, I get the following error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/home/scott/Development/scratch/sklearn/stratify.py in <module>()
      6 y = np.random.randint(0, 2, size=n)
      7 
----> 8 y_train, y_test = train_test_split(y, train_size=n/2, test_size=n/2, stratify=y, random_state=123)
      9 
     10 print 'num train: {}'.format(len(y_train))

/home/scott/anaconda/lib/python2.7/site-packages/sklearn/model_selection/_split.pyc in train_test_split(*arrays, **options)
   1700     train, test = next(cv.split(X=arrays[0], y=stratify))
   1701     return list(chain.from_iterable((safe_indexing(a, train),
-> 1702                                      safe_indexing(a, test)) for a in arrays))
   1703 
   1704 

/home/scott/anaconda/lib/python2.7/site-packages/sklearn/model_selection/_split.pyc in <genexpr>((a,))
   1700     train, test = next(cv.split(X=arrays[0], y=stratify))
   1701     return list(chain.from_iterable((safe_indexing(a, train),
-> 1702                                      safe_indexing(a, test)) for a in arrays))
   1703 
   1704 

/home/scott/anaconda/lib/python2.7/site-packages/sklearn/utils/__init__.pyc in safe_indexing(X, indices)
    110             return X.take(indices, axis=0)
    111         else:
--> 112             return X[indices]
    113     else:
    114         return [X[idx] for idx in indices]

IndexError: arrays used as indices must be of integer (or boolean) type

And for n=1,000,000, I don’t get an exception; instead I get these strange results:

num train: 1785
train mean: 0.414565826331
num test: 894
test mean: 0.414988814318

Why is this? Is this a bug? Does train_test_split fail with too many values?
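The error message itself points at the dtype of the index arrays: NumPy only accepts integer (or boolean) arrays as fancy indices, so if the train/test index arrays computed inside `train_test_split` somehow end up with a float dtype (e.g. after an overflow or a float-producing arithmetic step on a 32-bit build), indexing raises exactly this error. A minimal sketch reproducing the message, independent of scikit-learn:

```python
import numpy as np

x = np.arange(10)

good_idx = np.array([1, 3, 5])           # integer dtype: valid fancy index
print(x[good_idx])                       # → [1 3 5]

bad_idx = good_idx.astype(np.float64)    # same values, but float dtype
try:
    x[bad_idx]
except IndexError as e:
    # → arrays used as indices must be of integer (or boolean) type
    print(e)
```

This doesn't explain *why* the indices turn float only on 32-bit systems at certain sizes, but it does show the failure is about index dtype rather than array length per se.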

Issue Analytics

  • State: open
  • Created 6 years ago
  • Comments:10 (6 by maintainers)

Top GitHub Comments

1 reaction
aqua4 commented, Apr 18, 2017

Hello! Your code returns correct results without any errors on my system. I’m using: Python 2.7.12, sklearn 0.18.1, numpy 1.11.3.

0 reactions
amueller commented, May 21, 2018

Could it be an issue in

        class_indices = np.split(np.argsort(y_indices, kind='mergesort'),
                                 np.cumsum(class_counts)[:-1])

maybe converting class_counts to float will fix it? (this code is no fun)
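For context, the expression amueller quotes groups sample indices by class: a stable `argsort` orders the indices class by class, and `np.cumsum(class_counts)[:-1]` gives the boundaries at which to split that ordering. A small self-contained sketch (computing `class_counts` with `np.bincount` is my assumption here, matching how the labels would typically be counted):

```python
import numpy as np

y_indices = np.array([1, 0, 1, 1, 0, 2])   # class label of each sample
class_counts = np.bincount(y_indices)      # → [2 3 1]

# Stable mergesort argsort lists indices of class 0 first, then class 1,
# then class 2; the cumulative counts mark the split points between classes.
class_indices = np.split(np.argsort(y_indices, kind='mergesort'),
                         np.cumsum(class_counts)[:-1])
print(class_indices)   # → [array([1, 4]), array([0, 2, 3]), array([5])]
```

If `class_counts` (and hence the cumulative split points) went wrong on a 32-bit build, the per-class index arrays would be mis-sized, which could plausibly produce the skewed split sizes reported above — though that remains speculation, in line with the comment.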

Read more comments on GitHub >
