
train_test_split fails for too many values (32bit only)

See original GitHub issue

Consider the following code:

import numpy as np
from sklearn.model_selection import train_test_split

n = 10000
y = np.random.randint(0, 2, size=n)

y_train, y_test = train_test_split(y, train_size=int(n/2),
                                   test_size=int(n/2), stratify=y, random_state=123)

print('num train: {}'.format(len(y_train)))
print('train mean: {}'.format(y_train.mean()))
print('num test: {}'.format(len(y_test)))
print('test mean: {}'.format(y_test.mean()))

When n=10,000, I obtain the expected result:

num train: 5000
train mean: 0.4958
num test: 5000
test mean: 0.4958

But for larger n, such as n=100,000, I get the following error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/home/scott/Development/scratch/sklearn/stratify.py in <module>()
      6 y = np.random.randint(0, 2, size=n)
      7 
----> 8 y_train, y_test = train_test_split(y, train_size=n/2, test_size=n/2, stratify=y, random_state=123)
      9 
     10 print 'num train: {}'.format(len(y_train))

/home/scott/anaconda/lib/python2.7/site-packages/sklearn/model_selection/_split.pyc in train_test_split(*arrays, **options)
   1700     train, test = next(cv.split(X=arrays[0], y=stratify))
   1701     return list(chain.from_iterable((safe_indexing(a, train),
-> 1702                                      safe_indexing(a, test)) for a in arrays))
   1703 
   1704 

/home/scott/anaconda/lib/python2.7/site-packages/sklearn/model_selection/_split.pyc in <genexpr>((a,))
   1700     train, test = next(cv.split(X=arrays[0], y=stratify))
   1701     return list(chain.from_iterable((safe_indexing(a, train),
-> 1702                                      safe_indexing(a, test)) for a in arrays))
   1703 
   1704 

/home/scott/anaconda/lib/python2.7/site-packages/sklearn/utils/__init__.pyc in safe_indexing(X, indices)
    110             return X.take(indices, axis=0)
    111         else:
--> 112             return X[indices]
    113     else:
    114         return [X[idx] for idx in indices]

IndexError: arrays used as indices must be of integer (or boolean) type

And for n=1,000,000, I don’t get an exception; instead I get these strange results:

num train: 1785
train mean: 0.414565826331
num test: 894
test mean: 0.414988814318

Why is this? Is this a bug? Does train_test_split fail with too many values?
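The error message itself points at the dtype of the index arrays: NumPy only accepts integer (or boolean) arrays as fancy indices, so if the train/test index arrays computed inside `train_test_split` somehow end up with a float dtype (e.g. after an overflow or a float-producing arithmetic step on a 32-bit build), indexing raises exactly this error. A minimal sketch reproducing the message, independent of scikit-learn:

```python
import numpy as np

x = np.arange(10)

good_idx = np.array([1, 3, 5])           # integer dtype: valid fancy index
print(x[good_idx])                       # → [1 3 5]

bad_idx = good_idx.astype(np.float64)    # same values, but float dtype
try:
    x[bad_idx]
except IndexError as e:
    # → arrays used as indices must be of integer (or boolean) type
    print(e)
```

This doesn't explain *why* the indices turn float only on 32-bit systems at certain sizes, but it does show the failure is about index dtype rather than array length per se.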

Issue Analytics

  • State: open
  • Created 6 years ago
  • Comments:10 (6 by maintainers)

Top GitHub Comments

1 reaction
aqua4 commented, Apr 18, 2017

Hello! Your code returns correct results without any errors on my system. I’m using: Python 2.7.12, sklearn 0.18.1, numpy 1.11.3.

0 reactions
amueller commented, May 21, 2018

Could it be an issue in

        class_indices = np.split(np.argsort(y_indices, kind='mergesort'),
                                 np.cumsum(class_counts)[:-1])

maybe converting class_counts to float will fix it? (this code is no fun)
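For context, the expression amueller quotes groups sample indices by class: a stable `argsort` orders the indices class by class, and `np.cumsum(class_counts)[:-1]` gives the boundaries at which to split that ordering. A small self-contained sketch (computing `class_counts` with `np.bincount` is my assumption here, matching how the labels would typically be counted):

```python
import numpy as np

y_indices = np.array([1, 0, 1, 1, 0, 2])   # class label of each sample
class_counts = np.bincount(y_indices)      # → [2 3 1]

# Stable mergesort argsort lists indices of class 0 first, then class 1,
# then class 2; the cumulative counts mark the split points between classes.
class_indices = np.split(np.argsort(y_indices, kind='mergesort'),
                         np.cumsum(class_counts)[:-1])
print(class_indices)   # → [array([1, 4]), array([0, 2, 3]), array([5])]
```

If `class_counts` (and hence the cumulative split points) went wrong on a 32-bit build, the per-class index arrays would be mis-sized, which could plausibly produce the skewed split sizes reported above — though that remains speculation, in line with the comment.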

Read more comments on GitHub >
