
MaxRowsError for pandas.df with > 5000 rows


Hey,

Thanks for the package; I'm very keen to try it out on my own data. When I create a simple histogram, though, Vega-Lite fails on DataFrames with more than 5,000 rows. Here's a minimal reproducible example:

import altair as alt
import numpy as np
import pandas as pd

# 6,000 rows, comfortably over the default 5,000-row limit
lengths = np.random.randint(0, 2000, 6000)
lengths_list = lengths.tolist()
labels = [str(i) for i in lengths_list]
peak_lengths = pd.DataFrame.from_dict({'coords': labels, 'length': lengths_list},
                                      orient='columns')
alt.Chart(peak_lengths).mark_bar().encode(alt.X('length:Q', bin=True), y='count(*):Q')

Here’s the error:

---------------------------------------------------------------------------
MaxRowsError                              Traceback (most recent call last)
~/anaconda/envs/py3/lib/python3.5/site-packages/altair/vegalite/v2/api.py in to_dict(self, *args, **kwargs)
    259         copy = self.copy()
    260         original_data = getattr(copy, 'data', Undefined)
--> 261         copy._prepare_data()
    262 
    263         # We make use of two context markers:

~/anaconda/envs/py3/lib/python3.5/site-packages/altair/vegalite/v2/api.py in _prepare_data(self)
    251             pass
    252         elif isinstance(self.data, pd.DataFrame):
--> 253             self.data = pipe(self.data, data_transformers.get())
    254         elif isinstance(self.data, six.string_types):
    255             self.data = core.UrlData(self.data)

~/anaconda/envs/py3/lib/python3.5/site-packages/toolz/functoolz.py in pipe(data, *funcs)
    550     """
    551     for func in funcs:
--> 552         data = func(data)
    553     return data
    554 

~/anaconda/envs/py3/lib/python3.5/site-packages/toolz/functoolz.py in __call__(self, *args, **kwargs)
    281     def __call__(self, *args, **kwargs):
    282         try:
--> 283             return self._partial(*args, **kwargs)
    284         except TypeError as exc:
    285             if self._should_curry(args, kwargs, exc):

~/anaconda/envs/py3/lib/python3.5/site-packages/altair/vegalite/data.py in default_data_transformer(data)
    122 @curry
    123 def default_data_transformer(data):
--> 124     return pipe(data, limit_rows, to_values)
    125 
    126 

~/anaconda/envs/py3/lib/python3.5/site-packages/toolz/functoolz.py in pipe(data, *funcs)
    550     """
    551     for func in funcs:
--> 552         data = func(data)
    553     return data
    554 

~/anaconda/envs/py3/lib/python3.5/site-packages/toolz/functoolz.py in __call__(self, *args, **kwargs)
    281     def __call__(self, *args, **kwargs):
    282         try:
--> 283             return self._partial(*args, **kwargs)
    284         except TypeError as exc:
    285             if self._should_curry(args, kwargs, exc):

~/anaconda/envs/py3/lib/python3.5/site-packages/altair/vegalite/data.py in limit_rows(data, max_rows)
     47             return data
     48     if len(values) > max_rows:
---> 49         raise MaxRowsError('The number of rows in your dataset is greater than the max of {}'.format(max_rows))
     50     return data
     51 

MaxRowsError: The number of rows in your dataset is greater than the max of 5000

A quick issues search didn't turn up any hits for MaxRowsError. There is a related issue (#287), but that one involved a dataset with more than 300k rows, and mine is a measly 35k. Also, the FAQ link referenced in that issue now returns a 404. In the meantime, does the advice in #249 still apply?
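If it does, I'd guess a sketch like the following would work: write the frame to disk and hand Altair the path, so the data takes the core.UrlData branch shown in the traceback instead of going through limit_rows. This is untested on my setup, and the filename is just an example:

import altair as alt
import numpy as np
import pandas as pd

# Untested workaround sketch: serialize the DataFrame and pass a path/URL,
# which Altair wraps in core.UrlData instead of running through limit_rows.
df = pd.DataFrame({'length': np.random.randint(0, 2000, 6000)})
df.to_json('peak_lengths.json', orient='records')

alt.Chart('peak_lengths.json').mark_bar().encode(
    alt.X('length:Q', bin=True),
    y='count(*):Q',
)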

Package info: Running on Altair 2.0.0rc1, JupyterLab 0.31.12-py35_1 conda-forge

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 39 (21 by maintainers)

Top GitHub Comments

3 reactions
palewire commented, Mar 23, 2018

+1 to documenting how to work around these limits and to including a link to those docs in the error message. I think that could go a long way toward easing newbie frustration.

2 reactions
jakevdp commented, Mar 28, 2018

I just opened #672, which would let users run alt.data_transformers.enable('no_max_rows') and embed arbitrarily large datasets in the notebook, if that is what they wish to do.

@ellisonbg, do you think we should offer that option?
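For concreteness, usage under that proposal would presumably look like the sketch below. The registration name 'no_max_rows' is taken from the comment above and could change before release; the DataFrame here is stand-in data:

import altair as alt
import numpy as np
import pandas as pd

# Proposed in #672: turn off the row-count safeguard so the full dataset is
# embedded in the notebook output. Name taken from the comment above; the
# final API may differ.
alt.data_transformers.enable('no_max_rows')

df = pd.DataFrame({'length': np.random.randint(0, 2000, 6000)})
alt.Chart(df).mark_bar().encode(alt.X('length:Q', bin=True), y='count(*):Q')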


Top Results From Across the Web

  • Frequently Asked Questions — Altair 4.2.0 documentation: "MaxRowsError: The number of rows in your dataset is greater than the maximum allowed (5000). For information on how to plot larger..." (see the sketch after this list)
  • Using Altair on data aggregated from large datasets: "I'm learning that vegalite stores the entire underlying data and applies the transformation at run time, but it seems like altair could (and ..."
  • How to get first N rows of Pandas DataFrame?: "To get the first N rows of a DataFrame in Pandas, use the function DataFrame.head(). You can pass an optional integer that represents..."
  • How to get the first N rows in Pandas DataFrame - Data to Fish: "You can use df.head() to get the first N rows in Pandas DataFrame. For example, if you need the first 4 rows, then..."
  • Visualizing real estate prices with Altair | Haowen's AI blog: "MaxRowsError is the first trouble I got! It turns out that by default Altair only allows you to plot a dataset with a..."
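As a companion to the FAQ entry above, here is a minimal sketch using Altair's 'json' data transformer, which writes the data to a file and references it by URL so the embedding limit no longer applies. This assumes the transformer names registered in later Altair releases:

import altair as alt
import numpy as np
import pandas as pd

# The 'json' data transformer serializes the DataFrame to a .json file next
# to the notebook; the chart spec then stores only a URL, so the 5,000-row
# embedding limit does not apply.
alt.data_transformers.enable('json')

df = pd.DataFrame({'length': np.random.randint(0, 2000, 6000)})
alt.Chart(df).mark_bar().encode(alt.X('length:Q', bin=True), y='count(*):Q')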
