
memory error working with big datasets


Hi! I just started using altair today. I'm trying to make a pretty simple histogram on a large dataset using:

alt.Chart(df).mark_bar().encode(
    x=alt.X('proba:Q', bin=True),
    y='count()',
)
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
~/repo/.venv/lib/python3.7/site-packages/altair/vegalite/v3/api.py in to_dict(self, *args, **kwargs)
    363         copy = self.copy(deep=False)
    364         original_data = getattr(copy, 'data', Undefined)
--> 365         copy.data = _prepare_data(original_data, context)
    366 
    367         if original_data is not Undefined:

~/repo/.venv/lib/python3.7/site-packages/altair/vegalite/v3/api.py in _prepare_data(data, context)
     82     # convert dataframes to dict
     83     if isinstance(data, pd.DataFrame):
---> 84         data = pipe(data, data_transformers.get())
     85 
     86     # convert string input to a URLData

~/repo/.venv/lib/python3.7/site-packages/toolz/functoolz.py in pipe(data, *funcs)
    632     """
    633     for func in funcs:
--> 634         data = func(data)
    635     return data
    636 

~/repo/.venv/lib/python3.7/site-packages/toolz/functoolz.py in __call__(self, *args, **kwargs)
    301     def __call__(self, *args, **kwargs):
    302         try:
--> 303             return self._partial(*args, **kwargs)
    304         except TypeError as exc:
    305             if self._should_curry(args, kwargs, exc):

~/repo/.venv/lib/python3.7/site-packages/altair/utils/data.py in to_json(data, prefix, extension, filename)
     94     Write the data model to a .json file and return a url based data model.
     95     """
---> 96     data_json = _data_to_json_string(data)
     97     data_hash = _compute_data_hash(data_json)
     98     filename = filename.format(prefix=prefix, hash=data_hash,

~/repo/.venv/lib/python3.7/site-packages/altair/utils/data.py in _data_to_json_string(data)
    153     check_data_type(data)
    154     if isinstance(data, pd.DataFrame):
--> 155         data = sanitize_dataframe(data)
    156         return data.to_json(orient='records')
    157     elif isinstance(data, dict):

~/repo/.venv/lib/python3.7/site-packages/altair/utils/core.py in sanitize_dataframe(df)
    175             col = df[col_name]
    176             bad_values = col.isnull() | np.isinf(col)
--> 177             df[col_name] = col.astype(object).where(~bad_values, None)
    178         elif dtype == object:
    179             # Convert numpy arrays saved as objects to lists

~/repo/.venv/lib/python3.7/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
   3465         else:
   3466             # set column
-> 3467             self._set_item(key, value)
   3468 
   3469     def _setitem_slice(self, key, value):

~/repo/.venv/lib/python3.7/site-packages/pandas/core/frame.py in _set_item(self, key, value)
   3543         self._ensure_valid_index(value)
   3544         value = self._sanitize_column(key, value)
-> 3545         NDFrame._set_item(self, key, value)
   3546 
   3547         # check if we are modifying a copy

~/repo/.venv/lib/python3.7/site-packages/pandas/core/generic.py in _set_item(self, key, value)
   3380 
   3381     def _set_item(self, key, value):
-> 3382         self._data.set(key, value)
   3383         self._clear_item_cache()
   3384 

~/repo/.venv/lib/python3.7/site-packages/pandas/core/internals/managers.py in set(self, item, value)
   1097                 else:
   1098                     self._blklocs[blk.mgr_locs.indexer] = -1
-> 1099                     blk.delete(blk_locs)
   1100                     self._blklocs[blk.mgr_locs.indexer] = np.arange(len(blk))
   1101 

~/repo/.venv/lib/python3.7/site-packages/pandas/core/internals/blocks.py in delete(self, loc)
    382         Delete given loc(-s) from block in-place.
    383         """
--> 384         self.values = np.delete(self.values, loc, 0)
    385         self.mgr_locs = self.mgr_locs.delete(loc)
    386 

<__array_function__ internals> in delete(*args, **kwargs)

~/repo/.venv/lib/python3.7/site-packages/numpy/lib/function_base.py in delete(arr, obj, axis)
   4422         keep[obj, ] = False
   4423         slobj[axis] = keep
-> 4424         new = arr[tuple(slobj)]
   4425 
   4426     if wrap:

MemoryError: Unable to allocate array with shape (375, 24822196) and data type float64

The dimensions of df are:

  • df.shape: (24822196, 378)
  • dtypes: almost all float64, except for two ID columns
  • df.memory_usage(deep=True).sum() / 1024 ** 3: 73.10018550418317 GB

Is there a way to disable sanitize_dataframe? It seems like that is creating a very wide dataframe even though I just need to access that one column.

Or do you think I’m using the wrong library for this size of data? I’m using alt.data_transformers.enable('json').

For reference, if I just run df.proba.hist() it takes about 689 milliseconds to create a histogram using pandas/matplotlib.

Thanks!
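For reference, a minimal sketch of the subsetting hinted at in the question, assuming the chart only needs the proba column; this is not from the original issue, and whether it avoids the MemoryError is an assumption, since the ~24.8M rows would still be serialized:

import altair as alt

# Keep only the column the chart actually uses before handing it to Altair,
# so sanitize_dataframe copies one column instead of all 378.
alt.Chart(df[['proba']]).mark_bar().encode(
    x=alt.X('proba:Q', bin=True),
    y='count()',
)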

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
jakevdp commented, Aug 20, 2019

You should not expect to be able to use Altair with data that has 24,000,000 entries. My rule of thumb is that anything more than about 10,000 rows is too large. I’d suggest using pandas, matplotlib, or another plotting platform that aggregates data in the client rather than in the renderer.
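A minimal sketch of the pre-aggregation approach suggested in this comment, assuming the column of interest is proba as in the question and that 20 bins is enough resolution: bin the data with numpy/pandas first, then hand only the small summary table to Altair.

import numpy as np
import pandas as pd
import altair as alt

# Aggregate the ~25M values down to a handful of bins on the Python side.
counts, edges = np.histogram(df['proba'].dropna(), bins=20)
binned = pd.DataFrame({
    'bin_start': edges[:-1],   # left edge of each bin
    'bin_end': edges[1:],      # right edge of each bin
    'count': counts,
})

# Altair now only has to serialize ~20 rows instead of the full dataframe.
alt.Chart(binned).mark_bar().encode(
    x=alt.X('bin_start:Q', title='proba'),
    x2='bin_end:Q',
    y='count:Q',
)

With the aggregation done in pandas/numpy, the chart spec stays tiny no matter how many rows the original dataframe has.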

0 reactions
kaisengit commented, Aug 20, 2019

Thank you, that is great to hear! It's probably a long way off, but it'd be great to be able to use altair for bigger data sets as well.

