
memory error working with big datasets


Hi! I just started using altair today. I'm trying to make a pretty simple histogram on a large dataset using:

alt.Chart(df).mark_bar().encode(
    x=alt.X('proba:Q', bin=True),
    y='count()',
)
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
~/repo/.venv/lib/python3.7/site-packages/altair/vegalite/v3/api.py in to_dict(self, *args, **kwargs)
    363         copy = self.copy(deep=False)
    364         original_data = getattr(copy, 'data', Undefined)
--> 365         copy.data = _prepare_data(original_data, context)
    366 
    367         if original_data is not Undefined:

~/repo/.venv/lib/python3.7/site-packages/altair/vegalite/v3/api.py in _prepare_data(data, context)
     82     # convert dataframes to dict
     83     if isinstance(data, pd.DataFrame):
---> 84         data = pipe(data, data_transformers.get())
     85 
     86     # convert string input to a URLData

~/repo/.venv/lib/python3.7/site-packages/toolz/functoolz.py in pipe(data, *funcs)
    632     """
    633     for func in funcs:
--> 634         data = func(data)
    635     return data
    636 

~/repo/.venv/lib/python3.7/site-packages/toolz/functoolz.py in __call__(self, *args, **kwargs)
    301     def __call__(self, *args, **kwargs):
    302         try:
--> 303             return self._partial(*args, **kwargs)
    304         except TypeError as exc:
    305             if self._should_curry(args, kwargs, exc):

~/repo/.venv/lib/python3.7/site-packages/altair/utils/data.py in to_json(data, prefix, extension, filename)
     94     Write the data model to a .json file and return a url based data model.
     95     """
---> 96     data_json = _data_to_json_string(data)
     97     data_hash = _compute_data_hash(data_json)
     98     filename = filename.format(prefix=prefix, hash=data_hash,

~/repo/.venv/lib/python3.7/site-packages/altair/utils/data.py in _data_to_json_string(data)
    153     check_data_type(data)
    154     if isinstance(data, pd.DataFrame):
--> 155         data = sanitize_dataframe(data)
    156         return data.to_json(orient='records')
    157     elif isinstance(data, dict):

~/repo/.venv/lib/python3.7/site-packages/altair/utils/core.py in sanitize_dataframe(df)
    175             col = df[col_name]
    176             bad_values = col.isnull() | np.isinf(col)
--> 177             df[col_name] = col.astype(object).where(~bad_values, None)
    178         elif dtype == object:
    179             # Convert numpy arrays saved as objects to lists

~/repo/.venv/lib/python3.7/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
   3465         else:
   3466             # set column
-> 3467             self._set_item(key, value)
   3468 
   3469     def _setitem_slice(self, key, value):

~/repo/.venv/lib/python3.7/site-packages/pandas/core/frame.py in _set_item(self, key, value)
   3543         self._ensure_valid_index(value)
   3544         value = self._sanitize_column(key, value)
-> 3545         NDFrame._set_item(self, key, value)
   3546 
   3547         # check if we are modifying a copy

~/repo/.venv/lib/python3.7/site-packages/pandas/core/generic.py in _set_item(self, key, value)
   3380 
   3381     def _set_item(self, key, value):
-> 3382         self._data.set(key, value)
   3383         self._clear_item_cache()
   3384 

~/repo/.venv/lib/python3.7/site-packages/pandas/core/internals/managers.py in set(self, item, value)
   1097                 else:
   1098                     self._blklocs[blk.mgr_locs.indexer] = -1
-> 1099                     blk.delete(blk_locs)
   1100                     self._blklocs[blk.mgr_locs.indexer] = np.arange(len(blk))
   1101 

~/repo/.venv/lib/python3.7/site-packages/pandas/core/internals/blocks.py in delete(self, loc)
    382         Delete given loc(-s) from block in-place.
    383         """
--> 384         self.values = np.delete(self.values, loc, 0)
    385         self.mgr_locs = self.mgr_locs.delete(loc)
    386 

<__array_function__ internals> in delete(*args, **kwargs)

~/repo/.venv/lib/python3.7/site-packages/numpy/lib/function_base.py in delete(arr, obj, axis)
   4422         keep[obj, ] = False
   4423         slobj[axis] = keep
-> 4424         new = arr[tuple(slobj)]
   4425 
   4426     if wrap:

MemoryError: Unable to allocate array with shape (375, 24822196) and data type float64

The dimensions of df are:

  • df.shape: (24822196, 378)
  • dtypes: almost all float64, except for two ID columns
  • df.memory_usage(deep=True).sum() / 1024 ** 3: 73.10018550418317 GB

Is there a way to disable sanitize_dataframe? It seems like that is creating a very wide dataframe even though I just need to access that one column.

Or do you think I’m using the wrong library for this size of data? I’m using alt.data_transformers.enable('json').

For reference, if I just run df.proba.hist() it takes about 689 milliseconds to create a histogram using pandas/matplotlib.

Thanks!
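For reference, a minimal sketch of the subsetting hinted at in the question, assuming the chart only needs the proba column; this is not from the original issue, and whether it avoids the MemoryError is an assumption, since the ~24.8M rows would still be serialized:

import altair as alt

# Keep only the column the chart actually uses before handing it to Altair,
# so sanitize_dataframe copies one column instead of all 378.
alt.Chart(df[['proba']]).mark_bar().encode(
    x=alt.X('proba:Q', bin=True),
    y='count()',
)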

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
jakevdp commented, Aug 20, 2019

You should not expect to be able to use Altair with data that has 24,000,000 entries. My rule of thumb is that anything more than about 10,000 rows is too large. I’d suggest using pandas, matplotlib, or another plotting platform that aggregates data in the client rather than in the renderer.
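A minimal sketch of the pre-aggregation approach suggested in this comment, assuming the column of interest is proba as in the question and that 20 bins is enough resolution: bin the data with numpy/pandas first, then hand only the small summary table to Altair.

import numpy as np
import pandas as pd
import altair as alt

# Aggregate the ~25M values down to a handful of bins on the Python side.
counts, edges = np.histogram(df['proba'].dropna(), bins=20)
binned = pd.DataFrame({
    'bin_start': edges[:-1],   # left edge of each bin
    'bin_end': edges[1:],      # right edge of each bin
    'count': counts,
})

# Altair now only has to serialize ~20 rows instead of the full dataframe.
alt.Chart(binned).mark_bar().encode(
    x=alt.X('bin_start:Q', title='proba'),
    x2='bin_end:Q',
    y='count:Q',
)

With the aggregation done in pandas/numpy, the chart spec stays tiny no matter how many rows the original dataframe has.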

0 reactions
kaisengit commented, Aug 20, 2019

Thank you, that is great to hear! It's probably a long way off, but it'd be great to be able to use altair for bigger data sets as well.

