memory error working with big datasets
Hi! I just started using Altair today and I'm trying to make a pretty simple histogram on a large dataset using:
alt.Chart(df).mark_bar().encode(
    x=alt.X('proba:Q', bin=True),
    y='count()',
)
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
~/repo/.venv/lib/python3.7/site-packages/altair/vegalite/v3/api.py in to_dict(self, *args, **kwargs)
363 copy = self.copy(deep=False)
364 original_data = getattr(copy, 'data', Undefined)
--> 365 copy.data = _prepare_data(original_data, context)
366
367 if original_data is not Undefined:
~/repo/.venv/lib/python3.7/site-packages/altair/vegalite/v3/api.py in _prepare_data(data, context)
82 # convert dataframes to dict
83 if isinstance(data, pd.DataFrame):
---> 84 data = pipe(data, data_transformers.get())
85
86 # convert string input to a URLData
~/repo/.venv/lib/python3.7/site-packages/toolz/functoolz.py in pipe(data, *funcs)
632 """
633 for func in funcs:
--> 634 data = func(data)
635 return data
636
~/repo/.venv/lib/python3.7/site-packages/toolz/functoolz.py in __call__(self, *args, **kwargs)
301 def __call__(self, *args, **kwargs):
302 try:
--> 303 return self._partial(*args, **kwargs)
304 except TypeError as exc:
305 if self._should_curry(args, kwargs, exc):
~/repo/.venv/lib/python3.7/site-packages/altair/utils/data.py in to_json(data, prefix, extension, filename)
94 Write the data model to a .json file and return a url based data model.
95 """
---> 96 data_json = _data_to_json_string(data)
97 data_hash = _compute_data_hash(data_json)
98 filename = filename.format(prefix=prefix, hash=data_hash,
~/repo/.venv/lib/python3.7/site-packages/altair/utils/data.py in _data_to_json_string(data)
153 check_data_type(data)
154 if isinstance(data, pd.DataFrame):
--> 155 data = sanitize_dataframe(data)
156 return data.to_json(orient='records')
157 elif isinstance(data, dict):
~/repo/.venv/lib/python3.7/site-packages/altair/utils/core.py in sanitize_dataframe(df)
175 col = df[col_name]
176 bad_values = col.isnull() | np.isinf(col)
--> 177 df[col_name] = col.astype(object).where(~bad_values, None)
178 elif dtype == object:
179 # Convert numpy arrays saved as objects to lists
~/repo/.venv/lib/python3.7/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
3465 else:
3466 # set column
-> 3467 self._set_item(key, value)
3468
3469 def _setitem_slice(self, key, value):
~/repo/.venv/lib/python3.7/site-packages/pandas/core/frame.py in _set_item(self, key, value)
3543 self._ensure_valid_index(value)
3544 value = self._sanitize_column(key, value)
-> 3545 NDFrame._set_item(self, key, value)
3546
3547 # check if we are modifying a copy
~/repo/.venv/lib/python3.7/site-packages/pandas/core/generic.py in _set_item(self, key, value)
3380
3381 def _set_item(self, key, value):
-> 3382 self._data.set(key, value)
3383 self._clear_item_cache()
3384
~/repo/.venv/lib/python3.7/site-packages/pandas/core/internals/managers.py in set(self, item, value)
1097 else:
1098 self._blklocs[blk.mgr_locs.indexer] = -1
-> 1099 blk.delete(blk_locs)
1100 self._blklocs[blk.mgr_locs.indexer] = np.arange(len(blk))
1101
~/repo/.venv/lib/python3.7/site-packages/pandas/core/internals/blocks.py in delete(self, loc)
382 Delete given loc(-s) from block in-place.
383 """
--> 384 self.values = np.delete(self.values, loc, 0)
385 self.mgr_locs = self.mgr_locs.delete(loc)
386
<__array_function__ internals> in delete(*args, **kwargs)
~/repo/.venv/lib/python3.7/site-packages/numpy/lib/function_base.py in delete(arr, obj, axis)
4422 keep[obj, ] = False
4423 slobj[axis] = keep
-> 4424 new = arr[tuple(slobj)]
4425
4426 if wrap:
MemoryError: Unable to allocate array with shape (375, 24822196) and data type float64
The dimensions of df are df.shape: (24822196, 378). The dtypes are almost all float64 except for two ID columns, and df.memory_usage(deep=True).sum() / 1024 ** 3 gives 73.10018550418317 GB.
Is there a way to disable sanitize_dataframe? It seems like that is creating a very wide dataframe even though I just need to access that one column. Or do you think I'm using the wrong library for this size of data? I'm using alt.data_transformers.enable('json').
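For example, would something like the sketch below be a reasonable workaround? It only hands Altair the single column the chart actually encodes, so sanitize_dataframe would not have to touch the other 377 columns (though I realize the row count stays the same, so this may just move the bottleneck):

# Pass only the one column the chart uses, so the sanitize/serialize step
# sees a single float64 column instead of all 378.
small_df = df[['proba']]

alt.Chart(small_df).mark_bar().encode(
    x=alt.X('proba:Q', bin=True),
    y='count()',
)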
For reference, if I just run df.proba.hist() it takes about 689 milliseconds to create a histogram using pandas/matplotlib.
Thanks!
Top GitHub Comments
You should not expect to be able to use Altair with data that has 24,000,000 entries. My rule of thumb is that anything more than about 10,000 rows is too large. I’d suggest using pandas, matplotlib, or another plotting platform that aggregates data in the client rather than in the renderer.
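If you do want to keep Altair for the final chart, one workaround in that spirit is to do the aggregation yourself in pandas/numpy and pass only the bin counts to Altair. A rough sketch, assuming the column of interest is proba and 50 bins is enough:

import altair as alt
import numpy as np
import pandas as pd

# Aggregate on the client: ~25 million values collapse into 50 bin counts.
counts, edges = np.histogram(df['proba'].dropna(), bins=50)
binned = pd.DataFrame({
    'bin_start': edges[:-1],
    'bin_end': edges[1:],
    'count': counts,
})

# Altair now serializes only 50 rows; bin='binned' marks the data as pre-binned.
alt.Chart(binned).mark_bar().encode(
    x=alt.X('bin_start:Q', bin='binned', title='proba'),
    x2='bin_end',
    y='count:Q',
)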
Thank you, that is good to know! It's probably a long way off, but it would be great to be able to use Altair for bigger datasets as well.