
Arrow: large memory usage, error when opening files

See original GitHub issue

I’m trying to open a rather large (14 GB) Arrow IPC stream file:

>>> import vaex
>>> df = vaex.open("of.arrow")

# Python is now using 5-6 GB RAM

>>> df.head()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/me/.local/lib/python3.8/site-packages/vaex/dataframe.py", line 3431, in head
    return self[:min(n, len(self))]
  File "/home/me/.local/lib/python3.8/site-packages/vaex/dataframe.py", line 4604, in __getitem__
    df = self.trim()
  File "/home/me/.local/lib/python3.8/site-packages/vaex/dataframe.py", line 3839, in trim
    df = self if inplace else self.copy()
  File "/home/me/.local/lib/python3.8/site-packages/vaex/dataframe.py", line 5011, in copy
    df.add_column(name, column, dtype=self._dtypes_override.get(name))
  File "/home/me/.local/lib/python3.8/site-packages/vaex/dataframe.py", line 6019, in add_column
    super(DataFrameArrays, self).add_column(name, data, dtype=dtype)
  File "/home/me/.local/lib/python3.8/site-packages/vaex/dataframe.py", line 2928, in add_column
    raise ValueError("array is of length %s, while the length of the DataFrame is %s" % (len(ar), self.length_original()))
ValueError: array is of length 206, while the length of the DataFrame is 5627352

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
maartenbreddels commented, Sep 15, 2021

Feather files can contain compressed data, and with the current implementation of how we read Feather, the data is decompressed into memory. You could try saving to the Arrow IPC format or to HDF5 instead.
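
A rough sketch of that suggestion, assuming the input is a compressed Feather file named "of.feather" (the file name and a recent vaex version are assumptions for illustration): re-export the data to an uncompressed, memory-mappable format and open that copy instead.

import vaex

df = vaex.open("of.feather")        # the current Feather reader decompresses into RAM
df.export("of.hdf5")                # vaex infers the output format from the extension
# or: df.export("of_copy.arrow")    # uncompressed Arrow IPC file

df2 = vaex.open("of.hdf5")          # this copy can be memory-mapped rather than loaded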

1 reaction
maartenbreddels commented, Jun 3, 2020

The memory usage is odd; you could try #517 if you feel like living on the edge. The next major version (or maybe sooner) will include this branch.

Read more comments on GitHub >

Top Results From Across the Web

issue with memory usage with arrow package in R
I try to use arrow as a package developed for manipulations with data over the RAM size. After reading the csv-file with read_csv_arrow...
Read more >
memory consumption question #2874 - apache/arrow - GitHub
This takes a very long time and it consumes around 50 GB memory(using top command to check memory used) and sometimes fails with...
Read more >
[Python] Why does reading an arrow file cause almost double ...
(Note that to minimize the memory usage, you should also pass use_threads=False. In that case, the maximum memory overhead should be...
Read more > (a pyarrow sketch of this suggestion follows after these results)
Memory Management — Apache Arrow v10.0.1
Arrow provides a tree-based model for memory allocation. The RootAllocator is created first, then more allocators are created as children of an existing...
Read more >
Excel 2016 not opening few xlsx files - There is not enough ...
If the error only appear to specific files, I'd recommend you right click the file you want to open, select "Properties", then click...
Read more >
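
Picking up the pyarrow suggestion from the mailing-list result above, here is a hedged sketch (the file name comes from the issue; it assumes an uncompressed IPC stream file, since compressed batches are decompressed into RAM regardless):

import pyarrow as pa

# Memory-map the file so uncompressed record batches can be read zero-copy.
with pa.memory_map("of.arrow", "r") as source:
    table = pa.ipc.open_stream(source).read_all()

# use_threads=False, mentioned in the quoted result, is accepted by to_pandas();
# the thread suggests it keeps peak memory lower during the conversion.
pdf = table.to_pandas(use_threads=False)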
