
Arrow: large memory usage, error when opening files

See original GitHub issue

I’m trying to open a rather large (14 GB) Arrow IPC stream file:

>>> import vaex
>>> df = vaex.open("of.arrow")

# Python is now using 5-6 GB RAM

>>> df.head()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/me/.local/lib/python3.8/site-packages/vaex/dataframe.py", line 3431, in head
    return self[:min(n, len(self))]
  File "/home/me/.local/lib/python3.8/site-packages/vaex/dataframe.py", line 4604, in __getitem__
    df = self.trim()
  File "/home/me/.local/lib/python3.8/site-packages/vaex/dataframe.py", line 3839, in trim
    df = self if inplace else self.copy()
  File "/home/me/.local/lib/python3.8/site-packages/vaex/dataframe.py", line 5011, in copy
    df.add_column(name, column, dtype=self._dtypes_override.get(name))
  File "/home/me/.local/lib/python3.8/site-packages/vaex/dataframe.py", line 6019, in add_column
    super(DataFrameArrays, self).add_column(name, data, dtype=dtype)
  File "/home/me/.local/lib/python3.8/site-packages/vaex/dataframe.py", line 2928, in add_column
    raise ValueError("array is of length %s, while the length of the DataFrame is %s" % (len(ar), self.length_original()))
ValueError: array is of length 206, while the length of the DataFrame is 5627352

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
maartenbreddels commented, Sep 15, 2021

Feather files can contain compressed data, and with the current implementation of how we read Feather, the data is decompressed into memory. You could try saving to the Arrow IPC format or to HDF5 instead.
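
A rough sketch of that suggestion, assuming the input is a compressed Feather file named "of.feather" (the file name and a recent vaex version are assumptions for illustration): re-export the data to an uncompressed, memory-mappable format and open that copy instead.

import vaex

df = vaex.open("of.feather")        # the current Feather reader decompresses into RAM
df.export("of.hdf5")                # vaex infers the output format from the extension
# or: df.export("of_copy.arrow")    # uncompressed Arrow IPC file

df2 = vaex.open("of.hdf5")          # this copy can be memory-mapped rather than loaded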

1 reaction
maartenbreddels commented, Jun 3, 2020

The memory usage is odd; you could try #517 if you feel like living on the edge. The next major version (or maybe sooner) will include this branch.

Read more comments on GitHub >

Top Results From Across the Web

issue with memory usage with arrow package in R
I try to use arrow as a package developed for manipulations with data over the RAM size. After reading the csv-file with read_csv_arrow...
Read more >
memory consumption question #2874 - apache/arrow - GitHub
This takes a very long time and it consumes around 50 GB memory(using top command to check memory used) and sometimes fails with...
Read more >
[Python] Why does reading an arrow file cause almost double ...
(Note that to minimize the memory usage, you should also pass use_threads=False. In that case, the maximum memory overhead should be...
Read more > (a pyarrow sketch of this suggestion follows after these results)
Memory Management — Apache Arrow v10.0.1
Arrow provides a tree-based model for memory allocation. The RootAllocator is created first, then more allocators are created as children of an existing...
Read more >
Excel 2016 not opening few xlsx files - There is not enough ...
If the error only appear to specific files, I'd recommend you right click the file you want to open, select "Properties", then click...
Read more >
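
Picking up the pyarrow suggestion from the mailing-list result above, here is a hedged sketch (the file name comes from the issue; it assumes an uncompressed IPC stream file, since compressed batches are decompressed into RAM regardless):

import pyarrow as pa

# Memory-map the file so uncompressed record batches can be read zero-copy.
with pa.memory_map("of.arrow", "r") as source:
    table = pa.ipc.open_stream(source).read_all()

# use_threads=False, mentioned in the quoted result, is accepted by to_pandas();
# the thread suggests it keeps peak memory lower during the conversion.
pdf = table.to_pandas(use_threads=False)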
