
How to reduce memory usage when loading large parquet dataset?


Describe the bug
I am trying to load a dataset of 200 parquet files (≈11 GB in total) from S3 and convert it into a DataFrame. Note that I only use a small subset of columns, so most of the data is redundant. When I download the files manually, load them one by one using pd.read_parquet, and merge them using pd.concat, the program uses ≈12 GB of RAM. However, when I try doing the same using wrangler, I can see it using up to ≈30 GB of RAM before it errors with

  File "/Users/jannikbertram/opt/miniconda3/envs/wit/lib/python3.8/site-packages/pyarrow/parquet.py", line 270, in read_row_group
    return self.reader.read_row_group(i, column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1079, in pyarrow._parquet.ParquetReader.read_row_group
  File "pyarrow/_parquet.pyx", line 1098, in pyarrow._parquet.ParquetReader.read_row_groups
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: IOError: ("Connection broken: ConnectionResetError(54, 'Connection reset by peer')", ConnectionResetError(54, 'Connection reset by peer')). Detail: Python exception: ProtocolError

To Reproduce
I installed wrangler using pip within a miniconda environment, Python version 3.8.2, running on macOS Catalina.

import awswrangler as wr
import pandas as pd

dfs = wr.s3.read_parquet(
    s3_path_to_folder_with_parquet_files,
    dataset=True,
    columns=["column1", "column2", "column3"],
    chunked=True,
)
df = pd.concat(dfs)

To reproduce this, you will obviously also need a dataset of similar size.

Can somebody explain what’s happening here? Perhaps it’s just a mistake in my setup, or this behaviour is intentional?
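For context on why the snippet above uses so much memory: collecting a chunked reader with pd.concat materializes every chunk at once, which cancels the benefit of chunked=True. A minimal sketch of the two patterns, using a hypothetical in-memory generator in place of wr.s3.read_parquet:

```python
import numpy as np
import pandas as pd

def chunked_reader(n_chunks=5, rows=1000):
    """Stand-in for wr.s3.read_parquet(chunked=True): yields one DataFrame at a time."""
    for _ in range(n_chunks):
        yield pd.DataFrame({"column1": np.arange(rows),
                            "column2": np.arange(rows) * 1.5})

# Anti-pattern: pd.concat pulls every chunk into memory simultaneously,
# so peak usage is the whole dataset plus concat's temporary copy.
df_all = pd.concat(chunked_reader(), ignore_index=True)

# Streaming pattern: only one chunk is resident at a time; reduce each
# chunk to a small result (an aggregate, a filtered write, etc.) before
# moving to the next.
totals = sum(chunk["column2"].sum() for chunk in chunked_reader())
```

The first pattern is what the reproduction code does; the second is what chunked=True is designed for.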

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
igorborgest commented, Aug 26, 2020

@janbe-ts

> Based on this line in the code, I expected chunked=True to make it more memory-efficient. Thanks for pointing out that this is not the case in my setup 😃

I will try to improve this explanation to make things clearer. But chunked=True will only help you save memory if you process each chunk one at a time.

e.g.

for df in wr.s3.read_parquet(path, dataset=True, columns=["lat1"], chunked=True):
    # Do whatever you want with df, e.g. write each processed chunk back out
    wr.s3.to_parquet(...)

> Also, I wonder why the Activity Monitor shows me a completely different amount of memory used by the program

Hmm… I also don’t know the reason for the difference, but let me know if you find out.


P.S. If you want to save memory, you should also consider the categories argument; it helps in some cases.

0 reactions
janbe-ts commented, Aug 26, 2020

Hi @igorborgest, thanks for your quick and detailed reply!

> I was not able to reproduce this issue, could you double check you are really running both scenarios isolated and exactly with the same data?

You were right, I had a slightly different setup locally. I used fastparquet as the parquet engine and accessed a nested column field a.b.c rather than a.b as in my wrangler setup (I didn’t expect it to make such a big difference). I can confirm that the memory usage is almost equal with the same columns and the pyarrow engine.

> Btw, why are you using chunked=True in this case? Is it only for the sake of this troubleshooting?

Based on this line in the code, I expected chunked=True to make it more memory-efficient. Thanks for pointing out that this is not the case in my setup 😃

Also, I wonder why the Activity Monitor shows me a completely different amount of memory used by the program: [screenshot: Activity Monitor, 2020-08-26]

Anyway, as this is certainly not a memory leak, I will close this issue!
