How to reduce memory usage when loading a large parquet dataset?
Describe the bug
I am trying to load a dataset of 200 parquet files (≈11GB in total) from S3 and convert it into a DataFrame. Note that I only use a small subset of columns, so most of the data is redundant. When I download the data manually, load the files one by one using pd.read_parquet, and merge them using pd.concat, the program uses ≈12GB of RAM (a sketch of this baseline follows the traceback below). However, when I try doing the same using wrangler, I can see it using up to ≈30GB of RAM before it errors with:
File "/Users/jannikbertram/opt/miniconda3/envs/wit/lib/python3.8/site-packages/pyarrow/parquet.py", line 270, in read_row_group
return self.reader.read_row_group(i, column_indices=column_indices,
File "pyarrow/_parquet.pyx", line 1079, in pyarrow._parquet.ParquetReader.read_row_group
File "pyarrow/_parquet.pyx", line 1098, in pyarrow._parquet.ParquetReader.read_row_groups
File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: IOError: ("Connection broken: ConnectionResetError(54, 'Connection reset by peer')", ConnectionResetError(54, 'Connection reset by peer')). Detail: Python exception: ProtocolError
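For reference, the manual pandas baseline mentioned above looks roughly like the following sketch; the local directory name and the column subset in the read call are assumptions, since the original does not show this code.

import glob

import pandas as pd

# Read each downloaded parquet file, keeping only the needed columns,
# then merge the per-file frames into one DataFrame.
frames = [
    pd.read_parquet(path, columns=['column1', 'column2', 'column3'])
    for path in sorted(glob.glob('downloaded_parquet/*.parquet'))  # placeholder local path
]
df = pd.concat(frames, ignore_index=True)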
To Reproduce
I installed wrangler using pip within a miniconda environment, Python 3.8.2, running on macOS Catalina.
import awswrangler as wr
import pandas as pd

# Read only the needed columns; chunked=True yields the data as a generator of DataFrames
dfs = wr.s3.read_parquet(
    s3_path_to_folder_with_parquet_files,  # e.g. "s3://bucket/prefix/"
    dataset=True,
    columns=['column1', 'column2', 'column3'],
    chunked=True
)
df = pd.concat(dfs)
To reproduce, obviously, you also need a dataset of similar size.
Can somebody explain what’s happening here? Perhaps it’s just a mistake in my setup, or perhaps this behaviour is intentional?
@janbe-ts
I will try to improve this explanation to make it clearer. But chunked=True will only help you save memory if you process each chunk one at a time instead of concatenating them all into a single DataFrame, e.g. along the lines of the sketch below.
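(The example that belonged here appears to have been lost; the following is a minimal sketch of the idea, where process_chunk is a hypothetical placeholder and not part of the original answer.)

import awswrangler as wr

def process_chunk(chunk):
    # Hypothetical per-chunk work: filter, aggregate, or write out the chunk,
    # keeping only a small result in memory instead of the full DataFrame.
    return len(chunk)

# chunked=True returns a generator of DataFrames, so only one chunk
# is materialised in memory at a time.
results = [
    process_chunk(chunk)
    for chunk in wr.s3.read_parquet(
        s3_path_to_folder_with_parquet_files,  # same placeholder path as above
        dataset=True,
        columns=['column1', 'column2', 'column3'],
        chunked=True,
    )
]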
Hmm… I also don’t know the reason for the difference, but let me know if you find out.
P.S. If you want to save memory, you should consider the categories argument; it helps in some cases.
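A minimal sketch of the categories suggestion; which columns are worth loading as pandas.Categorical depends on the data, so the choice of column1 here is an assumption.

import awswrangler as wr

# categories takes a list of column names to be returned as pandas.Categorical,
# which can shrink memory usage for repetitive string columns.
dfs = wr.s3.read_parquet(
    s3_path_to_folder_with_parquet_files,  # same placeholder path as above
    dataset=True,
    columns=['column1', 'column2', 'column3'],
    categories=['column1'],  # assumption: column1 is a low-cardinality string column
    chunked=True,
)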
Hi @igorborgest, thanks for your quick and detailed reply!
You were right, I had a slightly different setup locally. I used fastparquet as the parquet engine and accessed a nested column field a.b.c rather than a.b in my wrangler setup (I didn’t expect it to make such a big difference). I can confirm that the memory usage is almost equal with the same columns and using the pyarrow engine.
Based on this line in the code, I expected chunked=True to make it more memory-efficient. Thanks for pointing out that this is not the case in my setup 😃
Also, I wonder why the Activity Monitor shows me a completely different amount of memory used by the program.
Anyway, as this is certainly not a memory leak, I will close this issue!