A long time before jobs start running when reading parquet files

I am using dask-yarn on Amazon EMR almost exactly as described in the documentation, including the handy bootstrap script (thanks).

I have ~16,000 parquet files (~110 GB on disk) that I'm trying to load into a dask dataframe from S3. The operation

```python
import dask.dataframe as dd
df = dd.read_parquet('s3://bucket/path/to/parquet-files')
```

hangs longer than I have the patience to wait for (more than an hour). The workers never light up in the dask dashboard.
So, to try to get some insight, I wrote a function that just loads a single parquet file into a pandas dataframe and returns its memory usage. I then randomly selected 100 of the files and submitted that function via dask.delayed, i.e.

```python
dask.compute(*map(dask.delayed(get_memory_usage_of_parquet_file),
                  subset_of_100_parquet_s3_paths))
```
This takes about 3 minutes to run. About 5 seconds of that is the actual work being done by the workers. Each file takes up about 70-80 MB in memory, so it’s not a memory overrun. (That’s why I was checking this.)
What am I missing? Is there some configuration change that will make this more performant?
Issue Analytics
- Created: 4 years ago
- Comments: 29 (25 by maintainers)

Readability issues with https://github.com/dask/dask/blob/master/dask/dataframe/io/parquet/fastparquet.py aside (I had to spend some time parsing what is going on in read_metadata; I would personally refactor it into some helper functions for readability), I don't see a problem with doing the same work for the Arrow engine. I believe there are still some follow-up items having to do with the _metadata and _common_metadata files, but I have not kept track of what they are. If you run into issues, please do open JIRA issues so we don't lose track of the work.
Update: pyarrow can now write a _metadata file (v0.14.0): https://issues.apache.org/jira/browse/ARROW-1983