A long time before jobs start running when reading parquet files

I am using dask-yarn on Amazon EMR almost exactly as described in the documentation, including the handy bootstrap script (thanks).

I have ~16,000 parquet files (~110 GB on disk) that I'm trying to load into a dask dataframe from S3. The operation

```python
import dask.dataframe as dd
df = dd.read_parquet('s3://bucket/path/to/parquet-files')
```

hangs longer than I have the patience to wait for (more than an hour). The workers never light up in the dask dashboard.
So, to try to get some insight, I wrote a function that just loads a single parquet file into a pandas dataframe and returns its memory usage. I then randomly selected 100 of the files and submitted that function via dask.delayed, i.e.

```python
dask.compute(*map(dask.delayed(get_memory_usage_of_parquet_file),
                  subset_of_100_parquet_s3_paths))
```
This takes about 3 minutes to run. About 5 seconds of that is the actual work being done by the workers. Each file takes up about 70-80 MB in memory, so it’s not a memory overrun. (That’s why I was checking this.)
What am I missing? Is there some configuration change that will make this more performant?
Issue Analytics
- Created: 4 years ago
- Comments: 29 (25 by maintainers)

Readability issues with https://github.com/dask/dask/blob/master/dask/dataframe/io/parquet/fastparquet.py aside (I had to spend some time parsing what is going on in read_metadata; I would personally refactor it into some helper functions for readability), I don't see a problem with doing the same work for the Arrow engine. I believe there are still some follow-up items having to do with the _metadata and _common_metadata files, but I have not kept track of what they are. If you run into issues, please do open JIRA issues so we don't lose track of the work.
Update: pyarrow can now write a _metadata file (v0.14.0): https://issues.apache.org/jira/browse/ARROW-1983