
A long time before jobs start running when reading parquet files

See original GitHub issue

I am using dask-yarn on Amazon EMR almost exactly as described in the documentation, including your handy bootstrap script (thanks).

I have ~16,000 parquet files (~110 GB on disk) that I’m trying to load into a dask dataframe from s3. The operation

import dask.dataframe as dd
dd.read_parquet('s3://bucket/path/to/parquet-files')

hangs longer than I have the patience to wait for (more than an hour). The workers never light up in the dask dashboard.

So, to try to get some insight, I wrote a function that just loads each parquet file into a pandas dataframe and returns its memory usage. I then randomly selected 100 of the files and submitted that function as a dask.delayed, i.e.

dask.compute(*map(dask.delayed(get_memory_usage_of_parquet_file), subset_of_100_parquet_s3_paths))

This takes about 3 minutes to run. About 5 seconds of that is the actual work being done by the workers. Each file takes up about 70-80 MB in memory, so it’s not a memory overrun. (That’s why I was checking this.)
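
For concreteness, here is a minimal sketch of that per-file benchmark; the body of get_memory_usage_of_parquet_file is reconstructed from the description above (it is not in the original issue), and subset_of_100_parquet_s3_paths is assumed to be a plain list of s3:// paths.

import dask
import pandas as pd

def get_memory_usage_of_parquet_file(s3_path):
    # Load a single parquet file into pandas and report its in-memory
    # footprint. Reading s3:// paths directly assumes s3fs is installed
    # alongside pyarrow or fastparquet.
    df = pd.read_parquet(s3_path)
    return int(df.memory_usage(deep=True).sum())

tasks = [dask.delayed(get_memory_usage_of_parquet_file)(path)
         for path in subset_of_100_parquet_s3_paths]
sizes = dask.compute(*tasks)

dask.compute(*tasks) blocks until all one hundred delayed tasks have run on the cluster and returns their results as a tuple of byte counts.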

What am I missing? Is there some configuration change that will make this more performant?
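
One configuration angle worth noting (it is not spelled out in the issue, so treat the snippet below as a hedged sketch rather than a confirmed fix): when a dataset has no _metadata summary file, dask has to open every one of the ~16,000 footers just to build the task graph, and dask releases from around the time of this issue exposed a gather_statistics keyword on read_parquet to skip most of that work.

import dask.dataframe as dd

# Hedged sketch: skip up-front row-group statistics collection so the
# graph can be built without reading every footer. `gather_statistics`
# is the dask 2.x keyword; recent dask versions expose this as
# `calculate_divisions` instead.
df = dd.read_parquet(
    's3://bucket/path/to/parquet-files',
    engine='pyarrow',          # or 'fastparquet'
    gather_statistics=False,
)

The trade-off is that dask no longer learns the index divisions, so operations that rely on sorted partitions lose that optimization.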

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 29 (25 by maintainers)

Top GitHub Comments

1 reaction
wesm commented, Jul 26, 2019

Readability issues with https://github.com/dask/dask/blob/master/dask/dataframe/io/parquet/fastparquet.py aside (I had to spend some time parsing what is going on in read_metadata; I would personally refactor into some helper functions to improve readability), I don’t see a problem with doing the same work for the Arrow engine. I believe there are still some follow-up items having to do with the _metadata and _common_metadata files, but I have not kept track of what they are. If you run into issues, please do open JIRA issues so we don’t lose track of the work.

1 reaction
birdsarah commented, Jul 23, 2019

Update: pyarrow now creates _metadata files (as of v0.14.0): https://issues.apache.org/jira/browse/ARROW-1983
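
To make that concrete, the pyarrow documentation describes roughly the following pattern for producing a _metadata summary file while writing a dataset; this is a hedged sketch, not code from the thread, and it assumes table is an existing pyarrow.Table and a local root path for simplicity.

import pyarrow.parquet as pq

# Collect per-file metadata while writing the dataset pieces...
collector = []
pq.write_to_dataset(table, root_path='dataset_root',
                    metadata_collector=collector)

# ...then merge it into a single _metadata summary file
# (pyarrow >= 0.14, per ARROW-1983).
pq.write_metadata(table.schema, 'dataset_root/_metadata',
                  metadata_collector=collector)

With _metadata in place, a reader such as dask only has to open one footer instead of one per file when constructing the graph.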


Top Results From Across the Web

Why spark.read.parquet() runs 2 jobs? - Stack Overflow
Spark reads the file twice: once to evolve the schema and once to create the DataFrame. Once the schema has been generated, the DataFrame...

Diving into Spark and Parquet Workloads, by Example
The next step is to use the Spark DataFrame API to lazily read the files from Parquet and register the resulting DataFrame as...

Parquet Files - Spark 3.3.1 Documentation
Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data.

Apache Spark job fails with Parquet column ... - Microsoft Learn
Problem: You are reading data in Parquet format and writing to a Delta table when you get a Parquet column cannot be converted...

Parquet files - Databricks Community
I want to process some parquet files (with snappy compression) using ... trying to run a streaming job in Databricks, used Autoloader approach...
