
Workers stuck, increased memory usage while processing large CSV from S3.

See original GitHub issue

I’m processing a dataframe stored as a (relatively) large CSV on S3, using the distributed scheduler with multiprocessing (1 thread per worker process, --no-nanny). Workers seem to be accumulating data and getting stuck; in some cases this also leads to failure of the whole job.

I came up with a minimal reproducing example, shown below (it only reads and writes the CSV):

import dask.dataframe as df          # alias kept to match the original snippet
from dask.distributed import Client, progress

# Placeholders (not in the original snippet): a client connected to the existing
# scheduler (workers started with 1 thread each and --no-nanny) and the S3 paths.
client = Client('tcp://scheduler-address:8786')
input_url = 's3://my-bucket/input.csv'
output_url = 's3://my-bucket/output-*.csv'

# Read the ~1.2 GB CSV in 1 MB blocks, parsing every column as a string
frame = df.read_csv(input_url,
                    collection=True,
                    blocksize=1024*1024,
                    compression=None,
                    lineterminator='\n',
                    dtype=str,
                    sep=',',
                    quotechar='"',
                    encoding='utf-8')

# Build the write tasks lazily instead of computing them right away
fun_list = frame.to_csv(output_url,
                        compute=False,
                        encoding='utf-8',
                        index=False,
                        index_label=False)

# Submit the tasks, show a progress bar, and block until they finish
futures = client.compute(fun_list)
progress(futures)
client.gather(futures)

This hangs forever with progress stuck at 0%. The worker log shows:

distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 7.16 GB -- Worker memory limit: 8.02 GB

The file itself is only 1.2 GB, though. Using distributed 1.19.2 and dask 0.15.4.
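For context, here is a minimal diagnostic sketch (not part of the original report) that checks how close each worker is to the limit mentioned in the warning. It assumes the same connected client as above and that psutil is installed on the workers; Client.run injects the dask_worker argument with the worker instance.

import os
import psutil

def memory_report(dask_worker):
    # dask_worker is provided by Client.run; report the configured limit
    # and the worker process's current resident memory
    rss = psutil.Process(os.getpid()).memory_info().rss
    return {'memory_limit': dask_worker.memory_limit, 'rss': rss}

# Returns a dict keyed by worker address, e.g. {'tcp://10.0.0.1:43210': {...}}
print(client.run(memory_report))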

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Reactions: 12
  • Comments: 32 (12 by maintainers)

Top GitHub Comments

1 reaction
manugarri commented, Sep 10, 2019

Just commenting: disabling the pandas chained assignment option made my ETL job go from running out of memory after 90 minutes to finishing in 17 minutes! I think we can close this issue since it's related to pandas (and thanks @jeffreyliu, a year and a half later, for your comment!)

1 reaction
jeffreyliu commented, Jan 31, 2018

Yes, that seemed to be the issue. This thread helped get it to work.

Turning off the pandas option (pd.options.mode.chained_assignment = None) allows the dataframe to load in a reasonable amount of time.
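A hedged sketch of applying that fix in the distributed setup above: because each dask worker runs in its own process, the option has to be set on every worker, not just in the client process. It assumes the connected client from the reproduction snippet; Client.run executes the function once on each worker.

import pandas as pd

def disable_chained_assignment_warning():
    import pandas as pd
    pd.options.mode.chained_assignment = None

# Set the option locally for any pandas work done in the client process ...
pd.options.mode.chained_assignment = None
# ... and once on every worker process.
client.run(disable_chained_assignment_warning)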

Read more comments on GitHub >

Top Results From Across the Web

Workers stuck, increased memory usage while processing large ...
I'm processing a dataframe stored as a (relatively) large CSV on S3. Using distributed scheduler with multiprocessing (1 thread per 1 worker process, ...)

java.lang.OutOfMemoryError While processing a Large CSV file
The application starts a thread to download the CSV from S3 and process it. It works fine for some time, but an OutOfMemoryError occurs halfway through processing ...

Dask and Pandas: There's No Such Thing as Too Much Data
Data is too large to hold in memory (memory constraint). If you find yourself heavily downsampling data that might otherwise be useful, because ...

Apache Spark: Out Of Memory Issue? - Clairvoyant Blog
We can solve this problem with two approaches: either use spark.driver.maxResultSize or repartition. Setting a proper limit using spark.driver. ...

Avoiding Rookie Mistakes when using AWS Glue
Fewer tasks to maintain means less memory pressure on the driver, increased performance and a more robust ETL. AWS Glue Crawler. AWS Glue ...
