Ray crashes when writing a large DataFrame to CSV
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 20.04
- Modin version (modin.__version__): pip installed from branch rehan/issues/2656
- Python version: 3.8
- Code we can use to reproduce:
import modin.pandas as pd
df = pd.concat([
pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-01.csv", quoting=3),
pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-02.csv", quoting=3),
pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-03.csv", quoting=3),
pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-04.csv", quoting=3),
pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-05.csv", quoting=3),
pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-06.csv", quoting=3),
pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-07.csv", quoting=3),
pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-08.csv", quoting=3),
pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-09.csv", quoting=3),
pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-10.csv", quoting=3),
pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-11.csv", quoting=3),
pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-12.csv", quoting=3),
])
df.to_csv('nyc_taxi.csv', index=False)
Describe the problem
Ray seems to crash when I try to write a large data frame to a CSV. This data frame is about 23 GB.
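(One workaround to try, which avoids the distributed to_csv path entirely: collect the frame into plain pandas on the driver and write from there. This is a minimal sketch, assuming the driver has enough RAM to hold the ~23 GB frame; modin.utils.to_pandas is Modin's conversion helper, and df is the frame built above.)

from modin.utils import to_pandas

# Materialize the Modin frame as a single pandas DataFrame on the driver.
# This needs roughly 23 GB of driver RAM for this dataset, but it bypasses
# the parallel writer that triggers the crash below.
pandas_df = to_pandas(df)
pandas_df.to_csv("nyc_taxi.csv", index=False)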
Source code / logs
(pid=86889) [2021-04-20 19:47:02,782 C 86889 88394] core_worker.cc:190: Check failed: instance_ The core worker process is not initialized yet or already shutdown.
(pid=86889) *** StackTrace Information ***
(pid=86889) @ 0x7f3db0e9b795 google::GetStackTraceToString()
(pid=86889) @ 0x7f3db0e12efe ray::GetCallTrace()
(pid=86889) @ 0x7f3db0e38304 ray::RayLog::~RayLog()
(pid=86889) @ 0x7f3db09fe4e2 ray::CoreWorkerProcess::EnsureInitialized()
(pid=86889) @ 0x7f3db0a07352 ray::CoreWorkerProcess::GetCoreWorker()
(pid=86889) @ 0x7f3db095f8c8 __pyx_pw_3ray_7_raylet_10CoreWorker_61profile_event()
(pid=86889) @ 0x55a3594e81c7 method_vectorcall_VARARGS_KEYWORDS
(pid=86889) @ 0x55a35944375e _PyEval_EvalFrameDefault.cold.2790
(pid=86889) @ 0x55a3594cda92 _PyEval_EvalCodeWithName
(pid=86889) @ 0x55a3594ce943 _PyFunction_Vectorcall.localalias.355
(pid=86889) @ 0x55a35944377f _PyEval_EvalFrameDefault.cold.2790
(pid=86889) @ 0x55a3594ce86b _PyFunction_Vectorcall.localalias.355
(pid=86889) @ 0x55a35944375e _PyEval_EvalFrameDefault.cold.2790
(pid=86889) @ 0x55a3594ce86b _PyFunction_Vectorcall.localalias.355
(pid=86889) @ 0x55a3594ceee7 method_vectorcall
(pid=86889) @ 0x55a359480041 PyVectorcall_Call
(pid=86889) @ 0x55a35950599b _PyEval_EvalFrameDefault
(pid=86889) @ 0x55a3594ce86b _PyFunction_Vectorcall.localalias.355
(pid=86889) @ 0x55a35944375e _PyEval_EvalFrameDefault.cold.2790
(pid=86889) @ 0x55a3594ce86b _PyFunction_Vectorcall.localalias.355
(pid=86889) @ 0x55a35944375e _PyEval_EvalFrameDefault.cold.2790
(pid=86889) @ 0x55a3594ce86b _PyFunction_Vectorcall.localalias.355
(pid=86889) @ 0x55a3594ceee7 method_vectorcall
(pid=86889) @ 0x55a359480041 PyVectorcall_Call
(pid=86889) @ 0x55a3595788be t_bootstrap
(pid=86889) @ 0x55a359525708 pythread_wrapper
(pid=86889) @ 0x7f3db20a6609 start_thread
(pid=86889) @ 0x7f3db1fcd293 clone
(pid=86889)
Comments
Can do - we’re seeing similar bugs when sorting a 10M x 100 DataFrame: Ray workers appear to be OOM-ing and dying with a 14.98 GB object store, and only by quadrupling the object store do the OOMs go away.
I opened an issue here: https://github.com/ray-project/ray/issues/29668
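(For reference, the object store can be sized explicitly when Ray is started, before Modin attaches to it. A minimal sketch, assuming a single-node setup; the 60 GB figure is illustrative - roughly quadruple the 14.98 GB store mentioned above - and must fit in the machine's RAM/shm.)

import ray

# Start Ray with an explicit object store size (in bytes) *before* importing
# Modin, so Modin attaches to this instance instead of auto-initializing one
# with default limits.
ray.init(object_store_memory=60 * 1024**3)

import modin.pandas as pd  # now uses the Ray instance started above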
@RehanSD it really depends on your implementation, i.e. it could be that the library (Modin) uses a lot of heap memory when doing the sort, which likely leads to OOM.
We have an experimental OOM-prevention feature available in Ray master and the next release: https://docs.ray.io/en/master/ray-core/scheduling/ray-oom-prevention.html. I’d suggest enabling it; it will give you a bit more information about whether the application is really using excessive memory.
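(As a sketch of enabling it: the memory monitor is controlled by environment variables that must be set before Ray starts. The variable names below follow the Ray 2.2 docs linked above and are an assumption for other versions - check the docs page for your release.)

import os

# Names per the Ray 2.2 OOM-prevention docs linked above; treat them as an
# assumption for other Ray versions. Set them before Ray starts so the
# raylet it launches inherits them.
os.environ["RAY_memory_monitor_refresh_ms"] = "250"  # "0" disables the monitor
os.environ["RAY_memory_usage_threshold"] = "0.95"    # start killing tasks at 95% node memory

import ray
ray.init()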