Ray crashes when writing a large DataFrame to CSV
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 20.04
- Modin version (modin.__version__): pip installed from branch rehan/issues/2656
- Python version: 3.8
- Code we can use to reproduce:
import modin.pandas as pd
df = pd.concat([
pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-01.csv", quoting=3),
pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-02.csv", quoting=3),
pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-03.csv", quoting=3),
pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-04.csv", quoting=3),
pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-05.csv", quoting=3),
pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-06.csv", quoting=3),
pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-07.csv", quoting=3),
pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-08.csv", quoting=3),
pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-09.csv", quoting=3),
pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-10.csv", quoting=3),
pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-11.csv", quoting=3),
pd.read_csv("s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-12.csv", quoting=3),
])
df.to_csv('nyc_taxi.csv', index=False)
Describe the problem
Ray seems to crash when I try to write a large data frame to a CSV. This data frame is about 23 GB.
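(One workaround to try, which avoids the distributed to_csv path entirely: collect the frame into plain pandas on the driver and write from there. This is a minimal sketch, assuming the driver has enough RAM to hold the ~23 GB frame; modin.utils.to_pandas is Modin's conversion helper, and df is the frame built above.)

from modin.utils import to_pandas

# Materialize the Modin frame as a single pandas DataFrame on the driver.
# This needs roughly 23 GB of driver RAM for this dataset, but it bypasses
# the parallel writer that triggers the crash below.
pandas_df = to_pandas(df)
pandas_df.to_csv("nyc_taxi.csv", index=False)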
Source code / logs
(pid=86889) [2021-04-20 19:47:02,782 C 86889 88394] core_worker.cc:190: Check failed: instance_ The core worker process is not initialized yet or already shutdown.
(pid=86889) *** StackTrace Information ***
(pid=86889) @ 0x7f3db0e9b795 google::GetStackTraceToString()
(pid=86889) @ 0x7f3db0e12efe ray::GetCallTrace()
(pid=86889) @ 0x7f3db0e38304 ray::RayLog::~RayLog()
(pid=86889) @ 0x7f3db09fe4e2 ray::CoreWorkerProcess::EnsureInitialized()
(pid=86889) @ 0x7f3db0a07352 ray::CoreWorkerProcess::GetCoreWorker()
(pid=86889) @ 0x7f3db095f8c8 __pyx_pw_3ray_7_raylet_10CoreWorker_61profile_event()
(pid=86889) @ 0x55a3594e81c7 method_vectorcall_VARARGS_KEYWORDS
(pid=86889) @ 0x55a35944375e _PyEval_EvalFrameDefault.cold.2790
(pid=86889) @ 0x55a3594cda92 _PyEval_EvalCodeWithName
(pid=86889) @ 0x55a3594ce943 _PyFunction_Vectorcall.localalias.355
(pid=86889) @ 0x55a35944377f _PyEval_EvalFrameDefault.cold.2790
(pid=86889) @ 0x55a3594ce86b _PyFunction_Vectorcall.localalias.355
(pid=86889) @ 0x55a35944375e _PyEval_EvalFrameDefault.cold.2790
(pid=86889) @ 0x55a3594ce86b _PyFunction_Vectorcall.localalias.355
(pid=86889) @ 0x55a3594ceee7 method_vectorcall
(pid=86889) @ 0x55a359480041 PyVectorcall_Call
(pid=86889) @ 0x55a35950599b _PyEval_EvalFrameDefault
(pid=86889) @ 0x55a3594ce86b _PyFunction_Vectorcall.localalias.355
(pid=86889) @ 0x55a35944375e _PyEval_EvalFrameDefault.cold.2790
(pid=86889) @ 0x55a3594ce86b _PyFunction_Vectorcall.localalias.355
(pid=86889) @ 0x55a35944375e _PyEval_EvalFrameDefault.cold.2790
(pid=86889) @ 0x55a3594ce86b _PyFunction_Vectorcall.localalias.355
(pid=86889) @ 0x55a3594ceee7 method_vectorcall
(pid=86889) @ 0x55a359480041 PyVectorcall_Call
(pid=86889) @ 0x55a3595788be t_bootstrap
(pid=86889) @ 0x55a359525708 pythread_wrapper
(pid=86889) @ 0x7f3db20a6609 start_thread
(pid=86889) @ 0x7f3db1fcd293 clone
(pid=86889)
Comments
Can do - we’re seeing similar bugs when sorting a 10M x 100 DataFrame: Ray workers appear to be OOM-ing and dying with a 14.98 GB object store, and only by quadrupling the object store do the OOMs go away.
I opened an issue here: https://github.com/ray-project/ray/issues/29668
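(For reference, the object store can be sized explicitly when Ray is started, before Modin attaches to it. A minimal sketch, assuming a single-node setup; the 60 GB figure is illustrative - roughly quadruple the 14.98 GB store mentioned above - and must fit in the machine's RAM/shm.)

import ray

# Start Ray with an explicit object store size (in bytes) *before* importing
# Modin, so Modin attaches to this instance instead of auto-initializing one
# with default limits.
ray.init(object_store_memory=60 * 1024**3)

import modin.pandas as pd  # now uses the Ray instance started above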
@RehanSD it really depends on your implementation, i.e. it could be that the library (Modin) uses a lot of heap memory when doing the sort, which likely leads to OOM.
We have an experimental OOM-prevention feature available in Ray master and the next release: https://docs.ray.io/en/master/ray-core/scheduling/ray-oom-prevention.html. I’d suggest enabling it; it will give you a bit more information about whether the application is really using excessive memory.
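(As a sketch of enabling it: the memory monitor is controlled by environment variables that must be set before Ray starts. The variable names below follow the Ray 2.2 docs linked above and are an assumption for other versions - check the docs page for your release.)

import os

# Names per the Ray 2.2 OOM-prevention docs linked above; treat them as an
# assumption for other Ray versions. Set them before Ray starts so the
# raylet it launches inherits them.
os.environ["RAY_memory_monitor_refresh_ms"] = "250"  # "0" disables the monitor
os.environ["RAY_memory_usage_threshold"] = "0.95"    # start killing tasks at 95% node memory

import ray
ray.init()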