`read_csv` fails with multi-node Ray Client
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
- Modin version (`modin.__version__`): 0.10
- Python version: 3.8
- Code we can use to reproduce:
I create a sample `test.csv` file in the current working directory. Then I start a remote Ray cluster.
I then run this script from my laptop:
import ray
import modin.pandas as pd
runtime_env = {"working_dir": ".", "pip": ["modin"]}
ray.client("<head_node_host>:10001").env(runtime_env).connect()
df = pd.read_csv("test.csv")
print(df)  # DataFrames have no show() method; print the frame instead
And it fails with
Traceback (most recent call last):
File "test.py", line 7, in <module>
df = pd.read_csv("test.csv")
File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/pandas/io.py", line 133, in read_csv
return _read(**kwargs)
File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/pandas/io.py", line 59, in _read
pd_obj = FactoryDispatcher.read_csv(**kwargs)
File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/data_management/factories/dispatcher.py", line 172, in read_csv
return cls.__factory._read_csv(**kwargs)
File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/data_management/factories/factories.py", line 206, in _read_csv
return cls.io_cls.read_csv(**kwargs)
File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/engines/base/io/file_dispatcher.py", line 67, in read
query_compiler = cls._read(*args, **kwargs)
File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/engines/base/io/text/csv_dispatcher.py", line 160, in _read
new_query_compiler = cls._get_new_qc(
File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/engines/base/io/text/csv_dispatcher.py", line 302, in _get_new_qc
new_index, row_lengths = cls._define_index(index_ids, index_col_md, index_name)
File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/engines/base/io/text/csv_dispatcher.py", line 252, in _define_index
row_lengths = cls.materialize(index_ids)
File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/engines/ray/task_wrapper.py", line 82, in materialize
return ray.get(obj_id)
File "/Users/amog/dev/product/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 61, in wrapper
return getattr(ray, func.__name__)(*args, **kwargs)
File "/Users/amog/dev/product/lib/python3.8/site-packages/ray/util/client/api.py", line 42, in get
return self.worker.get(vals, timeout=timeout)
File "/Users/amog/dev/product/lib/python3.8/site-packages/ray/util/client/worker.py", line 202, in get
res = self._get(obj_ref, op_timeout)
File "/Users/amog/dev/product/lib/python3.8/site-packages/ray/util/client/worker.py", line 225, in _get
raise err
types.RayTaskError(FileNotFoundError): ray::deploy_ray_func() (pid=689, ip=172.31.51.109)
File "python/ray/_raylet.pyx", line 501, in ray._raylet.execute_task
File "/tmp/ray/session_2021-06-22_17-00-38_682792_154/runtime_resources/conda/ray-39cb33edecee2d16bb9f77300d24774637f45439/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 330, in _function_with_tracing
return function(*args, **kwargs)
File "/Users/amog/dev/product/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 330, in _function_with_tracing
File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/engines/ray/task_wrapper.py", line 40, in deploy_ray_func
File "/tmp/ray/session_2021-06-22_17-00-38_682792_154/runtime_resources/conda/ray-39cb33edecee2d16bb9f77300d24774637f45439/lib/python3.8/site-packages/modin/backends/pandas/parsers.py", line 216, in parse
bio = FileDispatcher.file_open(
File "/tmp/ray/session_2021-06-22_17-00-38_682792_154/runtime_resources/conda/ray-39cb33edecee2d16bb9f77300d24774637f45439/lib/python3.8/site-packages/modin/engines/base/io/file_dispatcher.py", line 199, in file_open
return open(file_path, mode=mode)
FileNotFoundError: [Errno 2] No such file or directory: '/Users/amog/dev/test_project/test.csv'
Unhandled error (suppress with RAY_IGNORE_UNHANDLED_ERRORS=1): ray::deploy_ray_func() (pid=689, ip=172.31.51.109)
File "python/ray/_raylet.pyx", line 501, in ray._raylet.execute_task
File "/tmp/ray/session_2021-06-22_17-00-38_682792_154/runtime_resources/conda/ray-39cb33edecee2d16bb9f77300d24774637f45439/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 330, in _function_with_tracing
return function(*args, **kwargs)
File "/Users/amog/dev/product/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 330, in _function_with_tracing
File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/engines/ray/task_wrapper.py", line 40, in deploy_ray_func
File "/tmp/ray/session_2021-06-22_17-00-38_682792_154/runtime_resources/conda/ray-39cb33edecee2d16bb9f77300d24774637f45439/lib/python3.8/site-packages/modin/backends/pandas/parsers.py", line 216, in parse
bio = FileDispatcher.file_open(
File "/tmp/ray/session_2021-06-22_17-00-38_682792_154/runtime_resources/conda/ray-39cb33edecee2d16bb9f77300d24774637f45439/lib/python3.8/site-packages/modin/engines/base/io/file_dispatcher.py", line 199, in file_open
return open(file_path, mode=mode)
FileNotFoundError: [Errno 2] No such file or directory: '/Users/amog/dev/test_project/test.csv'
Describe the problem
Modin appears to expect the CSV file to exist at the same path on both the client and the server. Here that path exists only on the client, so the read tasks running on the cluster fail with `FileNotFoundError`.
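The failure mode can be illustrated without Ray or Modin. The following is a simulation under stated assumptions, not Modin's actual code path: two temporary directories stand in for the client's and the server's working directories, `client_resolve` and `worker_open` are hypothetical helpers, and the file is assumed to have been uploaded to the server's working directory by the `working_dir` runtime env. As the traceback suggests, the path that reaches the worker is the client's absolute path.

```python
import tempfile
from pathlib import Path

CSV_TEXT = "a,b\n1,2\n"


def client_resolve(relative_path: str, client_cwd: Path) -> Path:
    """Hypothetical: make the user's path absolute against the client's cwd."""
    return client_cwd / relative_path


def worker_open(path: Path, server_root: Path) -> str:
    """Hypothetical: a worker can only see files under its own root."""
    try:
        path.relative_to(server_root)
    except ValueError:
        raise FileNotFoundError(f"No such file or directory: '{path}'") from None
    return path.read_text()


def simulate() -> tuple:
    """Return (client_path_failed_on_worker, text_read_via_server_path)."""
    with tempfile.TemporaryDirectory() as client_dir, \
            tempfile.TemporaryDirectory() as server_dir:
        client_cwd, server_cwd = Path(client_dir), Path(server_dir)
        (client_cwd / "test.csv").write_text(CSV_TEXT)
        # Assumption: the working_dir upload put a copy on the server.
        (server_cwd / "test.csv").write_text(CSV_TEXT)

        # The client's absolute path is what gets shipped to the worker.
        shipped = client_resolve("test.csv", client_cwd)
        try:
            worker_open(shipped, server_cwd)
            client_path_failed = False
        except FileNotFoundError:
            client_path_failed = True

        # Resolving the same relative path on the server side succeeds.
        server_text = worker_open(server_cwd / "test.csv", server_cwd)
        return client_path_failed, server_text


if __name__ == "__main__":
    failed, text = simulate()
    print("client path fails on worker:", failed)
    print("server-relative path works:", text == CSV_TEXT)
```

In this simulation the file is present on the server the whole time; only the shipped path is wrong.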
Possible solutions to this:
- Require the user to use cloud storage or shared file system that’s accessible from both the client and the server (too much usability overhead)
- Read the CSV entirely on the client side and require the CSV to only exist on the client side (inefficient for large datasets)
- Read the CSV entirely on the server side and require the user to pass in the path of the CSV file on the server (unintuitive for users)
- Have Modin support Ray runtime envs where path to file is different on client vs. server
- Have Ray runtime envs automatically reroute client file paths to server file paths
Issue Analytics
- State:
- Created 2 years ago
- Comments:12 (8 by maintainers)
@devin-petersohn Will it work on Modin’s side if we provide an API `ray.get_runtime_context().get_working_dir(): str` which returns the current working directory? (If called on the client side, it will return `/users/Alex/path/`, and if called in a Ray worker on the server side, it will return `/home/ray/project/`, to use Alex’s example above.) That way, appending the relative path will always result in a valid absolute path, whether on the client side or on the server side. cc @iycheng
nit: maybe it should be a `pathlib.Path`?
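A minimal sketch of how such a call could be used, under stated assumptions: `get_working_dir` below is a hypothetical stand-in for the API proposed in this comment (not a confirmed Ray function), the `on_server` flag is an illustration device, and the two directories are the example paths quoted above. It returns path objects, in line with the `pathlib` suggestion.

```python
from pathlib import PurePosixPath

# Example directories taken from the comment; purely illustrative.
CLIENT_WORKING_DIR = PurePosixPath("/users/Alex/path")
SERVER_WORKING_DIR = PurePosixPath("/home/ray/project")


def get_working_dir(on_server: bool) -> PurePosixPath:
    """Hypothetical stand-in for the proposed runtime-context call."""
    return SERVER_WORKING_DIR if on_server else CLIENT_WORKING_DIR


def resolve(user_path: str, on_server: bool) -> PurePosixPath:
    """Resolve a user-supplied path where it will actually be opened."""
    path = PurePosixPath(user_path)
    if path.is_absolute():
        # Absolute paths are left untouched in this sketch.
        return path
    return get_working_dir(on_server) / path


if __name__ == "__main__":
    print(resolve("test.csv", on_server=False))  # /users/Alex/path/test.csv
    print(resolve("test.csv", on_server=True))   # /home/ray/project/test.csv
```

The point of the design is that the relative path itself travels to the worker and is made absolute only at the place where the file is opened.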