`read_csv` fails with multi-node Ray Client

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
Modin version (modin.__version__): 0.10
Python version: 3.8
Code we can use to reproduce: I create a sample test.csv file in the current working directory. Then I start a remote Ray cluster.

I then run this script from my laptop:

import ray
import modin.pandas as pd

runtime_env = {"working_dir": ".", "pip": ["modin"]}
ray.client("<head_node_host>:10001").env(runtime_env).connect()

df = pd.read_csv("test.csv")
print(df.head())

And it fails with:

Traceback (most recent call last):
  File "test.py", line 7, in <module>
    df = pd.read_csv("test.csv")
  File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/pandas/io.py", line 133, in read_csv
    return _read(**kwargs)
  File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/pandas/io.py", line 59, in _read
    pd_obj = FactoryDispatcher.read_csv(**kwargs)
  File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/data_management/factories/dispatcher.py", line 172, in read_csv
    return cls.__factory._read_csv(**kwargs)
  File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/data_management/factories/factories.py", line 206, in _read_csv
    return cls.io_cls.read_csv(**kwargs)
  File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/engines/base/io/file_dispatcher.py", line 67, in read
    query_compiler = cls._read(*args, **kwargs)
  File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/engines/base/io/text/csv_dispatcher.py", line 160, in _read
    new_query_compiler = cls._get_new_qc(
  File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/engines/base/io/text/csv_dispatcher.py", line 302, in _get_new_qc
    new_index, row_lengths = cls._define_index(index_ids, index_col_md, index_name)
  File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/engines/base/io/text/csv_dispatcher.py", line 252, in _define_index
    row_lengths = cls.materialize(index_ids)
  File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/engines/ray/task_wrapper.py", line 82, in materialize
    return ray.get(obj_id)
  File "/Users/amog/dev/product/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 61, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/Users/amog/dev/product/lib/python3.8/site-packages/ray/util/client/api.py", line 42, in get
    return self.worker.get(vals, timeout=timeout)
  File "/Users/amog/dev/product/lib/python3.8/site-packages/ray/util/client/worker.py", line 202, in get
    res = self._get(obj_ref, op_timeout)
  File "/Users/amog/dev/product/lib/python3.8/site-packages/ray/util/client/worker.py", line 225, in _get
    raise err
types.RayTaskError(FileNotFoundError): ray::deploy_ray_func() (pid=689, ip=172.31.51.109)
  File "python/ray/_raylet.pyx", line 501, in ray._raylet.execute_task
  File "/tmp/ray/session_2021-06-22_17-00-38_682792_154/runtime_resources/conda/ray-39cb33edecee2d16bb9f77300d24774637f45439/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 330, in _function_with_tracing
    return function(*args, **kwargs)
  File "/Users/amog/dev/product/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 330, in _function_with_tracing
  File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/engines/ray/task_wrapper.py", line 40, in deploy_ray_func
  File "/tmp/ray/session_2021-06-22_17-00-38_682792_154/runtime_resources/conda/ray-39cb33edecee2d16bb9f77300d24774637f45439/lib/python3.8/site-packages/modin/backends/pandas/parsers.py", line 216, in parse
    bio = FileDispatcher.file_open(
  File "/tmp/ray/session_2021-06-22_17-00-38_682792_154/runtime_resources/conda/ray-39cb33edecee2d16bb9f77300d24774637f45439/lib/python3.8/site-packages/modin/engines/base/io/file_dispatcher.py", line 199, in file_open
    return open(file_path, mode=mode)
FileNotFoundError: [Errno 2] No such file or directory: '/Users/amog/dev/test_project/test.csv'
Unhandled error (suppress with RAY_IGNORE_UNHANDLED_ERRORS=1): ray::deploy_ray_func() (pid=689, ip=172.31.51.109)
  File "python/ray/_raylet.pyx", line 501, in ray._raylet.execute_task
  File "/tmp/ray/session_2021-06-22_17-00-38_682792_154/runtime_resources/conda/ray-39cb33edecee2d16bb9f77300d24774637f45439/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 330, in _function_with_tracing
    return function(*args, **kwargs)
  File "/Users/amog/dev/product/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 330, in _function_with_tracing
  File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/engines/ray/task_wrapper.py", line 40, in deploy_ray_func
  File "/tmp/ray/session_2021-06-22_17-00-38_682792_154/runtime_resources/conda/ray-39cb33edecee2d16bb9f77300d24774637f45439/lib/python3.8/site-packages/modin/backends/pandas/parsers.py", line 216, in parse
    bio = FileDispatcher.file_open(
  File "/tmp/ray/session_2021-06-22_17-00-38_682792_154/runtime_resources/conda/ray-39cb33edecee2d16bb9f77300d24774637f45439/lib/python3.8/site-packages/modin/engines/base/io/file_dispatcher.py", line 199, in file_open
    return open(file_path, mode=mode)
FileNotFoundError: [Errno 2] No such file or directory: '/Users/amog/dev/test_project/test.csv'
Unhandled error (suppress with RAY_IGNORE_UNHANDLED_ERRORS=1): ray::deploy_ray_func() (pid=689, ip=172.31.51.109)
  File "python/ray/_raylet.pyx", line 501, in ray._raylet.execute_task
  File "/tmp/ray/session_2021-06-22_17-00-38_682792_154/runtime_resources/conda/ray-39cb33edecee2d16bb9f77300d24774637f45439/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 330, in _function_with_tracing
    return function(*args, **kwargs)
  File "/Users/amog/dev/product/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 330, in _function_with_tracing
  File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/engines/ray/task_wrapper.py", line 40, in deploy_ray_func
  File "/tmp/ray/session_2021-06-22_17-00-38_682792_154/runtime_resources/conda/ray-39cb33edecee2d16bb9f77300d24774637f45439/lib/python3.8/site-packages/modin/backends/pandas/parsers.py", line 216, in parse
    bio = FileDispatcher.file_open(
  File "/tmp/ray/session_2021-06-22_17-00-38_682792_154/runtime_resources/conda/ray-39cb33edecee2d16bb9f77300d24774637f45439/lib/python3.8/site-packages/modin/engines/base/io/file_dispatcher.py", line 199, in file_open
    return open(file_path, mode=mode)
FileNotFoundError: [Errno 2] No such file or directory: '/Users/amog/dev/test_project/test.csv'
Unhandled error (suppress with RAY_IGNORE_UNHANDLED_ERRORS=1): ray::deploy_ray_func() (pid=689, ip=172.31.51.109)
  File "python/ray/_raylet.pyx", line 501, in ray._raylet.execute_task
  File "/tmp/ray/session_2021-06-22_17-00-38_682792_154/runtime_resources/conda/ray-39cb33edecee2d16bb9f77300d24774637f45439/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 330, in _function_with_tracing
    return function(*args, **kwargs)
  File "/Users/amog/dev/product/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 330, in _function_with_tracing
  File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/engines/ray/task_wrapper.py", line 40, in deploy_ray_func
  File "/tmp/ray/session_2021-06-22_17-00-38_682792_154/runtime_resources/conda/ray-39cb33edecee2d16bb9f77300d24774637f45439/lib/python3.8/site-packages/modin/backends/pandas/parsers.py", line 216, in parse
    bio = FileDispatcher.file_open(
  File "/tmp/ray/session_2021-06-22_17-00-38_682792_154/runtime_resources/conda/ray-39cb33edecee2d16bb9f77300d24774637f45439/lib/python3.8/site-packages/modin/engines/base/io/file_dispatcher.py", line 199, in file_open
    return open(file_path, mode=mode)
FileNotFoundError: [Errno 2] No such file or directory: '/Users/amog/dev/test_project/test.csv'

Describe the problem

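It seems like Modin expects the CSV file to exist at the same path on both the client and the server. That isn’t the case here: the traceback shows the Ray worker trying to open the client-side path /Users/amog/dev/test_project/test.csv, which does not exist on the cluster, so the read fails.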

Possible solutions to this:

Require the user to use cloud storage or a shared file system that’s accessible from both the client and the server (too much usability overhead)
Read the CSV entirely on the client side and require it to exist only on the client (inefficient for large datasets)
Read the CSV entirely on the server side and require the user to pass in the file’s path on the server (unintuitive for users)
Have Modin support Ray runtime envs where the path to the file differs between client and server
Have Ray runtime envs automatically reroute client file paths to server file paths
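
To make the last two options concrete, here is a minimal sketch of what rerouting a client path to a server path could look like. It is not how Modin or Ray do it: `reroute_to_server` is a hypothetical helper, and the server directory /home/ray/project is an assumed example. Only the client path comes from the traceback above.

```python
from pathlib import PurePosixPath


def reroute_to_server(client_path: str, client_root: str, server_root: str) -> str:
    """Rewrite an absolute client path so it points into the server's working dir.

    Hypothetical helper for illustration; not part of Modin or Ray.
    """
    try:
        # Strip the client-side prefix, keeping only the part inside the working dir.
        relative = PurePosixPath(client_path).relative_to(client_root)
    except ValueError:
        # The file lives outside the uploaded working dir, so the server has no copy.
        raise FileNotFoundError(
            f"{client_path} is outside the working dir {client_root}"
        ) from None
    return str(PurePosixPath(server_root) / relative)


# Client path taken from the traceback; the server directory is an assumed example.
print(
    reroute_to_server(
        "/Users/amog/dev/test_project/test.csv",
        "/Users/amog/dev/test_project",
        "/home/ray/project",
    )
)
```

A path outside the uploaded working dir raises FileNotFoundError in this sketch, because there would be no server-side copy to point at.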

Top GitHub Comments

architkulkarni commented, Jun 29, 2021

@devin-petersohn Will it work on Modin’s side if we provide an API ray.get_runtime_context().get_working_dir(): str which returns the current working directory? (If called on the client side, it will return /users/Alex/path/, and if called in a Ray worker on the server side, it will return /home/ray/project/, to use Alex’s example above.) That way, appending the relative path will always result in a valid absolute path, whether on the client side or on the server side. cc @iycheng

wuisawesome commented, Jun 29, 2021

nit: maybe it should be a pathlib.Path?
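
The proposal in the comments above can be sketched roughly as follows. This is an illustration only: `get_working_dir` here is a hypothetical stand-in for the proposed API (the snippet does not call Ray, and the thread does not confirm that the API exists), the `side` argument is an assumption used to simulate client versus worker, and the two directories are the example paths from the comment. It returns a path object rather than a str, following the nit.

```python
from pathlib import PurePosixPath

# Assumed working directories, borrowed from the example in the comment above.
_WORKING_DIRS = {
    "client": PurePosixPath("/users/Alex/path"),
    "server": PurePosixPath("/home/ray/project"),
}


def get_working_dir(side: str) -> PurePosixPath:
    """Hypothetical stand-in for the proposed get_working_dir() API."""
    return _WORKING_DIRS[side]


def resolve(relative_path: str, side: str) -> PurePosixPath:
    """Anchor a relative path at the working dir of whichever side is running."""
    relative = PurePosixPath(relative_path)
    if relative.is_absolute():
        raise ValueError("expected a path relative to the working dir")
    return get_working_dir(side) / relative


# The same relative path yields a different absolute path on each side.
for side in ("client", "server"):
    print(side, resolve("test.csv", side))
```

The point of the design is that user code only ever passes the relative path, and each process anchors it at its own working directory.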
