`read_csv` fails with multi-node Ray Client

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  • Modin version (modin.__version__): 0.10
  • Python version: 3.8
  • Code we can use to reproduce: I create a sample test.csv file in the current working directory. Then I start a remote Ray cluster.

I then run this script from my laptop:

import ray
import modin.pandas as pd

runtime_env = {"working_dir": ".", "pip": ["modin"]}
ray.client("<head_node_host>:10001").env(runtime_env).connect()

df = pd.read_csv("test.csv")
print(df)

And it fails with:

Traceback (most recent call last):
  File "test.py", line 7, in <module>
    df = pd.read_csv("test.csv")
  File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/pandas/io.py", line 133, in read_csv
    return _read(**kwargs)
  File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/pandas/io.py", line 59, in _read
    pd_obj = FactoryDispatcher.read_csv(**kwargs)
  File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/data_management/factories/dispatcher.py", line 172, in read_csv
    return cls.__factory._read_csv(**kwargs)
  File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/data_management/factories/factories.py", line 206, in _read_csv
    return cls.io_cls.read_csv(**kwargs)
  File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/engines/base/io/file_dispatcher.py", line 67, in read
    query_compiler = cls._read(*args, **kwargs)
  File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/engines/base/io/text/csv_dispatcher.py", line 160, in _read
    new_query_compiler = cls._get_new_qc(
  File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/engines/base/io/text/csv_dispatcher.py", line 302, in _get_new_qc
    new_index, row_lengths = cls._define_index(index_ids, index_col_md, index_name)
  File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/engines/base/io/text/csv_dispatcher.py", line 252, in _define_index
    row_lengths = cls.materialize(index_ids)
  File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/engines/ray/task_wrapper.py", line 82, in materialize
    return ray.get(obj_id)
  File "/Users/amog/dev/product/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 61, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/Users/amog/dev/product/lib/python3.8/site-packages/ray/util/client/api.py", line 42, in get
    return self.worker.get(vals, timeout=timeout)
  File "/Users/amog/dev/product/lib/python3.8/site-packages/ray/util/client/worker.py", line 202, in get
    res = self._get(obj_ref, op_timeout)
  File "/Users/amog/dev/product/lib/python3.8/site-packages/ray/util/client/worker.py", line 225, in _get
    raise err
types.RayTaskError(FileNotFoundError): ray::deploy_ray_func() (pid=689, ip=172.31.51.109)
  File "python/ray/_raylet.pyx", line 501, in ray._raylet.execute_task
  File "/tmp/ray/session_2021-06-22_17-00-38_682792_154/runtime_resources/conda/ray-39cb33edecee2d16bb9f77300d24774637f45439/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 330, in _function_with_tracing
    return function(*args, **kwargs)
  File "/Users/amog/dev/product/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 330, in _function_with_tracing
  File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/engines/ray/task_wrapper.py", line 40, in deploy_ray_func
  File "/tmp/ray/session_2021-06-22_17-00-38_682792_154/runtime_resources/conda/ray-39cb33edecee2d16bb9f77300d24774637f45439/lib/python3.8/site-packages/modin/backends/pandas/parsers.py", line 216, in parse
    bio = FileDispatcher.file_open(
  File "/tmp/ray/session_2021-06-22_17-00-38_682792_154/runtime_resources/conda/ray-39cb33edecee2d16bb9f77300d24774637f45439/lib/python3.8/site-packages/modin/engines/base/io/file_dispatcher.py", line 199, in file_open
    return open(file_path, mode=mode)
FileNotFoundError: [Errno 2] No such file or directory: '/Users/amog/dev/test_project/test.csv'
Unhandled error (suppress with RAY_IGNORE_UNHANDLED_ERRORS=1): ray::deploy_ray_func() (pid=689, ip=172.31.51.109)
  File "python/ray/_raylet.pyx", line 501, in ray._raylet.execute_task
  File "/tmp/ray/session_2021-06-22_17-00-38_682792_154/runtime_resources/conda/ray-39cb33edecee2d16bb9f77300d24774637f45439/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 330, in _function_with_tracing
    return function(*args, **kwargs)
  File "/Users/amog/dev/product/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 330, in _function_with_tracing
  File "/Users/amog/dev/product/lib/python3.8/site-packages/modin/engines/ray/task_wrapper.py", line 40, in deploy_ray_func
  File "/tmp/ray/session_2021-06-22_17-00-38_682792_154/runtime_resources/conda/ray-39cb33edecee2d16bb9f77300d24774637f45439/lib/python3.8/site-packages/modin/backends/pandas/parsers.py", line 216, in parse
    bio = FileDispatcher.file_open(
  File "/tmp/ray/session_2021-06-22_17-00-38_682792_154/runtime_resources/conda/ray-39cb33edecee2d16bb9f77300d24774637f45439/lib/python3.8/site-packages/modin/engines/base/io/file_dispatcher.py", line 199, in file_open
    return open(file_path, mode=mode)
FileNotFoundError: [Errno 2] No such file or directory: '/Users/amog/dev/test_project/test.csv'

Describe the problem

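Modin appears to expect the CSV file to exist at the same path on both the client and the server. The traceback shows the Ray worker trying to open the client's absolute path, /Users/amog/dev/test_project/test.csv, which does not exist on the server, so the read fails.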

Possible solutions to this:

  • Require the user to use cloud storage or a shared file system that’s accessible from both the client and the server (too much usability overhead)
  • Read the CSV entirely on the client side and require the CSV to exist only on the client side (inefficient for large datasets)
  • Read the CSV entirely on the server side and require the user to pass in the path of the CSV file on the server (unintuitive for users)
  • Have Modin support Ray runtime envs where the path to the file differs between client and server
  • Have Ray runtime envs automatically reroute client file paths to server file paths
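
The last option, rerouting client paths to server paths, could look roughly like the sketch below. It is an illustration only, not Ray or Modin code: `reroute_to_server` and its parameters are hypothetical names, the client path is the one from the traceback, and `/home/ray/project` is an assumed server-side working directory.

```python
from pathlib import PurePosixPath


def reroute_to_server(client_path: str, client_root: str, server_root: str) -> PurePosixPath:
    """Map an absolute client path under client_root to the same file under server_root.

    Hypothetical helper for illustration; neither Ray nor Modin ships this function.
    """
    # relative_to() raises ValueError when the file lies outside the uploaded
    # working directory, in which case there is nothing to reroute to.
    relative = PurePosixPath(client_path).relative_to(client_root)
    return PurePosixPath(server_root) / relative


# Client path taken from the traceback; the server root is an assumed example value.
server_path = reroute_to_server(
    "/Users/amog/dev/test_project/test.csv",
    client_root="/Users/amog/dev/test_project",
    server_root="/home/ray/project",
)
print(server_path)
```

A file outside the uploaded working directory cannot be rerouted this way, which is why the sketch lets the `ValueError` propagate instead of guessing a location.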

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 12 (8 by maintainers)

Top GitHub Comments

2 reactions
architkulkarni commented, Jun 29, 2021

@devin-petersohn Would it work on Modin’s side if we provided an API, ray.get_runtime_context().get_working_dir(): str, that returns the current working directory? Called on the client side, it would return /users/Alex/path/; called in a Ray worker on the server side, it would return /home/ray/project/, to use Alex’s example above. That way, appending the relative path would always yield a valid absolute path, whether on the client side or on the server side. cc @iycheng

1 reaction
wuisawesome commented, Jun 29, 2021

nit: maybe it should be a pathlib.Path?
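
The idea in the two comments above can be sketched as follows: keep the path relative and resolve it only in the process that opens the file. This is a minimal sketch under stated assumptions. `get_working_dir()` in the thread is a proposed API, not a confirmed one, so the local function of that name here is a stand-in that returns the process's current directory, and `resolve_data_path` is a hypothetical helper. Following the nit, both return `pathlib.Path` objects.

```python
from pathlib import Path
from typing import Optional


def get_working_dir() -> Path:
    # Stand-in for the API proposed in the thread. That API is an assumption
    # here, so this version simply returns the current process's directory.
    return Path.cwd()


def resolve_data_path(relative_path: str, working_dir: Optional[Path] = None) -> Path:
    """Resolve a relative path against the working directory of this process.

    Hypothetical helper for illustration only.
    """
    path = Path(relative_path)
    if path.is_absolute():
        # An absolute path names a specific location, so it is left untouched.
        return path
    base = working_dir if working_dir is not None else get_working_dir()
    return base / path


# The two directories are the example values quoted in the comment above.
client = resolve_data_path("test.csv", Path("/users/Alex/path"))
server = resolve_data_path("test.csv", Path("/home/ray/project"))
print(client)
print(server)
```

Because only the relative path travels from client to worker, each side ends up with an absolute path that is valid in its own file system.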
