Add documentation about reading local files in a cluster environment
See original GitHub issue

System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 21.04
- Modin version (modin.__version__): 0.10.2
- Python version: 3.7
- Code we can use to reproduce:
import modin.pandas as pd  # traceback produced as seen below
# import pandas as pd  # works
import ray
ray.init(address='auto', _redis_password='xxx')
pd.read_parquet("abs_path_to_parquet_df")
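Under the hood, Modin dispatches the actual pandas.read_parquet call to Ray worker processes, which in a multi-node cluster may run on machines other than the driver; an absolute path therefore has to resolve on every node, not only where the script was launched. A stdlib-only sketch of that requirement (the check_path_on_node helper is hypothetical and merely stands in for the check each worker implicitly performs when it opens the file):

```python
import os

def check_path_on_node(path):
    # On a real cluster this would execute on each worker node; here it
    # only illustrates that the file must be visible wherever the Ray
    # task that parses it happens to land.
    return os.path.exists(path)

# The driver may see a path that a worker node does not:
driver_view = check_path_on_node(os.sep)  # the filesystem root always exists
worker_view = check_path_on_node("/nonexistent/parquet_df.parquet")
print(driver_view, worker_view)  # True False
```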
Describe the problem
---------------------------------------------------------------------------
RayTaskError(FileNotFoundError) Traceback (most recent call last)
/tmp/ipykernel_192025/3678036683.py in <module>
8 pairs = pickle.load(f)
9 score_matrix = pd.read_parquet(path + "score_matrix_" + SAMPLE_RESOLUTION_RULE + ".parquet")
---> 10 pvalue_matrix = pd.read_parquet(path + "pvalue_matrix_" + SAMPLE_RESOLUTION_RULE + ".parquet")
~/miniconda3/envs/test/lib/python3.7/site-packages/modin/pandas/io.py in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, **kwargs)
216 storage_options=storage_options,
217 use_nullable_dtypes=use_nullable_dtypes,
--> 218 **kwargs,
219 )
220 )
~/miniconda3/envs/test/lib/python3.7/site-packages/modin/data_management/factories/dispatcher.py in read_parquet(cls, **kwargs)
165 @_inherit_docstrings(factories.BaseFactory._read_parquet)
166 def read_parquet(cls, **kwargs):
--> 167 return cls.__factory._read_parquet(**kwargs)
168
169 @classmethod
~/miniconda3/envs/test/lib/python3.7/site-packages/modin/data_management/factories/factories.py in _read_parquet(cls, **kwargs)
194 )
195 def _read_parquet(cls, **kwargs):
--> 196 return cls.io_cls.read_parquet(**kwargs)
197
198 @classmethod
~/miniconda3/envs/test/lib/python3.7/site-packages/modin/engines/base/io/file_dispatcher.py in read(cls, *args, **kwargs)
65 postprocessing work on the resulting query_compiler object.
66 """
---> 67 query_compiler = cls._read(*args, **kwargs)
68 # TODO (devin-petersohn): Make this section more general for non-pandas kernel
69 # implementations.
~/miniconda3/envs/test/lib/python3.7/site-packages/modin/engines/base/io/column_stores/parquet_dispatcher.py in _read(cls, path, engine, columns, **kwargs)
134 column_names = [c for c in column_names if c not in index_columns]
135 columns = [name for name in column_names if not PQ_INDEX_REGEX.match(name)]
--> 136 return cls.build_query_compiler(path, columns, **kwargs)
~/miniconda3/envs/test/lib/python3.7/site-packages/modin/engines/base/io/column_stores/column_store_dispatcher.py in build_query_compiler(cls, path, columns, **kwargs)
218 index, row_lens = cls.build_index(partition_ids)
219 remote_parts = cls.build_partition(partition_ids[:-2], row_lens, column_widths)
--> 220 dtypes = cls.build_dtypes(partition_ids[-1], columns)
221 new_query_compiler = cls.query_compiler_cls(
222 cls.frame_cls(
~/miniconda3/envs/test/lib/python3.7/site-packages/modin/engines/base/io/column_stores/column_store_dispatcher.py in build_dtypes(cls, partition_ids, columns)
191 Series with dtypes for columns.
192 """
--> 193 dtypes = pandas.concat(cls.materialize(list(partition_ids)), axis=0)
194 dtypes.index = columns
195 return dtypes
~/miniconda3/envs/test/lib/python3.7/site-packages/modin/engines/ray/task_wrapper.py in materialize(cls, obj_id)
80 Whatever was identified by `obj_id`.
81 """
---> 82 return ray.get(obj_id)
~/miniconda3/envs/test/lib/python3.7/site-packages/ray/_private/client_mode_hook.py in wrapper(*args, **kwargs)
80 if client_mode_should_convert():
81 return getattr(ray, func.__name__)(*args, **kwargs)
---> 82 return func(*args, **kwargs)
83
84 return wrapper
~/miniconda3/envs/test/lib/python3.7/site-packages/ray/worker.py in get(object_refs, timeout)
1619 worker.core_worker.dump_object_store_memory_usage()
1620 if isinstance(value, RayTaskError):
-> 1621 raise value.as_instanceof_cause()
1622 else:
1623 raise value
RayTaskError(FileNotFoundError): ray::deploy_ray_func() (pid=1831567, ip=192.168.0.101)
File "/home/test/miniconda3/envs/test/lib/python3.7/site-packages/modin/engines/ray/task_wrapper.py", line 40, in deploy_ray_func
return func(**args)
File "/home/test/miniconda3/envs/test/lib/python3.7/site-packages/modin/backends/pandas/parsers.py", line 595, in parse
df = pandas.read_parquet(fname, **kwargs)
File "/home/test/miniconda3/envs/test/lib/python3.7/site-packages/pandas/io/parquet.py", line 500, in read_parquet
**kwargs,
File "/home/test/miniconda3/envs/test/lib/python3.7/site-packages/pandas/io/parquet.py", line 236, in read
mode="rb",
File "/home/test/miniconda3/envs/test/lib/python3.7/site-packages/pandas/io/parquet.py", line 102, in _get_path_or_handle
path_or_handle, mode, is_text=False, storage_options=storage_options
File "/home/test/miniconda3/envs/test/lib/python3.7/site-packages/pandas/io/common.py", line 710, in get_handle
handle = open(handle, ioargs.mode)
FileNotFoundError: [Errno 2] No such file or directory: parquet_df.parquet
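The last frame shows the failure is nothing Modin-specific: a worker process reaches plain builtin open() with a path that exists only on the driver's machine. A minimal stand-in for that frame (the path below is a placeholder, not the reporter's actual file):

```python
# pandas' get_handle ultimately calls builtin open(); on a node where the
# path is absent, this raises exactly the error seen in the traceback.
try:
    open("/nonexistent/parquet_df.parquet", mode="rb")
except FileNotFoundError as exc:
    print(exc.errno)  # 2, i.e. "[Errno 2] No such file or directory"
```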
Issue Analytics
- State:
- Created 2 years ago
- Comments: 7 (5 by maintainers)
Agree, I am going to reopen this to track that task if you don’t mind.
Duplicate of #4479