read_parquet fails in Modin 0.4.0
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
- Modin installed from (source or binary): Binary
- Modin version: 0.4.0
- Python version: 3.6.5
- Exact command to reproduce: df = pd.read_parquet('df.parquet.gzip')
Describe the problem
As shown below, reading with read_parquet fails, and I have verified that read_parquet works in the supported version of Pandas (0.24.1).
Source code / logs
In [1]: import modin.pandas as pd
Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-15_15-54-26_15037/logs.
Waiting for redis server at 127.0.0.1:62691 to respond...
Waiting for redis server at 127.0.0.1:16219 to respond...
Starting Redis shard with 10.0 GB max memory.
WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 8289193984 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
Starting the Plasma object store with 10.0 GB memory using /tmp.
In [2]: df = pd.read_parquet('df.parquet.gzip')
---------------------------------------------------------------------------
RayTaskError Traceback (most recent call last)
<ipython-input-2-3d2fa0958ff8> in <module>()
----> 1 df = pd.read_parquet('df.parquet.gzip')
~/miniconda3/envs/modinp36/lib/python3.6/site-packages/modin/pandas/io.py in read_parquet(path, engine, columns, **kwargs)
26 return DataFrame(
27 query_compiler=BaseFactory.read_parquet(
---> 28 path=path, columns=columns, engine=engine, **kwargs
29 )
30 )
~/miniconda3/envs/modinp36/lib/python3.6/site-packages/modin/data_management/factories.py in read_parquet(cls, **kwargs)
45 @classmethod
46 def read_parquet(cls, **kwargs):
---> 47 return cls._determine_engine()._read_parquet(**kwargs)
48
49 @classmethod
~/miniconda3/envs/modinp36/lib/python3.6/site-packages/modin/data_management/factories.py in _read_parquet(cls, **kwargs)
49 @classmethod
50 def _read_parquet(cls, **kwargs):
---> 51 return cls.io_cls.read_parquet(**kwargs)
52
53 @classmethod
~/miniconda3/envs/modinp36/lib/python3.6/site-packages/modin/engines/ray/pandas_on_ray/io.py in read_parquet(cls, path, engine, columns, **kwargs)
85 ]
86 )
---> 87 index_len = ray.get(blk_partitions[-1][0])
88 index = pandas.RangeIndex(index_len)
89 new_query_compiler = PandasQueryCompiler(
~/miniconda3/envs/modinp36/lib/python3.6/site-packages/ray/worker.py in get(object_ids, worker)
2209 # here.
2210 last_task_error_raise_time = time.time()
-> 2211 raise value
2212 return value
2213
RayTaskError: ray_worker:modin.engines.ray.pandas_on_ray.io._read_parquet_columns() (pid=15254, host=myhostname)
File "/home/username/miniconda3/envs/modinp36/lib/python3.6/site-packages/modin/engines/ray/pandas_on_ray/io.py", line 702, in _read_parquet_columns
df = pq.read_pandas(path, columns=columns, **kwargs).to_pandas()
File "/home/username/miniconda3/envs/modinp36/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/parquet.py", line 1128, in read_pandas
use_pandas_metadata=True)
File "/home/username/miniconda3/envs/modinp36/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/parquet.py", line 1107, in read_table
use_pandas_metadata=use_pandas_metadata)
File "/home/username/miniconda3/envs/modinp36/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/filesystem.py", line 181, in read_parquet
use_pandas_metadata=use_pandas_metadata)
File "/home/username/miniconda3/envs/modinp36/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/parquet.py", line 973, in read
use_pandas_metadata=use_pandas_metadata)
File "/home/username/miniconda3/envs/modinp36/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/parquet.py", line 529, in read
table = reader.read(**options)
File "/home/username/miniconda3/envs/modinp36/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/parquet.py", line 210, in read
columns, use_pandas_metadata=use_pandas_metadata)
File "/home/username/miniconda3/envs/modinp36/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/parquet.py", line 259, in _get_column_indices
indices += map(self.reader.column_name_idx, index_columns)
File "pyarrow/_parquet.pyx", line 771, in pyarrow._parquet.ParquetReader.column_name_idx
TypeError: unhashable type: 'dict'
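The final frames show `column_name_idx` being mapped over `index_columns` and receiving a dict. One possible reading, which is an assumption and is not confirmed anywhere in this thread, is that the file's pandas metadata describes an index with a dict entry rather than a plain column name, and that the pyarrow copy under `ray/pyarrow_files` only handles strings there. A minimal sketch for checking that, using a hypothetical helper and a hand-written metadata sample:

```python
import json


def non_string_index_columns(pandas_metadata_json):
    """Return the entries of "index_columns" that are not plain strings.

    `pandas_metadata_json` is the JSON text stored under the b"pandas" key of
    a Parquet file's key-value metadata. This is a hypothetical helper, not
    part of Modin or pyarrow.
    """
    meta = json.loads(pandas_metadata_json)
    return [c for c in meta.get("index_columns", []) if not isinstance(c, str)]


# Hand-written sample for illustration only; not taken from the reporter's file.
sample = json.dumps({
    "index_columns": [
        {"kind": "range", "name": None, "start": 0, "stop": 3, "step": 1}
    ],
    "columns": [{"name": "a", "pandas_type": "int64"}],
})

suspicious = non_string_index_columns(sample)
print(suspicious)
```

With a real file, the JSON text could probably be obtained from `pyarrow.parquet.read_metadata(path).metadata[b"pandas"]` after decoding the bytes. A non-empty result would be consistent with the `unhashable type: 'dict'` error, but would not prove the cause.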
Issue Analytics
- State:
- Created 4 years ago
- Comments:6 (4 by maintainers)
Top GitHub Comments
Hi @Shellcat-Zero, thanks that helps! I am still not able to reproduce the error, either on master or on Modin 0.4, but this does narrow down the issue.
If you change the pandas part to:
Does that work if read into Modin?
What is your local pyarrow version? (Make sure you run in a new interpreter)
Closing this. Feel free to reopen if the discussion should continue or if issue was not resolved.
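For the pyarrow version question, it may help to report both the version and the file path of the module the interpreter actually imports, since the traceback loads pyarrow from `ray/pyarrow_files` rather than from a standalone install. A small sketch with a hypothetical helper, `describe_module`, meant to be run in a fresh interpreter:

```python
import importlib


def describe_module(name):
    """Return (version, file path) for an importable module, or None if missing."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    # Not every module defines __version__ or __file__, so fall back to None.
    return getattr(module, "__version__", None), getattr(module, "__file__", None)


# Report on the packages that appear in the traceback.
for name in ("pyarrow", "pandas", "modin", "ray"):
    print(name, describe_module(name))
```

If the path printed for pyarrow differs between a plain interpreter and one where Modin has already been imported, that would suggest two different pyarrow copies are in play.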