
read_parquet fails in Modin 0.4.0


System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • Modin installed from (source or binary): Binary
  • Modin version: 0.4.0
  • Python version: 3.6.5
  • Exact command to reproduce: df = pd.read_parquet('df.parquet.gzip')

Describe the problem

As shown below, read_parquet fails under Modin, while the same call succeeds in the supported pandas version (0.24.1).

Source code / logs

In [1]: import modin.pandas as pd
Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-15_15-54-26_15037/logs.
Waiting for redis server at 127.0.0.1:62691 to respond...
Waiting for redis server at 127.0.0.1:16219 to respond...
Starting Redis shard with 10.0 GB max memory.
WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 8289193984 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
Starting the Plasma object store with 10.0 GB memory using /tmp.

In [2]: df = pd.read_parquet('df.parquet.gzip')
---------------------------------------------------------------------------
RayTaskError                              Traceback (most recent call last)
<ipython-input-2-3d2fa0958ff8> in <module>()
----> 1 df = pd.read_parquet('df.parquet.gzip')

~/miniconda3/envs/modinp36/lib/python3.6/site-packages/modin/pandas/io.py in read_parquet(path, engine, columns, **kwargs)
     26     return DataFrame(
     27         query_compiler=BaseFactory.read_parquet(
---> 28             path=path, columns=columns, engine=engine, **kwargs
     29         )
     30     )

~/miniconda3/envs/modinp36/lib/python3.6/site-packages/modin/data_management/factories.py in read_parquet(cls, **kwargs)
     45     @classmethod
     46     def read_parquet(cls, **kwargs):
---> 47         return cls._determine_engine()._read_parquet(**kwargs)
     48
     49     @classmethod

~/miniconda3/envs/modinp36/lib/python3.6/site-packages/modin/data_management/factories.py in _read_parquet(cls, **kwargs)
     49     @classmethod
     50     def _read_parquet(cls, **kwargs):
---> 51         return cls.io_cls.read_parquet(**kwargs)
     52
     53     @classmethod

~/miniconda3/envs/modinp36/lib/python3.6/site-packages/modin/engines/ray/pandas_on_ray/io.py in read_parquet(cls, path, engine, columns, **kwargs)
     85             ]
     86         )
---> 87         index_len = ray.get(blk_partitions[-1][0])
     88         index = pandas.RangeIndex(index_len)
     89         new_query_compiler = PandasQueryCompiler(

~/miniconda3/envs/modinp36/lib/python3.6/site-packages/ray/worker.py in get(object_ids, worker)
   2209                 # here.
   2210                 last_task_error_raise_time = time.time()
-> 2211                 raise value
   2212             return value
   2213

RayTaskError: ray_worker:modin.engines.ray.pandas_on_ray.io._read_parquet_columns() (pid=15254, host=myhostname)
  File "/home/username/miniconda3/envs/modinp36/lib/python3.6/site-packages/modin/engines/ray/pandas_on_ray/io.py", line 702, in _read_parquet_columns
    df = pq.read_pandas(path, columns=columns, **kwargs).to_pandas()
  File "/home/username/miniconda3/envs/modinp36/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/parquet.py", line 1128, in read_pandas
    use_pandas_metadata=True)
  File "/home/username/miniconda3/envs/modinp36/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/parquet.py", line 1107, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/home/username/miniconda3/envs/modinp36/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/filesystem.py", line 181, in read_parquet
    use_pandas_metadata=use_pandas_metadata)
  File "/home/username/miniconda3/envs/modinp36/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/parquet.py", line 973, in read
    use_pandas_metadata=use_pandas_metadata)
  File "/home/username/miniconda3/envs/modinp36/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/parquet.py", line 529, in read
    table = reader.read(**options)
  File "/home/username/miniconda3/envs/modinp36/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/parquet.py", line 210, in read
    columns, use_pandas_metadata=use_pandas_metadata)
  File "/home/username/miniconda3/envs/modinp36/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/parquet.py", line 259, in _get_column_indices
    indices += map(self.reader.column_name_idx, index_columns)
  File "pyarrow/_parquet.pyx", line 771, in pyarrow._parquet.ParquetReader.column_name_idx
TypeError: unhashable type: 'dict'

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments:6 (4 by maintainers)

Top GitHub Comments

devin-petersohn commented on Apr 16, 2019 (1 reaction)

Hi @Shellcat-Zero, thanks that helps! I am still not able to reproduce the error, either on master or on Modin 0.4, but this does narrow down the issue.

If you change the pandas part to:

import ray
import pandas as pd

raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 
        'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 
        'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 
        'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df.to_parquet('df.parquet.gzip',compression='gzip')

Does that work if read into Modin?

What is your local pyarrow version? (Make sure you run in a new interpreter)

import pyarrow as pa
print(pa.__version__)
devin-petersohn commented on Jun 1, 2020 (0 reactions)

Closing this. Feel free to reopen if the discussion should continue or if the issue was not resolved.
