
read_parquet is not supported for partitioned parquet


Split from #626

read_parquet does not work for a partitioned data set.

System information

$ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.2 LTS"
$ conda --version
conda 4.6.14
$ python --version
Python 3.7.3
$ pip --version
pip 19.1 from /home/dlweber/miniconda3/envs/gis-dataprocessing/lib/python3.7/site-packages/pip (python 3.7)

$ pip freeze | grep modin
modin==0.5.0
$ pip freeze | grep pandas
pandas==0.24.2
$ pip freeze | grep numpy
numpy==1.16.3

miniconda3 was used to install most of the SciPy stack, with a pip clause to add modin, e.g.

# environment.yaml
channels:
  - conda-forge
  - defaults

dependencies:
  - python>=3.7
  - affine
  - configobj
  - dask
  - numpy
  - pandas
  - pyarrow
  - rasterio
  - s3fs
  - scikit-learn
  - scipy
  - shapely
  - xarray
  - pip
  - pip:
    - modin
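
With this file, conda env create -f environment.yaml builds the environment, and the final pip section pulls in modin on top of the conda packages.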

Describe the problem

https://modin.readthedocs.io/en/latest/pandas_supported.html lists read_parquet as supported, but it apparently does not work for partitioned data.
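
The original snippet isn't included here, but a minimal sketch of the kind of call that fails looks like this (the path, column names, and values are hypothetical):

# Hypothetical reproduction: write a partitioned Parquet data set with
# plain pandas/pyarrow, then try to read it back with modin.
import pandas as pd
import modin.pandas as mpd

df = pd.DataFrame({"region": ["us", "us", "eu"], "value": [1, 2, 3]})
df.to_parquet("dataset/", engine="pyarrow", partition_cols=["region"])

# Reading the partitioned directory raises the PicklingError shown below.
mdf = mpd.read_parquet("dataset/")
print(mdf.head())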

error

  File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/modin/backends/pandas/query_compiler.py", line 871, in _full_reduce
    mapped_parts = self.data.map_across_blocks(map_func)
  File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/modin/engines/base/frame/partition_manager.py", line 209, in map_across_blocks
    preprocessed_map_func = self.preprocess_func(map_func)
  File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/modin/engines/base/frame/partition_manager.py", line 100, in preprocess_func
    return self._partition_class.preprocess_func(map_func)
  File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/modin/engines/ray/pandas_on_ray/frame/partition.py", line 108, in preprocess_func
    return ray.put(func)
  File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/ray/worker.py", line 2216, in put
    worker.put_object(object_id, value)
  File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/ray/worker.py", line 375, in put_object
    self.store_and_register(object_id, value)
  File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/ray/worker.py", line 309, in store_and_register
    self.task_driver_id))
  File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/ray/utils.py", line 475, in _wrapper
    return orig_attr(*args, **kwargs)
  File "pyarrow/_plasma.pyx", line 496, in pyarrow._plasma.PlasmaClient.put
  File "pyarrow/serialization.pxi", line 355, in pyarrow.lib.serialize
  File "pyarrow/serialization.pxi", line 150, in pyarrow.lib.SerializationContext._serialize_callback
  File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle.py", line 952, in dumps
    cp.dump(obj)
  File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle.py", line 271, in dump
    raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not pickle object as excessively deep recursion required.
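
Note that the traceback never reaches the Parquet files themselves: ray.put fails while handing the map function to cloudpickle, which gives up because serializing it would require excessively deep recursion.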

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
williamma12 commented, May 23, 2019

@darrenleeweber I'm not too sure where the pickling error is coming from, but #632 should fix both errors that you were running into.

0 reactions
devin-petersohn commented, May 28, 2019

Resolved by #632
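
For anyone pinned to modin 0.5.0, where the fix is not yet released, a possible workaround (my suggestion, not one from the thread) is to read the partitioned data set with pyarrow directly and hand the result to modin:

# Hypothetical workaround: bypass modin's reader, load the partition
# directory with pyarrow, then distribute the pandas frame via modin.
import pyarrow.parquet as pq
import modin.pandas as mpd

table = pq.ParquetDataset("dataset/").read()  # pyarrow discovers partition dirs
mdf = mpd.DataFrame(table.to_pandas())        # modin distributes the pandas frame
print(mdf.head())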

