read_parquet is not supported for partitioned parquet
Split from #626. read_parquet is not supported for a partitioned data set.
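A minimal reproduction sketch (hypothetical paths and column names; assumes a hive-style partitioned dataset written with pyarrow) that triggers the failure on modin==0.5.0:
import pandas
import pyarrow as pa
import pyarrow.parquet as pq
import modin.pandas as pd

# Write a small hive-style partitioned dataset, producing e.g.
#   /tmp/dataset/year=2018/<part>.parquet
#   /tmp/dataset/year=2019/<part>.parquet
table = pa.Table.from_pandas(
    pandas.DataFrame({"year": [2018, 2018, 2019], "value": [1.0, 2.0, 3.0]})
)
pq.write_to_dataset(table, root_path="/tmp/dataset", partition_cols=["year"])

# Reading the partition directory back with Modin raises the
# PicklingError shown in the traceback below.
df = pd.read_parquet("/tmp/dataset")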
System information
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.2 LTS"
$ conda --version
conda 4.6.14
$ python --version
Python 3.7.3
$ pip --version
pip 19.1 from /home/dlweber/miniconda3/envs/gis-dataprocessing/lib/python3.7/site-packages/pip (python 3.7)
$ pip freeze | grep modin
modin==0.5.0
$ pip freeze | grep pandas
pandas==0.24.2
$ pip freeze | grep numpy
numpy==1.16.3
Miniconda3 was used to install most of the SciPy stack, with a pip clause to add Modin, e.g.
# environment.yaml
channels:
- conda-forge
- defaults
dependencies:
- python>=3.7
- affine
- configobj
- dask
- numpy
- pandas
- pyarrow
- rasterio
- s3fs
- scikit-learn
- scipy
- shapely
- xarray
- pip
- pip:
- modin
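The environment can then be created with (assuming the file above is saved as environment.yaml):
$ conda env create -f environment.yaml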
Describe the problem
https://modin.readthedocs.io/en/latest/pandas_supported.html says read_parquet is supported, but apparently not for partitioned data.
Error
File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/modin/backends/pandas/query_compiler.py", line 871, in _full_reduce
mapped_parts = self.data.map_across_blocks(map_func)
File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/modin/engines/base/frame/partition_manager.py", line 209, in map_across_blocks
preprocessed_map_func = self.preprocess_func(map_func)
File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/modin/engines/base/frame/partition_manager.py", line 100, in preprocess_func
return self._partition_class.preprocess_func(map_func)
File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/modin/engines/ray/pandas_on_ray/frame/partition.py", line 108, in preprocess_func
return ray.put(func)
File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/ray/worker.py", line 2216, in put
worker.put_object(object_id, value)
File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/ray/worker.py", line 375, in put_object
self.store_and_register(object_id, value)
File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/ray/worker.py", line 309, in store_and_register
self.task_driver_id))
File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/ray/utils.py", line 475, in _wrapper
return orig_attr(*args, **kwargs)
File "pyarrow/_plasma.pyx", line 496, in pyarrow._plasma.PlasmaClient.put
File "pyarrow/serialization.pxi", line 355, in pyarrow.lib.serialize
File "pyarrow/serialization.pxi", line 150, in pyarrow.lib.SerializationContext._serialize_callback
File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle.py", line 952, in dumps
cp.dump(obj)
File "/home/joe/miniconda3/envs/project/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle.py", line 271, in dump
raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not pickle object as excessively deep recursion required.
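Until a fix lands, one possible workaround (a sketch only, not confirmed in the issue, using the hypothetical /tmp/dataset path from the sketch above) is to read the partitioned dataset with pyarrow directly and hand the resulting pandas DataFrame to Modin:
import pyarrow.parquet as pq
import modin.pandas as pd

# pyarrow's read_table accepts a partition directory (it uses
# ParquetDataset under the hood), so load everything into pandas first...
pandas_df = pq.read_table("/tmp/dataset").to_pandas()

# ...then build a Modin DataFrame from the pandas one; Modin's
# DataFrame constructor accepts pandas data.
df = pd.DataFrame(pandas_df)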
Top GitHub Comments
@darrenleeweber I’m not too sure where the pickling error is coming from but #632 should fix both errors that you were running into
Resolved by #632
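Since the fix from #632 landed after modin==0.5.0, upgrading Modin and retrying should pick it up (a sketch; the exact fixed release is not stated in the issue, and /tmp/dataset is the hypothetical path from above):
$ pip install --upgrade modin
$ python -c "import modin.pandas as pd; print(pd.read_parquet('/tmp/dataset'))"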