
ENH: support `storage_options` argument in `read_parquet`


Is your feature request related to a problem?

I store lots of data in a quilt bucket (i.e. S3 storage) and use s3fs with geopandas to read data directly from the wire, like

gpd.read_parquet("s3://spatial-ucr/census/administrative/counties.parquet")

Often that works perfectly, but depending on the botocore/s3fs/aiobotocore/fsspec version combination, it can throw botocore.exceptions.NoCredentialsError: Unable to locate credentials.

Describe the solution you’d like

The pandas version of read_parquet supports passing storage_options={"anon": True}, which I believe would get around that particular error, but in geopandas that argument fails with TypeError: read_table() got an unexpected keyword argument 'storage_options'. It would be great if gpd.read_parquet allowed me to pass that argument as well.
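
For reference, a minimal sketch of the contrast (assuming the public bucket above allows anonymous access):

import pandas as pd
import geopandas as gpd

path = "s3://spatial-ucr/census/administrative/counties.parquet"

# works: pandas forwards storage_options to fsspec for anonymous S3 access
df = pd.read_parquet(path, storage_options={"anon": True})

# fails at the time of this issue: geopandas forwards the keyword to
# pyarrow's read_table(), which does not accept it
gdf = gpd.read_parquet(path, storage_options={"anon": True})
# TypeError: read_table() got an unexpected keyword argument 'storage_options'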

API breaking implications

None

Describe alternatives you’ve considered

I could probably read the file directly with pandas and then convert the serialized geometry column myself (a rough sketch of this is below), but that would bypass the nice, efficient implementation already in the geopandas version of read_parquet 😃
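
A rough sketch of that alternative, assuming the geometry column is WKB-encoded under the name "geometry" (the geoparquet convention) and that anonymous access works; note the CRS metadata would be lost:

import pandas as pd
import geopandas as gpd

# read the raw parquet with pandas, which already supports storage_options
df = pd.read_parquet(
    "s3://spatial-ucr/census/administrative/counties.parquet",
    storage_options={"anon": True},
)

# manually decode the WKB-serialized geometry column (CRS metadata is lost)
gdf = gpd.GeoDataFrame(df, geometry=gpd.GeoSeries.from_wkb(df["geometry"]))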

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

1 reaction
jorisvandenbossche commented, Aug 27, 2021

And on the original topic: I think it’s a good idea to add support for the storage_options keyword (there are other aspects you might want to tweak, like the region, the endpoint, etc.).

Although it’s in theory superfluous with passing an actual filesystem object (you can create an s3fs filesystem with those same storage_options, and pyarrow will accept an s3fs filesystem as well), it gives consistency with pandas and dask (and dask-geopandas).
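
To make that equivalence concrete, a minimal sketch (assuming anonymous access to the example bucket) of expressing the same options as an explicit filesystem:

import s3fs
import pyarrow.parquet as pq

# the storage_options dict maps onto the fsspec filesystem constructor,
# so {"anon": True} and an explicit anonymous filesystem are equivalent
fs = s3fs.S3FileSystem(anon=True)
table = pq.read_table(
    "spatial-ucr/census/administrative/counties.parquet", filesystem=fs
)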

Implementation-wise, I think we can do something like:

import fsspec

if storage_options is not None:
    if filesystem is not None:
        raise ValueError("cannot provide both 'filesystem' and 'storage_options'")
    # get_fs_token_paths returns (filesystem, token, list of expanded paths)
    filesystem, _, paths = fsspec.get_fs_token_paths(
        path, storage_options=storage_options
    )
    path = paths[0]

0 reactions
jorisvandenbossche commented, Aug 27, 2021

[on passing a filesystem object as a work-around] However, in my env this doesn’t work; it fails with AWS Error [code 100]: No response body, while reading directly with no filesystem specification works.

One guess: it might be that if you pass an explicit filesystem object, you need to leave out the s3:// from the file path.
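
A minimal sketch of that guess, assuming an anonymous s3fs filesystem and opening the file object directly:

import s3fs
import geopandas as gpd

fs = s3fs.S3FileSystem(anon=True)
# note the bare "bucket/key" path: no "s3://" scheme once fs is explicit
with fs.open("spatial-ucr/census/administrative/counties.parquet", "rb") as f:
    gdf = gpd.read_parquet(f)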
