
ENH: support `storage_options` argument in `read_parquet`


Is your feature request related to a problem?

I store lots of data in a quilt bucket (i.e. S3 storage) and use s3fs with geopandas to read data directly from the wire, like

gpd.read_parquet("s3://spatial-ucr/census/administrative/counties.parquet")

Often that works perfectly, but depending on the botocore/s3fs/aiobotocore/fsspec version combination, it can throw botocore.exceptions.NoCredentialsError: Unable to locate credentials.

Describe the solution you’d like

The pandas version of read_parquet supports passing storage_options={"anon": True}, which I believe would get around that particular error, but in geopandas that argument fails with TypeError: read_table() got an unexpected keyword argument 'storage_options'. It would be great if gpd.read_parquet allowed me to pass that argument as well.
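
For reference, a minimal sketch of the contrast (assuming the public bucket above allows anonymous access):

import pandas as pd
import geopandas as gpd

path = "s3://spatial-ucr/census/administrative/counties.parquet"

# works: pandas forwards storage_options to fsspec for anonymous S3 access
df = pd.read_parquet(path, storage_options={"anon": True})

# fails at the time of this issue: geopandas forwards the keyword to
# pyarrow's read_table(), which does not accept it
gdf = gpd.read_parquet(path, storage_options={"anon": True})
# TypeError: read_table() got an unexpected keyword argument 'storage_options'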

API breaking implications

None

Describe alternatives you’ve considered

I could probably read the file directly with pandas and then convert the serialized geometry column myself (a rough sketch of this is below), but that would bypass the nice, efficient implementation already in the geopandas version of read_parquet 😃
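
A rough sketch of that alternative, assuming the geometry column is WKB-encoded under the name "geometry" (the geoparquet convention) and that anonymous access works; note the CRS metadata would be lost:

import pandas as pd
import geopandas as gpd

# read the raw parquet with pandas, which already supports storage_options
df = pd.read_parquet(
    "s3://spatial-ucr/census/administrative/counties.parquet",
    storage_options={"anon": True},
)

# manually decode the WKB-serialized geometry column (CRS metadata is lost)
gdf = gpd.GeoDataFrame(df, geometry=gpd.GeoSeries.from_wkb(df["geometry"]))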

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

1 reaction
jorisvandenbossche commented, Aug 27, 2021

And on the original topic: I think it’s a good idea to add support for the storage_options keyword (there are other aspects you might want to tweak, like the region, the endpoint, etc.).

Although it’s in theory superfluous with passing an actual filesystem object (you can create an s3fs filesystem with those same storage_options, and pyarrow will accept an s3fs filesystem as well), it gives consistency with pandas and dask (and dask-geopandas).
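
To make that equivalence concrete, a minimal sketch (assuming anonymous access to the example bucket) of expressing the same options as an explicit filesystem:

import s3fs
import pyarrow.parquet as pq

# the storage_options dict maps onto the fsspec filesystem constructor,
# so {"anon": True} and an explicit anonymous filesystem are equivalent
fs = s3fs.S3FileSystem(anon=True)
table = pq.read_table(
    "spatial-ucr/census/administrative/counties.parquet", filesystem=fs
)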

Implementation-wise, I think we can do something like:

import fsspec

if storage_options is not None:
    if filesystem is not None:
        raise ValueError("cannot provide both 'filesystem' and 'storage_options'")
    # get_fs_token_paths returns (filesystem, token, list of expanded paths)
    filesystem, _, paths = fsspec.get_fs_token_paths(
        path, storage_options=storage_options
    )
    path = paths[0]

0 reactions
jorisvandenbossche commented, Aug 27, 2021

[on passing a filesystem object as a work-around] However, in my env this doesn’t work; it fails with AWS Error [code 100]: No response body, while reading directly with no filesystem specification works.

One guess: it might be that if you pass an explicit filesystem object, you need to leave out the s3:// from the file path.
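
A minimal sketch of that guess, assuming an anonymous s3fs filesystem and opening the file object directly:

import s3fs
import geopandas as gpd

fs = s3fs.S3FileSystem(anon=True)
# note the bare "bucket/key" path: no "s3://" scheme once fs is explicit
with fs.open("spatial-ucr/census/administrative/counties.parquet", "rb") as f:
    gdf = gpd.read_parquet(f)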
