ENH: Use fsspec for reading/writing from/to S3, GCS, Azure Blob, etc.
Is your feature request related to a problem?
Currently pandas has some support for S3 and GCS using the pandas.io.{gcs,s3} modules, which are based on s3fs and gcsfs.
It seems like we could easily broaden the support for different filesystems by leveraging the fsspec library (https://pypi.org/project/fsspec/) and its interface implementations (see https://github.com/intake/filesystem_spec/blob/master/fsspec/registry.py for some examples) to read/write files in pandas.
This way, we would also be able to use filesystems such as Azure-based storage systems directly from pandas.
Describe the solution you’d like
I’d like to be able to use the different file systems supported by fsspec in pandas with something like:
import pandas as pd
df1 = pd.read_csv("abfs://my_container/my_file.csv")
df1.to_json("file:///some/local/path.json") # Also works without file:// prefix.
df2 = pd.read_csv("s3://my_bucket/my_file.csv")
...
API breaking implications
In principle, it looks as if we could cover most of the work by adapting get_filepath_or_buffer in pandas/io/common.py to use fsspec. We would of course have to test that switching to fsspec doesn’t break anything compared to the current implementations.
One challenge is that some storage systems require extra arguments (called storage options in fsspec). For example, Azure blob requires the user to pass two storage options (account_name and account_key) to be able to access the storage. We would need to consider how to pass these options to the correct methods, either by (a) setting these options globally for a given type of storage or (b) passing the options through the pd.read_* functions and pd.DataFrame.to_* methods.
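The two options are not mutually exclusive. The sketch below shows one way they could be combined, using only the standard library. The registry and both function names are hypothetical and exist in neither pandas nor fsspec, and the credential values are placeholders.

```python
from urllib.parse import urlsplit

# Option (a): defaults registered globally, keyed by URL protocol.
# This registry is a hypothetical illustration.
_DEFAULT_STORAGE_OPTIONS = {}


def set_storage_options(protocol, **options):
    """Register default storage options for one protocol (e.g. "abfs")."""
    _DEFAULT_STORAGE_OPTIONS[protocol] = dict(options)


def resolve_storage_options(url, storage_options=None):
    """Merge the global defaults for the URL's protocol with per-call options."""
    # Paths without a scheme are treated as local files.
    protocol = urlsplit(url).scheme or "file"
    merged = dict(_DEFAULT_STORAGE_OPTIONS.get(protocol, {}))
    # Option (b): options passed per call take precedence over the defaults.
    merged.update(storage_options or {})
    return merged


set_storage_options("abfs", account_name="my_account", account_key="default_key")

opts = resolve_storage_options(
    "abfs://my_container/my_file.csv", {"account_key": "override_key"}
)
local = resolve_storage_options("/some/local/path.json")
```

With this design, a pd.read_* call would only need an optional storage_options argument, and users who always access the same account could set the defaults once.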
Describe alternatives you’ve considered
This seems like a change that would have a small impact, as pandas already uses s3fs and gcsfs, which are both implementations of the broader fsspec interface. It should also provide support for a great number of different filesystems with minimal changes, compared to adding support for each filesystem ourselves. As such, I would think it is the preferred approach.
Another approach would be to add support for each additional filesystem as it comes along. This however would require adding and maintaining code for each filesystem, which seems less preferable.
Additional context
I recently implemented similar wrappers for pandas code at one of our clients, and am therefore somewhat familiar with fsspec etc. I would be happy to see if we can contribute these ideas + code to the pandas project.
Issue Analytics
- Created 3 years ago
- Comments: 26 (19 by maintainers)
Top GitHub Comments
You wrote test, not text 😃
So using fsspec.open with a text mode does this too.
I meant like this, which does seem to successfully clean up the file-like and the OpenFile, so long as the file-like has close() called. You would need to use a weakref (to f.close?) to break the reference cycle to clean up when f is garbage collected instead.