
ENH: Use fsspec for reading/writing from/to S3, GCS, Azure Blob, etc.

See original GitHub issue

Is your feature request related to a problem?

Currently, pandas has some support for S3 and GCS through the pandas.io.{gcs,s3} modules, which are based on s3fs and gcsfs.

It seems like we could easily broaden the support for different filesystems by leveraging the fsspec library (https://pypi.org/project/fsspec/) and its interface implementations (see https://github.com/intake/filesystem_spec/blob/master/fsspec/registry.py for some examples) to read/write files in pandas.

This way, we would also be able to use filesystems such as Azure-based storage systems directly from pandas.

Describe the solution you’d like

I’d like to be able to use the different file systems supported by fsspec in pandas with something like:

import pandas as pd

df1 = pd.read_csv("abfs://my_container/my_file.csv")
df1.to_json("file:///some/local/path.json") # Also works without file:// prefix.

df2 = pd.read_csv("s3://my_bucket/my_file.csv")
...

API breaking implications

In principle, it looks as if we could cover most of the work by adapting get_filepath_or_buffer in pandas/io/common.py to use fsspec. We would of course have to test if fsspec doesn’t break anything compared to the current implementations.
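As a rough illustration of the idea, the routing could look something like the following. This is a hypothetical sketch, not the actual pandas code; `get_handle_sketch` and its "`://` means fsspec" dispatch rule are assumptions for illustration only:

```python
def get_handle_sketch(path_or_buf, mode="rb"):
    """Hypothetical stand-in for pandas' get_filepath_or_buffer:
    route any URL carrying an explicit protocol through fsspec,
    and fall back to the builtin open() for plain local paths."""
    if isinstance(path_or_buf, str) and "://" in path_or_buf:
        import fsspec  # deferred import: only needed for fsspec URLs
        return fsspec.open(path_or_buf, mode=mode).open()
    if isinstance(path_or_buf, str):
        return open(path_or_buf, mode)
    return path_or_buf  # already a file-like object; pass it through
```

The deferred import keeps fsspec an optional dependency: purely local workloads never touch it.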

One challenge is that some storage systems require extra arguments (called storage options in fsspec). For example, Azure blob requires the user to pass two storage options (account_name and account_key) to be able to access the storage. We would need to consider how to pass these options to the correct methods, either by (a) setting these options globally for a given type of storage or (b) passing the options through the pd.read_* functions and pd.DataFrame.to_* methods.
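Option (a) might look something like the following. This is a hypothetical sketch; `StorageConfig` is an invented name, not a pandas or fsspec API:

```python
class StorageConfig:
    """Hypothetical global registry of per-protocol storage options
    (option (a) above). Readers/writers would merge these options
    into the fsspec call for the matching protocol."""

    _options = {}

    @classmethod
    def set(cls, protocol, **options):
        cls._options[protocol] = options

    @classmethod
    def get(cls, protocol):
        # Return a copy so callers cannot mutate the registry in place.
        return dict(cls._options.get(protocol, {}))


# Configure Azure Blob credentials once, globally:
StorageConfig.set("abfs", account_name="my_account", account_key="...")
```

Option (b) would instead accept a storage-options dict on each `pd.read_*` call and `DataFrame.to_*` method and forward it to fsspec, trading global convenience for explicitness.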

Describe alternatives you’ve considered

This seems like a change that would have a small impact, as pandas already uses S3fs and gcsfs, which are both implementations of the broader fsspec interface. It should also provide support for a great number of different filesystems with minimal changes, compared to adding support for each filesystem ourselves. As such, I would think it is the preferred approach.

Another approach would be to add support for each additional filesystem as it comes along. This however would require adding and maintaining code for each filesystem, which seems less preferable.

Additional context

I recently implemented similar wrappers for pandas code at one of our clients, and am therefore somewhat familiar with fsspec etc. I would be happy to see if we can contribute these ideas + code to the pandas project.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 26 (19 by maintainers)

Top GitHub Comments

2 reactions
martindurant commented, Apr 10, 2020

You wrote test, not text 😃

So using fsspec.open with a text mode does this too.
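To illustrate the text-mode point, here is a small round trip through fsspec's in-memory filesystem (assuming fsspec is installed; the `memory://demo/hello.txt` path is purely illustrative):

```python
import fsspec

# Write in text mode: fsspec wraps the underlying binary stream
# in a TextIOWrapper, so encoding is handled transparently.
with fsspec.open("memory://demo/hello.txt", mode="wt") as f:
    f.write("hello")

with fsspec.open("memory://demo/hello.txt", mode="rt") as f:
    content = f.read()
```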

0 reactions
martindurant commented, Apr 17, 2020

I meant like this, which does seem to successfully clean up the file-like and the OpenFile, so long as the file-like has close() called. You would need to use a weakref (to f.close?) to break the reference cycle to clean up when f is garbage collected instead.

--- a/fsspec/core.py
+++ b/fsspec/core.py
@@ -72,6 +73,7 @@ class OpenFile(object):
         self.errors = errors
         self.newline = newline
         self.fobjects = []
+        self.ref_close = None

     def __reduce__(self):
         return (
@@ -126,10 +128,16 @@ class OpenFile(object):
         The file should be explicitly closed to avoid enclosed open file
         instances persisting
         """
-        return self.__enter__()
+        f = self.__enter__()
+        self.ref_close = f.close
+        f.close = self.close
+        return f

     def close(self):
         """Close all encapsulated file objects"""
+        if self.ref_close:
+            self.fobjects[-1].close = self.ref_close
+            self.ref_close = None
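The weakref suggestion might be sketched in plain Python as follows. This is a hypothetical illustration of the idea, not fsspec's actual implementation; `FileLike` and `OpenFileSketch` are invented names:

```python
import weakref


class FileLike:
    """Stand-in for the encapsulated file-like object."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class OpenFileSketch:
    """Holds only a weak reference to the original close() method,
    following the comment's weakref suggestion, so the stored
    reference does not itself keep the file-like alive."""

    def __init__(self):
        self.fobjects = []
        self._ref_close = None

    def open(self):
        f = FileLike()
        self.fobjects.append(f)
        # Weak reference to the original bound close method.
        self._ref_close = weakref.WeakMethod(f.close)
        # Redirect close() so closing the file cleans up everything.
        f.close = self.close
        return f

    def close(self):
        if self._ref_close is not None:
            orig = self._ref_close()  # None if f was already collected
            if orig is not None:
                orig()
            self._ref_close = None
        self.fobjects.clear()
```

Calling `close()` on the returned file-like now closes the underlying object and clears the container's bookkeeping in one step, which is the cleanup behavior the diff above is aiming for.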