
Compatibility with the new Arrow FileSystem implementations


Background

@martindurant as you know, last year we started developing new FileSystem implementations in the Apache Arrow project (https://issues.apache.org/jira/browse/ARROW-767; https://github.com/apache/arrow/pull/4225 is the PR with the initial abstract API, on which you gave feedback). Those developments have implications for users of fsspec-compatible filesystems, so as promised (with some delay) I am opening an issue here to discuss how to handle them (and since fsspec currently holds the pyarrow-compatibility layer, opening an issue here seems appropriate).

To summarize:

  • The “old” filesystems are available under pyarrow.filesystem (docs). We basically only have a LocalFileSystem and pa.hdfs.HadoopFileSystem as concrete implementations. In addition, there is the DaskFileSystem, which fsspec uses as a base class; more on that below.
  • The “new” filesystems are available in the pyarrow.fs submodule (docs). Those are python wrappers for the C++ implementations, and currently there are already concrete implementations for local, Hadoop and S3.

So an important difference is that the new filesystems are actual implementations in C++, and pyarrow.fs only provides wrappers for those. This is done for good reasons: those filesystems are a shared implementation used by many different users of the Arrow project (from C, C++, Python, R, Ruby, …). Further, those filesystems are for example used in the Arrow Datasets project, which enables a bunch of new features in ParquetDataset reading (and also makes it possible to actually query a Parquet dataset from R). Those new filesystems have been an important part of moving the Arrow project forward.

But this also means that the filesystem that pyarrow functions expect is no longer an “interface” you can implement, but it actually needs a filesystem that wraps a C++ filesystem. (to be clear: all functionality that already existed before is right now still accepting the old filesystems, only the new pyarrow.dataset module already requires the new filesystems. But long term, we want to move consistently to the new filesystems).

Concretely, this means that fsspec’s feature of automatically providing compatibility with pyarrow will no longer work in the future:

if installed, all file-system classes also subclass from pyarrow.filesystem.FileSystem, so can work with any arrow function expecting such an instance

This current compatibility means that eg pyarrow’s parquet.read_table/ParquetDataset work with any fsspec filesystem.


Concrete issues

Ideally, we want to keep compatibility for the existing user base that is using fsspec-based filesystems with pyarrow functionality, while at the same time moving completely to our new filesystem implementation internally in pyarrow. To achieve this, I currently see two (not necessarily mutually exclusive, to be clear) options:

  • Implement a “conversion” for all important fsspec-based filesystems to a pyarrow.fs filesystem object (eg convert an s3fs.S3FileSystem instance with all its configuration into an equivalent pyarrow.fs.S3FileSystem).
    • I suppose this is what we will do for a LocalFileSystem. But for other filesystems, I don’t know how faithful such conversions can always be (eg this might be tricky for things like S3? Can an S3 filesystem be fully encoded/round-tripped in a URI?)
    • This option of course has the pre-condition that we actually support the filesystem in question in pyarrow (which is currently limited, although we plan to expand this).
  • Implement a “pyarrow.fs wrapper for fsspec”, a C++ FileSystem that calls back into a python object for each of its methods (where this python object then could be any fsspec-compatible filesystem).
    • Such a “PythonCallBackFilesystem” would allow that pyarrow can actually use the fsspec-based filesystems without converting them. It would provide easy compatibility (easy for the user, to be clear 😉), at the cost of performance (compared to pyarrow’s native filesystems)
    • We could wrap incoming fsspec-filesystems in pyarrow, or fsspec could use such a class as baseclass when pyarrow is installed similarly as is done now.
    • I didn’t yet investigate the feasibility of this option, but opened ARROW-8766 for it.

There is actually also a third option: some concrete fsspec implementations could start to use one of the new pyarrow filesystems as their base, and then they would also be directly usable in pyarrow (that’s not my call to make, but up to those individual projects, to be clear). For HDFS in fsspec that’s probably what we want to do, though, since its implementation already depends on pyarrow.

As mentioned above, those options are not necessarily mutually exclusive. It might depend on the specific filesystem which option is desirable / possible (and the second option could also be a fallback for the first option if pyarrow doesn’t support the specific file system).

Thoughts on this? About the feasibility of the specific options? Other options?


Note: the above is about the use case of “users can pass an fsspec filesystem to pyarrow”. There is also the use case in the other direction, “using pyarrow filesystems where fsspec-compliant filesystems are expected”. For this, an fsspec-compliant wrapper around a pyarrow.fs filesystem is probably useful (and I suppose this is something that could live in pyarrow); see https://issues.apache.org/jira/browse/ARROW-7102. Such a wrapper could also provide a richer API for those users who want that with the pyarrow filesystems (since the current pyarrow.fs filesystems are rather bare-bones).

cc @martindurant @TomAugspurger @pitrou

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 28 (17 by maintainers)

Top GitHub Comments

1 reaction
fhoering commented, Sep 11, 2020

incomplete for secured connections

We used it with Kerberos authentication on our cluster, if you mean that. Worked pretty well. The only thing that didn’t work was viewfs, so we implemented our own on top.

But anyway, never mind the comment. I was a user of hdfs3 and really liked it; then I moved to the old pyarrow fs. It is not that bad, and I see the limitations now that I use it, but imo it was all fixable. The new pyarrow FS in Python is really cumbersome from my point of view. That’s why I would like to see the fsspec wrapper as a public API. I would even implement it, but I don’t have much free time for this at the moment.

0 reactions
jorisvandenbossche commented, Sep 14, 2020

So what is the status of the new implementation, can it be used with/from fsspec?

As already explained multiple times in this thread: the pyarrow.fs filesystems have a different API not compatible with fsspec (so no, it cannot be used directly with/from fsspec), but ideally someone would write a wrapper to wrap a pyarrow.fs filesystem in an fsspec-compatible object (for which I opened https://issues.apache.org/jira/browse/ARROW-8780, but it’s something that could potentially also live in fsspec I think).

I don’t think there’s a problem leaving the warning there, this will only show up for cases that are explicitly calling HDFS via fsspec.

It’s not only about HDFS, I think? Also the other filesystems inherit from pyarrow.filesystem.FileSystem if pyarrow is installed?

hdfs3 used to be the way to do this (indeed, it was the very first of the fsspec-like implementations). However libhdfs3 (the C++ library it depends on) has proven difficult to maintain, and incomplete for secured connections, which I was not able to solve.

And libhdfs3 is also no longer being maintained, AFAIK. Also pyarrow dropped the optional libhdfs3 driver support and the new pyarrow.fs.HadoopFileSystem only supports the JNI driver.


