Inconsistent behavior with fsspec.open with mode='r' vs. mode='rb'
It’s unclear to me how to use fsspec.open as a context manager. The behavior of this function seems inconsistent, and the inconsistency is somehow related to the mode argument. I will try to illustrate this with an example.
Create a test file
Bypass fsspec completely
fname = 'test.txt'
with open(fname, mode='w') as f:
    f.write('hello')
Open the file with a FileSystem instance
When I open the file via an instantiated FileSystem instance, everything works as I expect
import fsspec
from fsspec.implementations.local import LocalFileSystem

fs = LocalFileSystem()
with fs.open(fname, mode='r') as fp:
    print(type(fp))
# -> <class '_io.TextIOWrapper'>
with fs.open(fname, mode='rb') as fp:
    print(type(fp))
# -> <class '_io.BufferedReader'>
The objects yielded to the context manager look like standard open file objects and can be used as such throughout python.
Open the file via fsspec.open
with fsspec.open(fname, mode='r') as fp:
    print(type(fp))
# -> <class '_io.TextIOWrapper'>
with fsspec.open(fname, mode='rb') as fp:
    print(type(fp))
# -> <class 'fsspec.implementations.local.LocalFileOpener'>
With mode='r', the object that fsspec.open yields is the same type of object as the one from fs.open. But with mode='rb', we get a LocalFileOpener. ⚠️ This is the key problem. In order to get a BufferedReader, we need an additional context manager!
# mode defaults to 'rb'
with fsspec.open(fname) as fp:
    with fp as fp2:
        print(type(fp2))
# -> <class '_io.BufferedReader'>
I can’t figure out what this LocalFileOpener object is. It’s not documented in the API docs. Most importantly for my purposes, xarray can’t figure out what to do with it if I pass it to open_dataset. In contrast, it handles an _io.BufferedReader object fine.
Proposed resolution
I would propose to remove the inconsistency and have with fsspec.open(fname, mode='rb') yield an _io.BufferedReader object. However, I recognize this could be a breaking change for some applications that rely on the LocalFileOpener.
The resolution appears to be to not use the context manager at all. This is what we came up with in https://github.com/google/xarray-beam/issues/49#issuecomment-1146693746
This seems to work with whatever fsspec can throw at us.
This issue has been a thorn in our side for a while, so I did a deep dive. Warning, copious detail ahead ⚠️ 🤓 ⚠️
The following code tests all combinations of the following five parameters:
- file type
- url_type
- fsspec_open_method: of = fsspec.open / create an fs and call of = fs.open
- xr_open_method: xr.open_dataset(of) / with of as fp: xr.open_dataset(fp)
- engine
We test two things:
- open_status: whether Xarray can open the object
- pickle_status: whether the resulting dataset can be pickled
code to generate table
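The original table-generating code is not shown here. As a rough sketch of the idea only, the snippet below enumerates a parameter grid with itertools.product and defines a pickle check. The parameter values are illustrative: only netcdf4, h5netcdf, and local / http URLs appear in the text, and the rest (as well as the helper names all_combinations and pickle_status) are assumptions.

```python
import itertools
import os
import pickle
import tempfile

# Illustrative parameter values; several of these are assumptions.
PARAMS = {
    "file_type": ["netcdf3", "netcdf4"],
    "url_type": ["local", "http"],
    "fsspec_open_method": ["fsspec.open", "fs.open"],
    "xr_open_method": ["direct", "context_manager"],
    "engine": ["h5netcdf", "netcdf4", "scipy"],
}


def all_combinations(params):
    """Yield one dict per combination of parameter values."""
    keys = list(params)
    for values in itertools.product(*(params[k] for k in keys)):
        yield dict(zip(keys, values))


def pickle_status(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
    except Exception:
        return False
    return True


combos = list(all_combinations(PARAMS))

# An open binary file handle is a simple example of something unpicklable.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "wb") as f:
    f.write(b"hello")
with open(path, "rb") as fp:
    handle_ok = pickle_status(fp)
plain_ok = pickle_status({"a": 1})

print(len(combos), handle_ok, plain_ok)
```

A real run would replace the placeholder values with actual files and call xarray for each combination, recording open_status and pickle_status per row.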
huge table with all the parameters
There is a lot of information in that table, so let’s focus just on netcdf4 files and the h5netcdf engine (most commonly used in Pangeo Forge because it can open remote files via h5py).
Always use Context Manager
Always using the context manager (as recommended in https://github.com/fsspec/filesystem_spec/issues/579#issuecomment-1140018018) narrows the results down to
Here we see that we can’t use fsspec.open on local files with the context manager, because it puts an _io.BufferedReader object into the dataset, which can’t be pickled.
Don’t Use Context Manager
We can disregard the advice of https://github.com/fsspec/filesystem_spec/issues/579#issuecomment-1140018018 and not use the context manager, passing the open_file object directly to xarray. (If this works it is “accidental”.) Then we get the following results
Here we can’t use fs.open where fs is an HTTPFileSystem, because it is not openable by xarray. Ultimately, the error is coming from h5netcdf, as illustrated by the following simple reproducer
traceback
Why does this matter?
In Pangeo Forge, we need to open Xarray datasets and serialize them starting from all four of the following types of objects:
- open_file = fsspec.open(local_path) -> <OpenFile '/local/path.nc'>
- open_file = fsspec.open(http_path) -> <OpenFile 'https://path.nc'>
- open_file = fs_local.open(local_path) -> <fsspec.implementations.local.LocalFileOpener at 0x18425a880>
- open_file = fs_http.open(http_path) -> <File-like object HTTPFileSystem, https://path.nc>
As this list shows, we can’t just decide to always use a context manager or never use a context manager. The only viable option today is what I do in https://github.com/pangeo-forge/pangeo-forge-recipes/pull/370/files#r884065015: use try / except logic to find something that works. I would prefer instead to have a stronger contract with fsspec.
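The try / except approach can be pictured roughly as follows. This is a minimal sketch, not the code from the linked pull request: open_with_fallback, LazyFile, and read_all are hypothetical names, read_all stands in for something like xr.open_dataset, and the choice of exceptions to catch is an assumption.

```python
import io


def open_with_fallback(opener, obj):
    """Try opener(obj) directly; if that fails, retry inside a with block."""
    try:
        return opener(obj)
    except (TypeError, AttributeError):
        # Assumption: an unusable object fails with one of these errors.
        with obj as fp:
            return opener(fp)


class LazyFile:
    """Hypothetical stand-in for an object that must be entered before use."""

    def __init__(self, payload):
        self.payload = payload

    def __enter__(self):
        return io.BytesIO(self.payload)

    def __exit__(self, *exc):
        return False


def read_all(file_like):
    """Stand-in opener that needs a real file-like object with .read()."""
    return file_like.read()


direct = open_with_fallback(read_all, io.BytesIO(b"hello"))
fallback = open_with_fallback(read_all, LazyFile(b"hello"))
print(direct, fallback)
```

Any error of the caught types raised for an unrelated reason would also trigger the second attempt, which is why this pattern is hard to make robust.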
What is the core problem in fsspec?
IMO the core issue is the fact that fsspec.open and fs.open do different things, and that the resulting objects do not have consistent behavior with xarray / h5netcdf. I wish that fsspec.open would give me back the exact same type of thing that fs.open does. Instead, fsspec.open creates fsspec.core.OpenFile objects, fs_local.open creates fsspec.implementations.local.LocalFileOpener, and fs_http.open creates fsspec.implementations.http.HTTPFile. Is a FileOpener supposed to behave the same as an OpenFile? Which of the two APIs does HTTPFile implement? The situation is highly confusing.
We can continue working around this in Pangeo Forge, but having to try / except around this makes Pangeo Forge more fragile; what if a different, legitimate i/o error comes up inside the try block? It would be hard to detect. Furthermore, based on this analysis, we can only assume that more inconsistent behavior is lurking in all of the other implementations (gcsfs, s3fs, ftp, etc.), waiting to trip us up.
I hope I have managed to articulate this problem. Martin, I would really appreciate your advice on where to go from here.