Inconsistent behavior with fsspec.open with mode='r' vs. mode='rb'

It’s unclear to me how to use fsspec.open as a context manager. Its behavior seems inconsistent, and the inconsistency depends on the mode argument. I will illustrate this with an example.

Create a test file

Bypass fsspec completely

fname = 'test.txt'
with open(fname, mode='w') as f:
    f.write('hello')

Open the file with a FileSystem instance

When I open the file via an instantiated FileSystem instance, everything works as I expect:

import fsspec
import fsspec.implementations.local

fs = fsspec.implementations.local.LocalFileSystem()
with fs.open(fname, mode='r') as fp:
    print(type(fp))
# -> <class '_io.TextIOWrapper'>
with fs.open(fname, mode='rb') as fp:
    print(type(fp))
# -> <class '_io.BufferedReader'>

The objects yielded by the context manager look like standard open file objects and can be used as such throughout Python.

Open the file via fsspec.open

with fsspec.open(fname, mode='r') as fp:
    print(type(fp))
# -> <class '_io.TextIOWrapper'>
with fsspec.open(fname, mode='rb') as fp:
    print(type(fp))
# -> <class 'fsspec.implementations.local.LocalFileOpener'>

With mode='r', fsspec.open yields the same type of object as fs.open. But with mode='rb', we get a LocalFileOpener. ⚠️ This is the key problem. To get a BufferedReader, we need an additional context manager!

with fsspec.open(fname) as fp:
    with fp as fp2:
        print(type(fp2))
# -> <class '_io.BufferedReader'>

I can’t figure out what this LocalFileOpener object is. It’s not documented in the API docs. Most importantly for my purposes, xarray can’t figure out what to do with it if I pass it to open_dataset. In contrast, it handles an _io.BufferedReader object fine.

Proposed resolution

I would propose to remove the inconsistency and have with fsspec.open(fname, mode='rb') yield an _io.BufferedReader object. However, I recognize this could be a breaking change for some applications that rely on the LocalFileOpener.

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 13 (13 by maintainers)

Top GitHub Comments

rabernat commented, Jun 8, 2022 (1 reaction)

The resolution appears to be not to use the context manager at all. This is what we came up with in https://github.com/google/xarray-beam/issues/49#issuecomment-1146693746:

if hasattr(open_file, "open"):
    open_file = open_file.open()
ds = xr.open_dataset(open_file)

This seems to work with whatever fsspec can throw at us.
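
The duck-typing check above can be wrapped in a small helper. The following is a minimal sketch using only the standard library: `FakeOpenFile` is a hypothetical stand-in for a lazy handle that exposes `.open()` (as `fsspec.core.OpenFile` does), and `as_file_like` is an illustrative name, not part of fsspec or xarray.

```python
import io


class FakeOpenFile:
    """Hypothetical stand-in for a lazy handle that must be opened before reading."""

    def __init__(self, payload: bytes):
        self._payload = payload

    def open(self):
        # Hand back a real file-like object, which is what the workaround relies on.
        return io.BytesIO(self._payload)


def as_file_like(obj):
    """Return obj.open() if obj has an `open` attribute, otherwise obj unchanged."""
    if hasattr(obj, "open"):
        return obj.open()
    return obj


lazy = as_file_like(FakeOpenFile(b"hello"))   # goes through .open()
ready = as_file_like(io.BytesIO(b"hello"))    # already file-like, passed through
print(lazy.read(), ready.read())
```

With this pattern the caller remains responsible for closing whatever the helper returns, since no context manager is involved.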

rabernat commented, May 29, 2022 (1 reaction)

This issue has been a thorn in our side for a while, so I did a deep dive. Warning, copious detail ahead ⚠️ 🤓 ⚠️

The following code tests all combinations of the following five parameters:

| parameter | options |
| --- | --- |
| file type | netcdf4 / netcdf3 |
| url_type | http / local file |
| fsspec_open_method | of = fsspec.open / create an fs and call of = fs.open |
| xr_open_method | call xr.open_dataset(of) / with of as fp: xr.open_dataset(fp) |
| engine | all possible xarray engines for file type |

We test two things:

  • open_status: whether Xarray can open the object
  • pickle_status: whether the resulting dataset can be pickled

Code to generate table:

from pickle import dumps

import fsspec
from fsspec.implementations.http import HTTPFileSystem
from fsspec.implementations.local import LocalFileSystem
import xarray as xr
import pandas as pd

def open_from_fs(path):
    if path.startswith("http"):
        fs = HTTPFileSystem()
    else:
        fs = LocalFileSystem()
    return fs.open(path)

def open_from_open(path):
    return fsspec.open(path)

def xr_open_directly(thing, **kwargs):
    ds = xr.open_dataset(thing, **kwargs)
    # NOTE: this returns the object that was passed in, not the dataset `ds`.
    return thing

def xr_open_with_context_manager(thing, **kwargs):
    with thing as fp:
        with xr.open_dataset(fp, **kwargs) as ds:
            return ds
    

files = {
    'netcdf4': 'https://www.unidata.ucar.edu/software/netcdf/examples/OMI-Aura_L2-example.nc',
    'netcdf3': 'https://www.unidata.ucar.edu/software/netcdf/examples/sresa1b_ncar_ccsm3-example.nc'
}

engines = {
    'netcdf4': ('h5netcdf', 'netcdf4', None),
    'netcdf3': ('scipy', 'netcdf4', None)
}

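# NOTE: as written, the 'fsspec.open' label maps to open_from_fs (which calls fs.open)
# and the 'fs.open' label maps to open_from_open (which calls fsspec.open).
fsspec_open_methods = {
    'fsspec.open': open_from_fs,
    'fs.open': open_from_open
}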

xr_open_methods = {
    'direct_open': xr_open_directly,
    'context_manager': xr_open_with_context_manager
}

results = []
columns = ('file type', 'url_type', 'fsspec_open_method', 'xr_open_method', 'engine', 'open_status', 'pickle_status')
for file_type, url in files.items():
    local_path = url.split('/')[-1]
    for url_type, path in zip(('http', 'local'), (url, local_path)):
        for open_method, open_fn in fsspec_open_methods.items():
            for xr_open_method, xr_open_fn in xr_open_methods.items():
                for engine in engines[file_type]:
                    params = (file_type, url_type, open_method, xr_open_method, engine or "None")
                    pickle_status = open_status = ""
                    open_file = None  # reset so a failed open_fn() never closes a stale handle
                    try:
                        open_file = open_fn(path)
                        ds = xr_open_fn(open_file, engine=engine)
                        open_status = "✅"
                        try:
                            _ = dumps(ds)
                            pickle_status = "✅"
                        except Exception as e1:
                            pickle_status = f"❌ {type(e1).__name__}: {e1}".replace('\n', ' ')
                    except Exception as e2:
                        open_status = f"❌ {type(e2).__name__}: {e2}".replace('\n', ' ')
                        pickle_status = "n/a"
                    finally:
                        if open_file is not None:
                            open_file.close()
                    results.append(
                        params + (open_status, pickle_status)
                    )
    

df = pd.DataFrame(data=results, columns=columns)
display(df)
df.to_markdown("results.md", index=False, tablefmt="github")

Huge table with all the parameters:

| file type | url_type | fsspec_open_method | xr_open_method | engine | open_status | pickle_status |
| --- | --- | --- | --- | --- | --- | --- |
| netcdf4 | http | fsspec.open | direct_open | h5netcdf | ✅ | ✅ |
| netcdf4 | http | fsspec.open | direct_open | netcdf4 | ❌ ValueError: can only read bytes or file-like objects with engine='scipy' or 'h5netcdf' | n/a |
| netcdf4 | http | fsspec.open | direct_open | None | ✅ | ✅ |
| netcdf4 | http | fsspec.open | context_manager | h5netcdf | ✅ | ✅ |
| netcdf4 | http | fsspec.open | context_manager | netcdf4 | ❌ ValueError: can only read bytes or file-like objects with engine='scipy' or 'h5netcdf' | n/a |
| netcdf4 | http | fsspec.open | context_manager | None | ❌ ValueError: I/O operation on closed file. | n/a |
| netcdf4 | http | fs.open | direct_open | h5netcdf | ❌ AttributeError: 'HTTPFile' object has no attribute '__fspath__' | n/a |
| netcdf4 | http | fs.open | direct_open | netcdf4 | ❌ AttributeError: 'HTTPFile' object has no attribute '__fspath__' | n/a |
| netcdf4 | http | fs.open | direct_open | None | ❌ ValueError: did not find a match in any of xarray's currently installed IO backends ['netcdf4', 'h5netcdf', 'scipy', 'cfgrib', 'pydap', 'rasterio', 'zarr']. Consider explicitly selecting one of the installed engines via the engine parameter, or installing additional IO dependencies, see: https://docs.xarray.dev/en/stable/getting-started-guide/installing.html https://docs.xarray.dev/en/stable/user-guide/io.html | n/a |
| netcdf4 | http | fs.open | context_manager | h5netcdf | ❌ ValueError: I/O operation on closed file. | n/a |
| netcdf4 | http | fs.open | context_manager | netcdf4 | ❌ ValueError: can only read bytes or file-like objects with engine='scipy' or 'h5netcdf' | n/a |
| netcdf4 | http | fs.open | context_manager | None | ❌ ValueError: I/O operation on closed file. | n/a |
| netcdf4 | local | fsspec.open | direct_open | h5netcdf | ✅ | ✅ |
| netcdf4 | local | fsspec.open | direct_open | netcdf4 | ✅ | ✅ |
| netcdf4 | local | fsspec.open | direct_open | None | ✅ | ✅ |
| netcdf4 | local | fsspec.open | context_manager | h5netcdf | ✅ | ❌ TypeError: cannot pickle '_io.BufferedReader' object |
| netcdf4 | local | fsspec.open | context_manager | netcdf4 | ❌ ValueError: can only read bytes or file-like objects with engine='scipy' or 'h5netcdf' | n/a |
| netcdf4 | local | fsspec.open | context_manager | None | ✅ | ❌ TypeError: cannot pickle '_io.BufferedReader' object |
| netcdf4 | local | fs.open | direct_open | h5netcdf | ✅ | ✅ |
| netcdf4 | local | fs.open | direct_open | netcdf4 | ✅ | ✅ |
| netcdf4 | local | fs.open | direct_open | None | ✅ | ✅ |
| netcdf4 | local | fs.open | context_manager | h5netcdf | ✅ | ✅ |
| netcdf4 | local | fs.open | context_manager | netcdf4 | ✅ | ✅ |
| netcdf4 | local | fs.open | context_manager | None | ✅ | ✅ |
| netcdf3 | http | fsspec.open | direct_open | scipy | ✅ | ✅ |
| netcdf3 | http | fsspec.open | direct_open | netcdf4 | ❌ ValueError: can only read bytes or file-like objects with engine='scipy' or 'h5netcdf' | n/a |
| netcdf3 | http | fsspec.open | direct_open | None | ✅ | ✅ |
| netcdf3 | http | fsspec.open | context_manager | scipy | ✅ | ✅ |
| netcdf3 | http | fsspec.open | context_manager | netcdf4 | ❌ ValueError: can only read bytes or file-like objects with engine='scipy' or 'h5netcdf' | n/a |
| netcdf3 | http | fsspec.open | context_manager | None | ✅ | ✅ |
| netcdf3 | http | fs.open | direct_open | scipy | ❌ AttributeError: 'HTTPFile' object has no attribute '__fspath__' | n/a |
| netcdf3 | http | fs.open | direct_open | netcdf4 | ❌ AttributeError: 'HTTPFile' object has no attribute '__fspath__' | n/a |
| netcdf3 | http | fs.open | direct_open | None | ❌ ValueError: did not find a match in any of xarray's currently installed IO backends ['netcdf4', 'h5netcdf', 'scipy', 'cfgrib', 'pydap', 'rasterio', 'zarr']. Consider explicitly selecting one of the installed engines via the engine parameter, or installing additional IO dependencies, see: https://docs.xarray.dev/en/stable/getting-started-guide/installing.html https://docs.xarray.dev/en/stable/user-guide/io.html | n/a |
| netcdf3 | http | fs.open | context_manager | scipy | ✅ | ✅ |
| netcdf3 | http | fs.open | context_manager | netcdf4 | ❌ ValueError: can only read bytes or file-like objects with engine='scipy' or 'h5netcdf' | n/a |
| netcdf3 | http | fs.open | context_manager | None | ✅ | ✅ |
| netcdf3 | local | fsspec.open | direct_open | scipy | ✅ | ✅ |
| netcdf3 | local | fsspec.open | direct_open | netcdf4 | ✅ | ✅ |
| netcdf3 | local | fsspec.open | direct_open | None | ✅ | ✅ |
| netcdf3 | local | fsspec.open | context_manager | scipy | ✅ | ❌ TypeError: cannot pickle '_io.BufferedReader' object |
| netcdf3 | local | fsspec.open | context_manager | netcdf4 | ❌ ValueError: can only read bytes or file-like objects with engine='scipy' or 'h5netcdf' | n/a |
| netcdf3 | local | fsspec.open | context_manager | None | ✅ | ❌ TypeError: cannot pickle '_io.BufferedReader' object |
| netcdf3 | local | fs.open | direct_open | scipy | ✅ | ✅ |
| netcdf3 | local | fs.open | direct_open | netcdf4 | ✅ | ✅ |
| netcdf3 | local | fs.open | direct_open | None | ✅ | ✅ |
| netcdf3 | local | fs.open | context_manager | scipy | ✅ | ✅ |
| netcdf3 | local | fs.open | context_manager | netcdf4 | ✅ | ✅ |
| netcdf3 | local | fs.open | context_manager | None | ✅ | ✅ |

There is a lot of information in that table, so let’s focus just on netcdf4 files and the h5netcdf engine (the one most commonly used in Pangeo Forge because it can open remote files via h5py).

Always use Context Manager

Always using the context manager (as recommended in https://github.com/fsspec/filesystem_spec/issues/579#issuecomment-1140018018) narrows the results down to:

| url_type | fsspec_open_method | open_status | pickle_status |
| --- | --- | --- | --- |
| http | fsspec.open | ✅ | ✅ |
| http | fs.open | ✅ | ✅ |
| local | fsspec.open | ✅ | ❌ TypeError: cannot pickle '_io.BufferedReader' object |
| local | fs.open | ✅ | ✅ |

Here we see that we can’t use fsspec.open on local files with the context manager, because it puts an _io.BufferedReader object into the dataset, which can’t be pickled.
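
The pickling failure itself does not depend on xarray or fsspec: a plain binary file handle from the built-in `open` cannot be pickled, so any object that holds a reference to one is likely to fail in the same way. A minimal standard-library sketch (the temporary file is only scaffolding for the demonstration):

```python
import os
import pickle
import tempfile

# Create a small throwaway file that can be opened in binary mode.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    path = tmp.name

try:
    with open(path, "rb") as fp:
        try:
            pickle.dumps(fp)  # fp is an _io.BufferedReader
            pickled = True
        except TypeError as err:
            pickled = False
            print(f"{type(err).__name__}: {err}")
finally:
    os.remove(path)
```

The exact wording of the error message varies between Python versions, but the exception type is `TypeError`, matching the table above.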

Don’t Use Context Manager

We can disregard the advice of https://github.com/fsspec/filesystem_spec/issues/579#issuecomment-1140018018 and not use the context manager, passing the open_file object directly to xarray. (If this works it is “accidental”.) Then we get the following results:

| url_type | fsspec_open_method | open_status | pickle_status |
| --- | --- | --- | --- |
| http | fsspec.open | ✅ | ✅ |
| http | fs.open | ❌ AttributeError: 'HTTPFile' object has no attribute '__fspath__' | n/a |
| local | fsspec.open | ✅ | ✅ |
| local | fs.open | ✅ | ✅ |

Here we can’t use fs.open where fs is an HTTPFileSystem, because the resulting object cannot be opened by xarray. Ultimately, the error is coming from h5netcdf, as illustrated by the following simple reproducer:

import fsspec
import xarray as xr

url = 'https://www.unidata.ucar.edu/software/netcdf/examples/OMI-Aura_L2-example.nc' # netcdf
open_file = fsspec.open(url)
ds = xr.open_dataset(open_file, engine='h5netcdf')

Traceback:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-7-71c2c6bbd0b7> in <module>
      3 open_file = fsspec.open(url)
      4 #with open_file as fp:
----> 5 ds = xr.open_dataset(open_file, engine='h5netcdf')
      6 #    _ = dumps(ds)

~/Code/xarray/xarray/backends/api.py in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, backend_kwargs, *args, **kwargs)
    493 
    494     overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 495     backend_ds = backend.open_dataset(
    496         filename_or_obj,
    497         drop_variables=drop_variables,

~/Code/xarray/xarray/backends/h5netcdf_.py in open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, format, group, lock, invalid_netcdf, phony_dims, decode_vlen_strings)
    384     ):
    385 
--> 386         filename_or_obj = _normalize_path(filename_or_obj)
    387         store = H5NetCDFStore.open(
    388             filename_or_obj,

~/Code/xarray/xarray/backends/common.py in _normalize_path(path)
     21 def _normalize_path(path):
     22     if isinstance(path, os.PathLike):
---> 23         path = os.fspath(path)
     24 
     25     if isinstance(path, str) and not is_remote_uri(path):

~/Code/filesystem_spec/fsspec/core.py in __fspath__(self)
     96     def __fspath__(self):
     97         # may raise if cannot be resolved to local file
---> 98         return self.open().__fspath__()
     99 
    100     def __enter__(self):

AttributeError: 'HTTPFile' object has no attribute '__fspath__'
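
The mechanism in the traceback above can be reproduced with the standard library alone. In this sketch `RemoteFile` and `LazyHandle` are hypothetical stand-ins (not fsspec classes): the wrapper defines `__fspath__`, so it passes the `os.PathLike` check that `_normalize_path` performs, but the delegated call fails because the inner object has no local path.

```python
import os


class RemoteFile:
    """Hypothetical stand-in for a remote file object with no local path."""


class LazyHandle:
    """Hypothetical stand-in for a wrapper that delegates __fspath__."""

    def __init__(self, inner):
        self._inner = inner

    def __fspath__(self):
        # Delegation fails when the inner object does not define __fspath__.
        return self._inner.__fspath__()


handle = LazyHandle(RemoteFile())

# Defining __fspath__ is enough to satisfy the os.PathLike check ...
print(isinstance(handle, os.PathLike))

# ... but the conversion itself raises AttributeError from the delegated call.
try:
    os.fspath(handle)
except AttributeError as err:
    print(f"AttributeError: {err}")
```

A caller that only guards with `isinstance(path, os.PathLike)` therefore has no way to know in advance that the conversion will fail.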

Why does this matter?

In Pangeo Forge, we need to open Xarray datasets and serialize them starting from all four of the following types of objects:

  • open_file = fsspec.open(local_path) -> <OpenFile '/local/path.nc'>
    • ❌ doesn’t work with context manager (TypeError: cannot pickle ‘_io.BufferedReader’ object)
    • ✅ works without context manager
  • open_file = fsspec.open(http_path) -> <OpenFile 'https://path.nc'>
    • ✅ works with context manager
    • ✅ works without context manager
  • open_file = fs_local.open(local_path) -> <fsspec.implementations.local.LocalFileOpener at 0x18425a880>
    • ✅ works with context manager
    • ✅ works without context manager
  • open_file = fs_http.open(http_path) -> <File-like object HTTPFileSystem, https://path.nc>
    • ✅ works with context manager
    • ❌ doesn’t work without context manager (AttributeError: 'HTTPFile' object has no attribute '__fspath__')

As this list shows, we can’t just decide to always use a context manager or never use a context manager. The only viable option today is what I do in https://github.com/pangeo-forge/pangeo-forge-recipes/pull/370/files#r884065015: use try / except logic to find something that works. I would prefer instead to have a stronger contract with fsspec.

What is the core problem in fsspec?

IMO the core issue is the fact that fsspec.open and fs.open do different things, and that the resulting objects do not have consistent behavior with xarray / h5netcdf. I wish that fsspec.open would give me back the exact same type of thing that fs.open does. Instead, fsspec.open creates fsspec.core.OpenFile objects, fs_local.open creates fsspec.implementations.local.LocalFileOpener, and fs_http.open creates fsspec.implementations.http.HTTPFile. Is a LocalFileOpener supposed to behave the same as an OpenFile? Which of the two APIs does HTTPFile implement? The situation is highly confusing.

We can continue working around this in Pangeo Forge, but having to try / except around this makes Pangeo Forge more fragile; what if a different, legitimate i/o error comes up inside the try block? It would be hard to detect. Furthermore, based on this analysis, we can only assume that more inconsistent behavior is lurking in all of the other implementations (gcsfs, s3fs, ftp, etc.), waiting to trip us up.
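
One way to limit that fragility is to swallow only the exception types the known mismatches produce and let everything else propagate. The sketch below is an illustration under assumptions, not the Pangeo Forge implementation: `open_with_fallbacks`, the two loaders, and the default `expected` tuple are all hypothetical names and choices, and the loaders just read bytes instead of calling xarray.

```python
import io


def open_with_fallbacks(obj, loaders, expected=(AttributeError, TypeError)):
    """Try each loader in order; swallow only the expected exception types."""
    failures = []
    for loader in loaders:
        try:
            return loader(obj)
        except expected as err:
            failures.append(f"{loader.__name__}: {type(err).__name__}: {err}")
    raise RuntimeError("no loader succeeded: " + "; ".join(failures))


def load_directly(obj):
    # Hypothetical loader: needs an object that is already readable.
    return obj.read()


def load_after_open(obj):
    # Hypothetical loader: opens a lazy handle first, then reads.
    with obj.open() as fp:
        return fp.read()


class LazyHandle:
    """Hypothetical lazy handle that has .open() but no .read()."""

    def open(self):
        return io.BytesIO(b"hello")


result = open_with_fallbacks(LazyHandle(), [load_directly, load_after_open])
print(result)
```

Anything outside `expected`, such as an `OSError` from a genuine I/O failure, is raised immediately instead of silently triggering the next fallback. The table above also shows `ValueError` for some combinations, so the right tuple depends on which combinations need to be covered.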

I hope I have managed to articulate this problem. Martin, I would really appreciate your advice on where to go from here.
