SFTPFile(AbstractBufferedFile) for sparse access to remote file over ssh

Ref: https://github.com/fsspec/filesystem_spec/issues/748, which I found while also wondering about the AttributeError: 'SFTPFile' object has no attribute 'blocksize' error 😉

I guess there is no “sparse” cache because “range” request support is not implemented for sftp the way it is for e.g. HTTPFile?

But it seems that sftp itself does allow for range requests, e.g.:

$> curl --silent --range 0-0,-1 sftp://yoh@secret.datalad.org:2222/home/yoh/c4057c5e-7af5-4370-878f-ccfc971aeba4 | hexdump
0000000 0089                                   
0000001

so I guess it should well be possible to provide such support in fsspec… I didn’t look in detail anywhere yet, but paramiko does seem to support a seekable BufferedFile

$> git grep -p 'def seek'
paramiko/_winapi.py=class MemoryMap(object):
paramiko/_winapi.py:    def seek(self, pos):
paramiko/file.py=class BufferedFile(ClosingContextManager):
paramiko/file.py:    def seekable(self):
paramiko/file.py:    def seek(self, offset, whence=0):
paramiko/sftp_file.py=class SFTPFile(BufferedFile):
paramiko/sftp_file.py:    def seekable(self):
paramiko/sftp_file.py:    def seek(self, offset, whence=0):

so maybe it is really just a quick patch away? 😉
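A minimal sketch of what such a range read could look like, assuming only that the file object supports `seek`, `tell` and `read` (as the paramiko grep above suggests for its `SFTPFile`). The helper names `fetch_range` and `first_and_last_byte` are hypothetical and not part of fsspec or paramiko; an in-memory `io.BytesIO` stands in for the remote file so the snippet runs without a server:

```python
import io


def fetch_range(f, start, end):
    """Return bytes [start, end) from a seekable binary file-like object."""
    f.seek(start)
    return f.read(end - start)


def first_and_last_byte(f):
    """Mimic `curl --range 0-0,-1`: fetch only the first and the last byte."""
    first = fetch_range(f, 0, 1)
    f.seek(0, io.SEEK_END)  # jump to the end to learn the size
    size = f.tell()
    last = fetch_range(f, size - 1, size)
    return first, last


# Stand-in for a remote file: HDF5-like signature, some payload, a trailing NUL.
demo = io.BytesIO(b"\x89HDF\r\n\x1a\n" + b"payload" + b"\x00")
print(first_and_last_byte(demo))
```

Whether a real remote file object returns exactly the requested number of bytes from a single `read` call is an assumption here; a production version would probably need to loop until the range is filled or EOF is hit.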

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

yarikoptic commented, Aug 4, 2022 (2 reactions)

Hi @efiop! Indeed long time. I really hope you are doing ok.

Ran into https://github.com/ronf/asyncssh/issues/504, so I had to move aside my ~/.ssh and tune up my ugly quick script to ask for a password.

The script with which I am “exploring” fsspec on the type of files of interest:
import sys
import fsspec

# uncomment to see what is going on!
#import logging
#logging.getLogger("fsspec").setLevel(1)
#logging.basicConfig(level=1)

import pynwb
import h5py
import urllib.parse  # plain `import urllib` does not guarantee the `parse` submodule is loaded

from time import time

if '://' in sys.argv[-1]:
    url = sys.argv[-1]
else:
    # an 11GB file to munch on. Would fetch about 80MB of data
    url = "https://dandiarchive.s3.amazonaws.com/blobs/bb8/1f7/bb81f7b3-4cfa-40e7-aa89-95beb1954d8c?versionId=F33RzmXlfGyL4rcwBIBenrW2eqDSr4qZ"

# figure out filesystem, lets add some mappings
fsmap = {'https': 'http'}

#In [8]: urllib.parse.urlparse('ssh://lkasjdf')
#Out[8]: ParseResult(scheme='ssh', netloc='lkasjdf', path='', params='', query='', fragment='')

url_ = urllib.parse.urlparse(url)
scheme = url_.scheme
scheme = fsmap.get(scheme, scheme)

if scheme == 'http':
    # for http -- everything goes into fs.open
    fspath = url
    fskw = {}
elif scheme in ('ssh', 'sshfs'):
    fspath = url_.path.lstrip('/') # consider it from $HOME for now
    import getpass
    fskw = dict(
        host=url_.netloc.split(':', 1)[0],
        port=int(url_.netloc.split(':', 1)[1]) if ':' in url_.netloc else 22,
        # cannot use keys so will demand password
        password=getpass.getpass("Password:"),
    )
else:
    raise NotImplementedError(f"Do not know how to handle {scheme}")

if scheme == 'sshfs':
    from sshfs import SSHFileSystem
    fs = SSHFileSystem(**fskw)
else:
    fs = fsspec.filesystem(scheme, **fskw)

from fsspec.implementations.cached import CachingFileSystem
fs = CachingFileSystem(
    fs=fs,
    # target_protocol='blockcache',
    cache_storage="nwb-cache",
    # cache_check=600,
    # block_size=1024,
    # check_files=True,
    # expiry_times=True,
    # same_names=True
)


print(f"Accessing {url} as {fspath} on {fs} ")
# It is crucial to use proper context managers so the cache gets closed and thus reused;
# https://docs.h5py.org/en/stable/high/file.html also warns that things
# should be closed in the proper order!  Seems to be crucial.
t0 = time()
with fs.open(fspath, 'rb') as f:
    with h5py.File(f) as file:
        with pynwb.NWBHDF5IO(file=file, load_namespaces=True) as io:
            out = io.read()
            print(f"Read something which gives me {len(str(out))} long str representation in {time()-t0:.3f} sec")

and running it results in

$> time python -Wignore cached-fsspec.py sshfs://secret.datalad.org:some/c4057c5e-7af5-4370-878f-ccfc971aeba4
Password:
Accessing sshfs://secret.datalad.org:some/c4057c5e-7af5-4370-878f-ccfc971aeba4 as c4057c5e-7af5-4370-878f-ccfc971aeba4 on <fsspec.implementations.cached.CachingFileSystem object at 0x7f6bf6aadfc0> 
Traceback (most recent call last):
  File "/home/yoh/proj/dandi/trash/cached-fsspec.py", line 71, in <module>
    with fs.open(fspath, 'rb') as f:
  File "/home/yoh/deb/gits/pkg-exppsy/pynwb-upstream/venvs/dev3/lib/python3.10/site-packages/fsspec/implementations/cached.py", line 406, in <lambda>
    return lambda *args, **kw: getattr(type(self), item).__get__(self)(
  File "/home/yoh/deb/gits/pkg-exppsy/pynwb-upstream/venvs/dev3/lib/python3.10/site-packages/fsspec/spec.py", line 1034, in open
    f = self._open(
  File "/home/yoh/deb/gits/pkg-exppsy/pynwb-upstream/venvs/dev3/lib/python3.10/site-packages/fsspec/implementations/cached.py", line 406, in <lambda>
    return lambda *args, **kw: getattr(type(self), item).__get__(self)(
  File "/home/yoh/deb/gits/pkg-exppsy/pynwb-upstream/venvs/dev3/lib/python3.10/site-packages/fsspec/implementations/cached.py", line 342, in _open
    f.cache = MMapCache(f.blocksize, f._fetch_range, f.size, fn, blocks)
AttributeError: 'SSHFile' object has no attribute '_fetch_range'


^C
python -Wignore cached-fsspec.py   2.53s user 0.18s system 1% cpu 2:57.46 total

So it stalls after the traceback and has to be interrupted with Ctrl-C. If you need a sample of that file to try exactly this script, it is this one: https://dandiarchive.s3.amazonaws.com/blobs/c40/57c/c4057c5e-7af5-4370-878f-ccfc971aeba4 . The versions of fsspec, sshfs and asyncssh are, AFAIK, all “bleeding edge from GitHub”. FWIW:

> /home/yoh/deb/gits/pkg-exppsy/pynwb-upstream/venvs/dev3/lib/python3.10/site-packages/fsspec/implementations/cached.py(342)_open()
-> f.cache = MMapCache(f.blocksize, f._fetch_range, f.size, fn, blocks)
(Pdb) p f
<sshfs.file.SSHFile object at 0x7fec96775240>
(Pdb) p f.__module__
'sshfs.file'
(Pdb) p dir(f)
['__abstractmethods__', '__class__', '__del__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__next__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_abc_impl', '_checkClosed', '_checkReadable', '_checkSeekable', '_checkWritable', '_close', '_closed', '_file', '_open_file', 'blocksize', 'close', 'closed', 'fileno', 'flush', 'fs', 'fsync', 'isatty', 'kwargs', 'loop', 'max_requests', 'mode', 'path', 'read', 'readable', 'readline', 'readlines', 'seek', 'seekable', 'tell', 'truncate', 'writable', 'write', 'writelines']
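To make the pdb output above easier to read: the failing line passes `f.blocksize`, `f._fetch_range` and `f.size` to `MMapCache`, so a quick check of which of those three names the file object lacks narrows things down. This is only a sketch; `missing_attributes` and `FakeSSHFile` are hypothetical names, and the stand-in class merely mirrors a few of the attributes listed by `dir(f)`:

```python
# The three attributes read from the file object in the failing cached.py line.
REQUIRED_FOR_MMAP_CACHE = ("blocksize", "_fetch_range", "size")


def missing_attributes(obj, required=REQUIRED_FOR_MMAP_CACHE):
    """Return the required attribute names that `obj` does not provide."""
    return [name for name in required if not hasattr(obj, name)]


class FakeSSHFile:
    """Hypothetical stand-in: has blocksize/seek/read but no _fetch_range or size."""

    blocksize = 2 ** 15  # arbitrary value, chosen only for illustration

    def seek(self, offset, whence=0):
        return offset

    def read(self, size=-1):
        return b""


print(missing_attributes(FakeSSHFile()))
```

Going by the `dir(f)` listing, `size` looks to be absent as well as `_fetch_range`; the traceback names only `_fetch_range` because the arguments are evaluated left to right and that lookup fails first.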
yarikoptic commented, Aug 4, 2022 (1 reaction)

BTW, just to make sure: having sshfs installed (and imported) does not automagically make it known to fsspec.registry.known_implementations and available via fsspec.filesystem, right? (I would have assumed it would, through some entry point or the like.)
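One way to investigate the entry-point question is to list what installed packages advertise, using only the standard library. This is a sketch under an assumption: that fsspec discovers third-party filesystems through an entry-point group named `fsspec.specs` (my reading of fsspec's startup code, not something confirmed in this thread). The helper name `list_entry_points` is hypothetical:

```python
from importlib.metadata import entry_points


def list_entry_points(group="fsspec.specs"):
    """Map entry-point name -> target string for the given group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Newer importlib.metadata API (Python 3.10+).
        selected = eps.select(group=group)
    else:
        # Older API: a plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


# Prints an empty dict when nothing in the environment registers in that group.
print(list_entry_points())
```

If a protocol name shows up here but `fsspec.filesystem` still rejects it, the mismatch is more likely in the name being requested than in the registration itself.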
