[RFC] persistent filesystem id

fsspec currently uses fs._fs_token, which is based on (among a few other things) the thread id and the args/kwargs passed to the filesystem constructor. This token is used to compare filesystem instances for caching.
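To illustrate why a token that mixes in the thread id cannot serve as a persistent identifier, here is a minimal sketch. It is a simplified stand-in and not fsspec's actual implementation; the function name session_token and its arguments are made up for this example.

```python
import hashlib
import threading


def session_token(protocol, *args, **kwargs):
    """Hypothetical stand-in for a session-scoped token.

    It hashes the constructor arguments together with the current thread id,
    so the result is only stable within one thread of one session.
    """
    parts = (protocol, threading.get_ident(), args, sorted(kwargs.items()))
    return hashlib.md5(repr(parts).encode("utf-8")).hexdigest()


# Same arguments, same thread: the token is stable.
main_token = session_token("s3", bucket="data")

# Same arguments, different thread: the token changes.
worker_tokens = []


def worker():
    worker_tokens.append(session_token("s3", bucket="data"))


thread = threading.Thread(target=worker)
thread.start()
thread.join()
```

Within one thread the token is a fine cache key, but nothing guarantees the same value in another thread or a later session, which is the gap fs_id is meant to fill.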

In dvc, we have persistent storage for metadata associated with a particular file on a particular filesystem (e.g. we record that path/to/file has an md5 of 123456 and use fs.checksum() for verification when loading it next time). Currently we only store this for local files, but we need to support remote files as well. In our experiments we’ve introduced fs_id, a hash of the config dictionary that was used to create a specific fs instance. It is very similar to _fs_token, except that it doesn’t include the thread id, so it is persistent across sessions.

In the simplest case (and probably by default), fs_id (the naming is open for discussion) could just be the hash of the whole config, but it is clear that many filesystems could do better than that. For example:

  • for local, it might be something like a hash of b"local". Where more granularity is needed, it might be based on /etc/machine-id or an alternative, so that separate machines can be distinguished. Also, fsspec itself doesn’t distinguish the real filesystems mounted under the root fs of the system, so that might also be useful to identify (e.g. maybe based on the real fsid from fstab); PrefixFileSystem might come in handy here too.
  • for s3, fs_id should really depend on things like endpoint/region/bucket/etc, and not on user creds.
  • for ssh, fs_id should depend on base url, but not on user creds.
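
The idea in the list above could be sketched roughly as follows. This is only an illustration under stated assumptions: the fs_id function, the IDENTITY_KEYS table and the config key names (endpoint_url, region, bucket, host, port) are hypothetical and are not part of fsspec or dvc.

```python
import hashlib
import json

# Hypothetical table: which config keys define a filesystem's identity.
# Credentials are deliberately left out.
IDENTITY_KEYS = {
    "s3": ("endpoint_url", "region", "bucket"),
    "ssh": ("host", "port"),
}


def fs_id(protocol, config):
    """Return a session-independent id for a filesystem config (sketch)."""
    if protocol == "local":
        # Simplest case from the proposal: a hash of b"local".
        payload = "local"
    else:
        keys = IDENTITY_KEYS.get(protocol)
        if keys is None:
            # Default: fall back to hashing the whole config.
            selected = config
        else:
            selected = {key: config.get(key) for key in keys}
        # sort_keys makes the serialization independent of dict ordering.
        payload = json.dumps([protocol, selected], sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With this sketch, two s3 configs that differ only in user credentials map to the same fs_id, while a different bucket or endpoint yields a different one. No thread or process id is involved, so the value is the same across sessions.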

The fs_id algorithm for a particular filesystem could change in the future (e.g. if developers forgot to include some important parameter), so users should be ready for this and have some mechanism to recover. For dvc, it will simply mean that old cached metadata is effectively dropped, which is just the temporary inconvenience of recomputing md5s for the requested files.
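
A minimal sketch of that recovery behaviour, assuming metadata is keyed by fs_id and path. The MetadataCache class and its method are hypothetical, and an in-memory dict stands in for whatever persistent store would be used in practice.

```python
class MetadataCache:
    """Hypothetical metadata cache keyed by (fs_id, path)."""

    def __init__(self):
        # Stand-in for a persistent store (e.g. a database on disk).
        self._entries = {}

    def get_md5(self, fs_id, path, checksum, compute_md5):
        """Return the cached md5 if it is still valid, else recompute it.

        checksum is the filesystem's current cheap checksum for the path;
        compute_md5 is a zero-argument callable doing the expensive work.
        """
        key = (fs_id, path)
        entry = self._entries.get(key)
        if entry is not None and entry["checksum"] == checksum:
            return entry["md5"]
        # Miss: new file, changed file, or an fs_id whose algorithm changed.
        md5 = compute_md5()
        self._entries[key] = {"checksum": checksum, "md5": md5}
        return md5
```

If the fs_id algorithm changes, lookups under the new id simply miss and the md5 is recomputed once; entries stored under the old id are never matched again and can be cleaned up separately.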

I think this functionality could be handy for fsspec users in scenarios where they want to identify filesystems across sessions and maybe keep persistent storage of metadata. We used to use a hash of the URL (e.g. s3://bucket/path) in some cases, but a URL cannot account for all the important parameters, so a standalone fs_id would be much better. If there is interest in adopting it, we’ll be happy to contribute PRs for it. We would really appreciate any comments/suggestions.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
martindurant commented, Sep 28, 2021

id is reserved, no?

But yes, I think your understanding is correct.

1 reaction
martindurant commented, Sep 27, 2021

I agree a session-independent hash of a filesystem seems like a reasonable thing to want. Indeed, it is complicated by the fact that specifics (credentials, defaults) may depend on the environment, but even so. Having a method that can be overridden per filesystem would allow, as you say, for constraining which information is considered part of the filesystem’s identity.

Question: should this be used as the hash for the existing instance cache? Are other state-dependent factors actually useful?
