HTTP backend has flaky behavior when enumerating the directory index

Dear @martindurant, @TomAugspurger and contributors,

first things first: Thanks a stack for conceiving and maintaining this excellent library.

@gutzbenj is currently migrating the data acquisition backend of wetterdienst to use fsspec for downloading data from the HTTP server at https://opendata.dwd.de/. For ingesting the directory index, we are currently using a hand-rolled implementation [1] based on BeautifulSoup and lxml.

Now that we are switching to fsspec, we found that it would randomly include a single folder within its results list when enumerating a remote HTTP filesystem using fs.find(url). It is expected that it would only return files as results, right? We added a concise repro at [2].

The respective folder is not always the same, so the behavior is flaky. It can be revealed by running examples [1] and [2] as follows:

# Run current implementation.
python list-remote-files-dwd.py | grep -v zip

# Run implementation based on fsspec.
python fsspec-dwd.py | grep -v zip

Alternatively, counting the number of enumerated files gives a deterministic yet wrong result:

python list-remote-files-dwd.py | wc -l
  194595
python fsspec-dwd.py | wc -l
  194596

The result list count of the fsspec-based implementation is off by one.
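
To pin down which entry is the extra one, rather than scanning the output of grep -v zip by eye, the two listings can be compared as sets. This is only a minimal sketch: the function name extra_entries and the example.org URLs are made up for illustration, and the real inputs would be the outputs of the two scripts above.

```python
from typing import Iterable, List


def extra_entries(reference: Iterable[str], candidate: Iterable[str]) -> List[str]:
    """Return the entries present in `candidate` but missing from `reference`."""
    return sorted(set(candidate) - set(reference))


# Hypothetical listings standing in for the output of the two scripts.
reference = [
    "https://example.org/data/a.zip",
    "https://example.org/data/b.zip",
]
candidate = reference + ["https://example.org/data/sub/"]

print(extra_entries(reference, candidate))
```

With real data, a surplus entry that ends in a slash would point at a folder having slipped into the file list.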

Thank you in advance for taking the time to look into this.

Keep up the spirit and with kind regards, Andreas.

[1] https://gist.github.com/gutzbenj/88e69d10447698d099f7227a731add9b
[2] https://gist.github.com/amotl/9fc67b696cbab9f0667be60de4dcf2be

P.S.: We are using Python 3.8 and fsspec 0.8.5.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

martindurant commented, Dec 21, 2020

> let’s wait for him to come back providing a patch

Certainly.

I will comment that, as I mentioned, there is no concrete way to say whether a URL is meant to be a directory or a file. Currently, isfile says True for any URL that is reachable (which means calling the server every time), but in this context it might be reasonable to replace it with:

    async def _isfile(self, path, **kwargs):
        # Heuristic: a URL ending in "/" is treated as a directory, anything else as a file.
        return not path.endswith("/")

gutzbenj commented, Dec 21, 2020

Dear Martin,

I have submitted a small patch in #507. What we’ve seen results from a variable name clash in the find() method. I simply added an underscore to the path variable (-> path_) coming from self.walk() to suppress this behaviour.
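
For readers unfamiliar with this kind of bug, here is a minimal, self-contained sketch of how a loop variable can rebind a function argument of the same name. It is not fsspec's code: the toy TREE, the walk() helper and both function names are hypothetical, and the thread does not spell out exactly how the rebinding led to the extra folder, so the sketch only shows the rebinding itself.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical toy tree: directory -> (subdirectories, files).
TREE: Dict[str, Tuple[List[str], List[str]]] = {
    "root/": (["root/a/", "root/b/"], ["root/readme.txt"]),
    "root/a/": ([], ["root/a/1.zip"]),
    "root/b/": ([], ["root/b/2.zip"]),
}


def walk(path: str) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (path, dirs, files) top-down, in the spirit of os.walk."""
    dirs, files = TREE[path]
    yield path, dirs, files
    for sub in dirs:
        yield from walk(sub)


def path_after_loop_shadowed(path: str) -> str:
    # The loop target reuses the name `path`, so the argument is overwritten.
    for path, dirs, files in walk(path):
        pass
    return path  # the last directory walked, not the caller's argument


def path_after_loop_renamed(path: str) -> str:
    # Renaming the loop target to `path_` leaves the argument untouched.
    for path_, dirs, files in walk(path):
        pass
    return path


print(path_after_loop_shadowed("root/"))  # root/b/
print(path_after_loop_renamed("root/"))   # root/
```

Any check on path made after the loop would therefore look at whichever directory happened to be walked last, which would fit the observation that the stray folder is not always the same one.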
