HTTP backend has flaky behavior when enumerating the directory index
Dear @martindurant, @TomAugspurger and contributors,
first things first: Thanks a stack for conceiving and maintaining this excellent library.
@gutzbenj is currently migrating the data acquisition backend of wetterdienst to use fsspec
for downloading data from the HTTP server at https://opendata.dwd.de/. For ingesting the directory index, we are currently using a hand-rolled implementation [1] based on BeautifulSoup and lxml.
Now that we are switching to fsspec, we found that it randomly includes a single folder in its result list when enumerating a remote HTTP filesystem using fs.find(url). It is expected to return only files, right? We added a concise repro at [2].
The respective folder is not always the same, so the behavior is flaky. It can be revealed by invoking the examples [1] vs. [2] by using:
# Run current implementation.
python list-remote-files-dwd.py | grep -v zip
# Run implementation based on fsspec.
python fsspec-dwd.py | grep -v zip
Alternatively, this produces a deterministic yet wrong result when counting the number of enumerated files:
python list-remote-files-dwd.py | wc -l
194595
python fsspec-dwd.py | wc -l
194596
The result list count of the fsspec-based implementation is off by one.
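To pinpoint the extra entry instead of only counting lines, the two outputs could be diffed. The following is a minimal sketch, assuming each script's output has been saved to a text file with one entry per line; the helper name extra_entries and the file arguments are hypothetical and not part of either gist.

```python
import sys


def extra_entries(reference_lines, candidate_lines):
    """Return entries that appear in the candidate listing but not in the reference."""
    # Normalise whitespace and skip blank lines before comparing.
    reference = {line.strip() for line in reference_lines if line.strip()}
    candidate = {line.strip() for line in candidate_lines if line.strip()}
    return sorted(candidate - reference)


# Usage (hypothetical file names):
#   python diff-listings.py current.txt fsspec.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as ref, open(sys.argv[2]) as cand:
        for entry in extra_entries(ref, cand):
            print(entry)
```

If the only difference is the stray folder, this should print exactly one line.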
Thank you in advance for taking the time to look into this.
Keep up the spirit and with kind regards, Andreas.
[1] https://gist.github.com/gutzbenj/88e69d10447698d099f7227a731add9b
[2] https://gist.github.com/amotl/9fc67b696cbab9f0667be60de4dcf2be
P.S.: We are using Python 3.8 and fsspec 0.8.5.
Certainly.
I will comment that, as I mentioned, there is no concrete way to say whether a URL is meant to be a directory or a file. Currently, isfile says True for any URL that is reachable (which means calling this server every time), but in this context it might be reasonable to replace with
Dear Martin,
I have handed in a small patch with #507. What we have seen results from the naming of a variable in the find() method. I simply added an underscore to the path variable (-> path_) coming from self.walk() to suppress this behaviour.
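For readers unfamiliar with this kind of naming problem, here is a simplified sketch of how reusing a name as a loop variable changes what it refers to after the loop. This is not fsspec's actual code: the TREE data, the walk() helper and both function names are hypothetical and only mimic the shape of a walk-based find().

```python
# Hypothetical directory tree: directory -> (subdirectories, files).
TREE = {
    "root": (["root/a", "root/b"], ["root/index.txt"]),
    "root/a": ([], ["root/a/1.zip"]),
    "root/b": ([], ["root/b/2.zip"]),
}


def walk(path):
    """Yield (path, dirs, files) tuples, similar in shape to os.walk."""
    dirs, files = TREE[path]
    yield path, dirs, files
    for subdir in dirs:
        yield from walk(subdir)


def find_shadowing(path):
    """Collect files, reusing the name `path` as the loop variable."""
    found = []
    for path, dirs, files in walk(path):  # rebinds the argument `path`
        found.extend(files)
    # `path` now holds the last directory visited, not the original argument.
    return found, path


def find_renamed(path):
    """Collect files, using `path_` for the loop variable instead."""
    found = []
    for path_, dirs, files in walk(path):
        found.extend(files)
    # `path` still holds the original argument.
    return found, path


print(find_shadowing("root")[1])  # root/b
print(find_renamed("root")[1])    # root
```

Any check performed on path after the loop would therefore look at a different value in the two variants, which is consistent with a walked folder ending up in the results.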