exists() on non-existent files might take up to 6 seconds
In the following example, the first exists() call takes about 5.4 seconds and the next one takes about 3 seconds. This is probably because, when a key doesn't exist, we first call head_object and then proceed with list_objects to see whether it is a directory. If neither of those checks out, we end up waiting 6 seconds.
import time
import s3fs
path = "existing-bucket/something"
fs = s3fs.S3FileSystem()
fs.rm(path)
t0 = time.perf_counter()
fs.exists(path)
t1 = time.perf_counter()
print("exists() on a non-existent file took: ", t1 - t0, "seconds")
fs.touch(path)
t0 = time.perf_counter()
fs.exists(path)
t1 = time.perf_counter()
print("exists() on an existing file took: ", t1 - t0, "seconds")
fs.rm(path)
Issue Analytics
- State:
- Created 2 years ago
- Comments:6 (6 by maintainers)
So after a bit of research, I've found out that the list_objects_v2 API, when used with a prefix, yields results in ascending UTF-8 order. This means that if there is a file whose name exactly matches the prefix we are passing, it is going to be the first result. We can use this to call list_objects_v2 with max_keys=1 (just like we do for directories, but without the / at the end) and check whether there are any matches for that file. If there is one, it will be the only result. We can't judge whether it is a directory or not with this simple call, but we can tell whether such a prefix exists by checking if any keys are yielded. That way we can determine the file/directory's existence with a single API call, and for files it is about 1.5-2x faster.
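The idea above can be sketched roughly as follows. This is only an illustration, not the s3fs implementation: first_key_with_prefix is a hypothetical helper, and FakeS3Client is a made-up in-memory stand-in so the snippet runs without AWS access. The only real API assumed is boto3's list_objects_v2 with its Bucket, Prefix and MaxKeys parameters and the KeyCount/Contents fields of its response.

```python
def first_key_with_prefix(client, bucket, key):
    """Return the first key in `bucket` that starts with `key`, or None.

    Hypothetical helper: issues a single list_objects_v2 request with
    MaxKeys=1 and no trailing "/" on the prefix.
    """
    resp = client.list_objects_v2(Bucket=bucket, Prefix=key, MaxKeys=1)
    if resp.get("KeyCount", 0) == 0:
        return None
    return resp["Contents"][0]["Key"]


class FakeS3Client:
    """Made-up in-memory stand-in for a boto3 S3 client (illustration only)."""

    def __init__(self, keys_by_bucket):
        self._keys = keys_by_bucket

    def list_objects_v2(self, Bucket, Prefix="", MaxKeys=1000):
        # Mimic S3 listing order: keys sorted by their UTF-8 bytes, ascending.
        matches = sorted(
            (k for k in self._keys.get(Bucket, []) if k.startswith(Prefix)),
            key=lambda k: k.encode("utf-8"),
        )[:MaxKeys]
        return {"KeyCount": len(matches), "Contents": [{"Key": k} for k in matches]}


client = FakeS3Client({"existing-bucket": ["something", "something/else", "other"]})

hit = first_key_with_prefix(client, "existing-bucket", "something")
miss = first_key_with_prefix(client, "existing-bucket", "missing")

print("prefix exists:", hit is not None)      # something is stored under the prefix
print("exact file match:", hit == "something")  # the first key is the file itself
print("missing prefix:", miss)
```

With a real client, the same helper would be called with boto3.client("s3") in place of the fake. Comparing the returned key with the requested one tells an exact file match apart from a key that merely shares the prefix.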
One problem with this is that it doesn't support object versions. So we can implement this particular case as an optimization when s3fs is initialized in non-version-aware mode. These are the results I got for the script shared:
There may be an argument for caching HEAD calls, I suppose. They would need to be separately invalidated, though, which adds a layer of complexity. Do you find repeated exists() calls on the same path a lot?
Yes, your understanding of _ls_from_cache is correct.