exists() on non-existent files might take up to 6 seconds

In the following example, the first exists() call takes about 5.4 seconds and the second takes about 3 seconds. This is probably because, when a key doesn’t exist, we first call head_object and then fall back to list_objects to check whether the path is a directory. If neither check finds anything, we end up waiting about 6 seconds.

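import time
import s3fs

path = "existing-bucket/something"

fs = s3fs.S3FileSystem()
# Start from a clean state so that the first exists() call misses.
fs.rm(path)

# Time exists() for a key that is not there.
t0 = time.perf_counter()
fs.exists(path)
t1 = time.perf_counter()
print("exists() on a non-existent file took: ", t1 - t0, "seconds")

# Create an empty object at the same path.
fs.touch(path)

# Time exists() again, now that the key is present.
t0 = time.perf_counter()
fs.exists(path)
t1 = time.perf_counter()
print("exists() on an existing file took: ", t1 - t0, "seconds")

# Clean up the test object.
fs.rm(path)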
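
The two-step lookup suspected above can be sketched without touching S3. This is only an illustration under stated assumptions, not the s3fs implementation: CountingClient, FakeNotFound and exists_two_step are hypothetical names, the client is an in-memory stand-in that mimics the head_object and list_objects_v2 call shapes, and the fallback listing uses a trailing "/" to look for a directory. It shows why a miss costs two round trips while a hit on a file costs one.

```python
class FakeNotFound(Exception):
    """Hypothetical stand-in for the 404 error a real client would raise."""


class CountingClient:
    """Hypothetical in-memory client that records each round trip."""

    def __init__(self, keys):
        self.keys = set(keys)
        self.calls = []

    def head_object(self, Bucket, Key):
        self.calls.append("head_object")
        if Key not in self.keys:
            raise FakeNotFound(Key)
        return {"ContentLength": 0}

    def list_objects_v2(self, Bucket, Prefix, MaxKeys=1000):
        self.calls.append("list_objects_v2")
        matches = sorted(k for k in self.keys if k.startswith(Prefix))[:MaxKeys]
        return {"KeyCount": len(matches)}


def exists_two_step(client, bucket, key):
    # Step 1: treat the path as a file and ask for its metadata.
    try:
        client.head_object(Bucket=bucket, Key=key)
        return True
    except FakeNotFound:
        pass
    # Step 2: treat the path as a directory and look for any key under "key/".
    prefix = key.rstrip("/") + "/"
    response = client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return response["KeyCount"] > 0
```

With a real client each recorded call is a network request, so a path that matches neither a file nor a directory always pays for both.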

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

isidentical commented, Apr 29, 2021

After a bit of research, I found that the list_objects_v2 API, when called with a prefix, yields results in ascending UTF-8 binary order. This means that if there is a file whose name exactly matches the prefix we pass, it will be the first result. We can rely on this to call list_objects_v2 with max_keys=1 (just like we do for directories, but without the trailing /) and check whether anything matches that file.

"Objects are returned sorted in an ascending order of the respective key names in the list." (ListObjectsV2)

"List results are always returned in UTF-8 binary order." (Listing objects)

If there is, it will be the only result. We can’t tell whether it is a directory from this single call, but we can confirm that such a prefix exists by checking whether any keys are returned. That way we can determine the file’s or directory’s existence with a single API call, and for files it is about 1.5-2x faster.

One problem is that this approach doesn’t support object versions, so we can implement it as an optimization that applies only when s3fs is initialized in non-version-aware mode. These are the results I got for the script shared:

before:
exists() on a non-existent file took:  5.687416751000001 seconds
exists() on a non-existent directory took:  5.066671551999889 seconds
exists() on an existing file took:  3.102334019999944 seconds
exists() on an existing directory took:  5.09718068899997 seconds
after:
exists() on a non-existent file took:  1.8688721430000896 seconds
exists() on a non-existent directory took:  1.1779657580000276 seconds
exists() on an existing file took:  1.170549130999916 seconds
exists() on an existing directory took:  2.316443058999994 seconds
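
The single-call check described in this comment can be sketched as follows. It is a minimal illustration under stated assumptions, not the change that went into s3fs: FakeS3Client, prefix_exists and file_exists are hypothetical names, and the fake client only imitates the Bucket/Prefix/MaxKeys arguments and the KeyCount/Contents fields of a list_objects_v2 response so that the example runs without network access.

```python
from typing import Any, Dict, List


class FakeS3Client:
    """Hypothetical in-memory stand-in for an S3 client (illustration only)."""

    def __init__(self, keys: List[str]) -> None:
        # S3 lists keys in UTF-8 binary order, so sort by the encoded bytes.
        self._keys = sorted(keys, key=lambda k: k.encode("utf-8"))

    def list_objects_v2(self, Bucket: str, Prefix: str = "", MaxKeys: int = 1000) -> Dict[str, Any]:
        matches = [k for k in self._keys if k.startswith(Prefix)][:MaxKeys]
        response: Dict[str, Any] = {"KeyCount": len(matches)}
        if matches:
            response["Contents"] = [{"Key": k} for k in matches]
        return response


def prefix_exists(client: Any, bucket: str, key: str) -> bool:
    """True if at least one key starts with `key` (file or directory-like)."""
    response = client.list_objects_v2(Bucket=bucket, Prefix=key, MaxKeys=1)
    return response.get("KeyCount", 0) > 0


def file_exists(client: Any, bucket: str, key: str) -> bool:
    """True if an object named exactly `key` exists.

    Relies on the listing order: an exact match sorts before every
    longer key that shares the prefix, so it is the first result.
    """
    response = client.list_objects_v2(Bucket=bucket, Prefix=key, MaxKeys=1)
    contents = response.get("Contents", [])
    return bool(contents) and contents[0]["Key"] == key
```

Comparing the first returned key with the requested name matters because a bare prefix match would also succeed for a longer sibling key such as "something-else".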
martindurant commented, Apr 28, 2021

There may be an argument for caching HEAD calls, I suppose. They would need to be invalidated separately, though, which adds a layer of complexity. Do you find repeated exists() calls on the same path a lot?

Yes, your understanding of _ls_from_cache is correct.
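
To make the caching trade-off mentioned in this comment concrete, here is a small sketch of a per-path existence cache with explicit invalidation. It is an assumption-laden illustration, not part of s3fs: ExistsCache and its probe callback are hypothetical, and the probe stands in for whatever function would issue the HEAD request.

```python
from typing import Callable, Dict


class ExistsCache:
    """Hypothetical cache for per-path existence checks (illustration only)."""

    def __init__(self, probe: Callable[[str], bool]) -> None:
        self._probe = probe  # e.g. a function that issues the HEAD request
        self._known: Dict[str, bool] = {}

    def exists(self, path: str) -> bool:
        # Only call the probe the first time a path is seen.
        if path not in self._known:
            self._known[path] = self._probe(path)
        return self._known[path]

    def invalidate(self, path: str) -> None:
        # Must be called after any write or delete on `path`,
        # otherwise exists() keeps returning the stale answer.
        self._known.pop(path, None)
```

The extra complexity is visible in invalidate(): every operation that creates or removes an object has to remember to call it, or later exists() calls report outdated results.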
