exists() on non-existent files might take up to 6 seconds

See original GitHub issue

In the following example, the first exists() call takes about 5.4 seconds and the next one takes about 3 seconds. This is probably because, when a key doesn’t exist, we first call head_object and then fall through to list_objects to check whether it is a directory. If neither check finds anything, we end up waiting about 6 seconds.

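import time
import s3fs

path = "existing-bucket/something"

fs = s3fs.S3FileSystem()
# Remove the key so the first exists() call hits a missing path.
fs.rm(path)

# Time exists() on a path that is not there.
t0 = time.perf_counter()
fs.exists(path)
t1 = time.perf_counter()
print("exists() on a non-existent file took: ", t1 - t0, "seconds")

# Create an empty object at the same path.
fs.touch(path)

# Time exists() again now that the object is present.
t0 = time.perf_counter()
fs.exists(path)
t1 = time.perf_counter()
print("exists() on an existing file took: ", t1 - t0, "seconds")

# Clean up the test object.
fs.rm(path)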

Top GitHub Comments

isidentical commented, Apr 29, 2021

So after a bit of research, I’ve found that the list_objects_v2 API, when used with a prefix, returns results in ascending UTF-8 binary order. This means that if there is a file whose name exactly matches the prefix we pass, it will be the first result. We can rely on this to call list_objects_v2 with max_keys=1 (just like we do for directories, but without the trailing /) and check whether anything matches that file.

“Objects are returned sorted in an ascending order of the respective key names in the list.” (ListObjectsV2)

“List results are always returned in UTF-8 binary order.” (Listing objects)

If there is, it will be the only result. This simple call can’t tell us whether the path is a directory, but we can establish that such a prefix exists by checking whether any keys are returned. That lets us determine a file’s or directory’s existence with a single API call, and for files it is about 1.5-2x faster.

One problem with this is that it doesn’t support object versions. So we can implement this particular case as an optimization when s3fs is initialized in non-version-aware mode. These are the results I got for the script shared:

before:
exists() on a non-existent file took:  5.687416751000001 seconds
exists() on a non-existent directory took:  5.066671551999889 seconds
exists() on an existing file took:  3.102334019999944 seconds
exists() on an existing directory took:  5.09718068899997 seconds

after:
exists() on a non-existent file took:  1.8688721430000896 seconds
exists() on a non-existent directory took:  1.1779657580000276 seconds
exists() on an existing file took:  1.170549130999916 seconds
exists() on an existing directory took:  2.316443058999994 seconds
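
The single-call check described in this comment could look roughly like the sketch below. It is not the s3fs implementation. The names probe and FakeS3Client are hypothetical, and the fake client is only an in-memory stand-in so the example runs without AWS access. The call shape (Bucket, Prefix, MaxKeys, and a response carrying Contents) follows boto3's list_objects_v2. Note that a non-empty result only shows that some key starts with the given prefix; comparing the first returned key to the requested one is what identifies an exact file match.

```python
class FakeS3Client:
    """Hypothetical in-memory stand-in for the one S3 call used below."""

    def __init__(self, keys):
        # S3 lists keys in UTF-8 binary order; sorting the encoded bytes mimics that.
        self._keys = sorted(keys, key=lambda k: k.encode("utf-8"))

    def list_objects_v2(self, Bucket, Prefix="", MaxKeys=1000):
        matches = [k for k in self._keys if k.startswith(Prefix)][:MaxKeys]
        response = {"KeyCount": len(matches)}
        if matches:
            # Like the real API, "Contents" is only present when something matched.
            response["Contents"] = [{"Key": k} for k in matches]
        return response


def probe(client, bucket, key):
    """Issue one listing call and return (prefix_has_keys, exact_key_match)."""
    response = client.list_objects_v2(Bucket=bucket, Prefix=key, MaxKeys=1)
    contents = response.get("Contents", [])
    if not contents:
        return False, False
    # Because of the ordering guarantee, an exact match would be listed first.
    return True, contents[0]["Key"] == key


client = FakeS3Client(["data/file.csv", "data/file.csv.bak", "logs/2021/a.log"])
print(probe(client, "existing-bucket", "data/file.csv"))  # exact file match
print(probe(client, "existing-bucket", "logs/2021"))      # prefix only
print(probe(client, "existing-bucket", "missing"))        # nothing found
```

Assuming boto3 is installed and credentials are configured, passing boto3.client("s3") in place of the fake client should work the same way, since probe only relies on list_objects_v2.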

martindurant commented, Apr 28, 2021

There may be an argument for caching HEAD calls, I suppose. They would need to be separately invalidated, though, so add a layer of complexity. Do you find repeated exists() calls on the same path a lot?

Yes, your understanding of _ls_from_cache is correct.
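
To illustrate what caching HEAD calls with separate invalidation might involve, here is a minimal sketch. Everything in it is an assumption for illustration: HeadCache, its ttl parameter, and fake_head are hypothetical names and are not part of s3fs. The idea shown is a small cache that remembers both hits and misses for a short time and has to be cleared explicitly whenever a path is written or removed.

```python
import time


class HeadCache:
    """Hypothetical cache of HEAD results, kept apart from any listing cache."""

    def __init__(self, head_func, ttl=5.0, clock=time.monotonic):
        self._head = head_func  # callable(path) -> dict, raises FileNotFoundError
        self._ttl = ttl         # seconds an entry stays valid
        self._clock = clock
        self._entries = {}      # path -> (expires_at, metadata or None)

    def exists(self, path):
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and entry[0] > now:
            # Fresh entry: answer without issuing another HEAD request.
            return entry[1] is not None
        try:
            info = self._head(path)
        except FileNotFoundError:
            info = None  # remember the miss as well
        self._entries[path] = (now + self._ttl, info)
        return info is not None

    def invalidate(self, path=None):
        """Drop one path, or everything when no path is given."""
        if path is None:
            self._entries.clear()
        else:
            self._entries.pop(path, None)


calls = []
store = {"existing-bucket/something"}


def fake_head(path):
    # Stand-in for a real HEAD request; records each call it receives.
    calls.append(path)
    if path not in store:
        raise FileNotFoundError(path)
    return {"Key": path}


cache = HeadCache(fake_head)
cache.exists("existing-bucket/missing")
cache.exists("existing-bucket/missing")  # served from the cache
print("HEAD calls issued:", len(calls))
```

The extra complexity is visible in invalidate: any code path that creates or deletes an object would need to call it, otherwise a cached miss could hide a file that now exists.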