glob is inefficient because it iterates directories that have already been scanned
I tried the glob method and found it too slow when there are millions of files in the directory.
It turns out that glob first calls the list_objects_v2 API to get every entry (folders and files alike), checks each entry to see whether it is a folder, and then scans those folders again.
The algorithm is correct for a traditional file system but inefficient on S3: S3 already returns every object in response to list_objects_v2, so iterating over the subfolders is unnecessary.
Is it possible to fix this in s3path, or can it only be fixed in pathlib?
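To illustrate the idea, here is a minimal sketch of matching a glob pattern against a flat key listing in a single pass, without descending into subfolders. It is not s3path's implementation: `glob_to_regex` and `flat_glob` are hypothetical names, the translator handles only `**/`, `*` and `?`, and the keys are assumed to come from one paginated list_objects_v2 listing of the bucket or prefix.

```python
import re
from typing import Iterable, Iterator


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a simple glob pattern into a regex over full object keys.

    Hypothetical helper: supports only `**/`, `*` and `?`.
    Character classes such as `[abc]` are not handled.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            # Zero or more whole path segments.
            parts.append(r"(?:[^/]+/)*")
            i += 3
        elif pattern[i] == "*":
            # Any run of characters inside one path segment.
            parts.append(r"[^/]*")
            i += 1
        elif pattern[i] == "?":
            # Exactly one character inside one path segment.
            parts.append(r"[^/]")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts) + r"\Z")


def flat_glob(keys: Iterable[str], pattern: str) -> Iterator[str]:
    """Yield the keys that match `pattern`, reading the listing only once."""
    regex = glob_to_regex(pattern)
    for key in keys:
        if regex.match(key):
            yield key


# Stand-in for the keys returned by a single flat listing (assumed data).
keys = ["a.txt", "logs/", "logs/2021/a.log", "logs/2021/b.txt", "logs/c.log"]
matches = list(flat_glob(keys, "logs/**/*.log"))
```

Because `keys` can be any iterable, the listing can be streamed page by page instead of being collected into a list first.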
Issue Analytics
- State:
- Created 2 years ago
- Comments: 11 (4 by maintainers)
Top GitHub Comments
Oh man, python why!? That’s a mega bummer!
@four43 yes, you are right. One of the optimizations I want to make is to remove this list creation in the S3 implementation.