gsutil cp hangs on many small files when running in parallel
I have a GCS bucket with millions of small files in different folders. When I run:
$ gsutil -m cp -r gs://my-bucket .
The process will eventually hang before completion, sometimes after 5 minutes and sometimes after several hours. This seems to be 100% reproducible. I’m using version 4.27, but this has happened in older versions as well. As a workaround I have to use:
$ gsutil cp -r gs://my-bucket .
which works but it takes several days to download everything so it’s not optimal.
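A possible middle ground (just a sketch; the batch size of 10000 and the file names are arbitrary) would be to split the object listing into batches and feed each batch to gsutil cp -I, so that a hang only affects a single batch that can be retried:

$ # List all objects once, then split the listing into batches of 10000 URLs.
$ gsutil ls "gs://my-bucket/**" > all_objects.txt
$ split -l 10000 all_objects.txt batch_
$ # Copy one batch at a time; -I reads the source URLs from stdin.
$ for f in batch_*; do gsutil -m cp -I . < "$f"; done

Note that cp -I names each downloaded object by its final path component, so the bucket’s folder structure is not preserved; this only illustrates the batching idea.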
Issue Analytics
- State:
- Created: 6 years ago
- Reactions: 9
- Comments: 24
Top Results From Across the Web

gsutil hangs on large file - google cloud platform
Here are some workarounds that you can try: You can try uploading it across multiple folders on a single bucket since there...

cp - Copy files and objects | Cloud Storage - Google Cloud
The gsutil cp command allows you to copy data between your local file system ... If you have a large number of files...

Easily parallelize large scale data copies into Google Cloud ...
This command runs pretty quick, 4 million lines takes about 3 seconds. The --number=r means we are round-robining the file names into...

Optimize data transfer between Compute Engine and Cloud ...
Useful for transferring a large number of files in parallel, not the upload ... time gsutil cp temp_30GB_file gs://doit-speed-test-bucket/ ...

gsutil cp – Copy and Move Files on Google Cloud
Learn how to use the gsutil cp command to copy files from local to ... -m option to upload large number of files...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
For what it’s worth, I found that using only threads for parallelization (and not child processes) appears to avoid the underlying deadlock here. e.g.
-o GSUtil:parallel_process_count=1 -o GSUtil:parallel_thread_count=24
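For reference, a complete invocation combining those overrides with the original command might look like this (the values 1 and 24 are simply the ones quoted above, not tuned recommendations); the same keys can also be set under the [GSUtil] section of the .boto config file:

$ gsutil -m -o "GSUtil:parallel_process_count=1" -o "GSUtil:parallel_thread_count=24" cp -r gs://my-bucket .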
I am still facing the issue: it reaches 99% of copied files and then terminates.