High memory consumption (50 GB) for genbank bacteria
See original GitHub issue
I was using ncbi-genome-download to fetch all GenBank bacteria (currently ~630k assemblies). This seems to be a challenging task, and I am unsure whether it is even feasible this way. I noticed it requires up to 50 GB of memory and wondered how that can be. I am currently debugging the tool to figure it out, but thought the cause might already be known to someone; that would save me some trouble. I am launching it like this:
ncbi-genome-download -F fasta -s genbank \
--human-readable \
--retries 2 \
--parallel 4 \
--no-cache \
--verbose \
-o genomes \
bacteria
Issue Analytics
- State:
- Created: 3 years ago
- Comments: 9 (3 by maintainers)
Top GitHub Comments
I mean, I agree that the tool could be more careful about releasing things from memory again, but it was built for workflows like “get me all assemblies of genus Streptomyces that have an assembly level of ‘complete’ or ‘chromosome’ from GenBank”, not “download all 630,000 bacterial genomes on the server”.
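For illustration, the kind of filtered invocation described above might look roughly like this. This is only a sketch: the --genera and --assembly-levels option spellings and their values are my assumption, so check ncbi-genome-download --help for the exact names in your version.

# Hypothetical filtered run: only complete/chromosome-level Streptomyces assemblies.
ncbi-genome-download -F fasta -s genbank \
--genera Streptomyces \
--assembly-levels complete,chromosome \
--parallel 4 \
-o genomes \
bacteria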
ncbi-genome-download was built to allow intelligent filtering of files, not to simply grab all of them, and only the large file downloads are run in parallel, which is why there is this collect-and-execute split. If you don’t use the tool to download everything at once, the memory consumption is very reasonable, and that is the use case I built it for. I’m happy to look at patches to restructure things, but it’s not trivial, and if you don’t really need the filtering, a simpler script that parses the assembly_summary.txt file and then just downloads all of those files might serve your needs better.
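A minimal shell sketch of that simpler approach is below. The assembly_summary.txt URL and the *_genomic.fna.gz naming scheme are assumptions based on NCBI's usual FTP layout, so verify them before relying on this.

# Fetch the GenBank bacteria assembly summary (assumed path on the NCBI FTP server).
wget -q https://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt

# Column 20 is ftp_path; the genomic FASTA for each assembly is assumed to be
# <ftp_path>/<basename of ftp_path>_genomic.fna.gz.
awk -F '\t' '!/^#/ {n = split($20, p, "/"); print $20 "/" p[n] "_genomic.fna.gz"}' \
    assembly_summary.txt > fasta_urls.txt

# Download with four parallel workers; wget -nc skips files that already exist,
# so an interrupted run can be resumed.
xargs -P 4 -n 1 wget -q -nc -P genomes/ < fasta_urls.txt

Because the URL list lives in a file rather than in memory, memory usage stays essentially flat no matter how many assemblies are listed.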