High memory consumption (50 GB) for genbank bacteria

I was using ncbi-genome-download to fetch all GenBank bacteria (currently 630k genomes). This seems to be a challenging task, and I am unsure whether it is even feasible this way. I noticed that it requires up to 50 GB of memory and wondered how that can be. I am currently debugging the tool to figure it out, but thought the cause might already be known to someone, which would save me some trouble. I am launching it like this:

ncbi-genome-download -F fasta -s genbank \
    --human-readable \
    --retries 2 \
    --parallel 4 \
    --no-cache \
    --verbose \
    -o genomes \
    bacteria

Issue Analytics

Created 3 years ago
Comments: 9 (3 by maintainers)

Top GitHub Comments

kblin commented, May 18, 2020

I mean, I agree that the tool could be more careful about dumping things from memory again, but it was built for workflows like “get me all assemblies of genus Streptomyces that have an assembly level of ‘complete’ or ‘chromosome’ from GenBank”, and not “download all 630000 bacterial genomes on the server”.

kblin commented, May 18, 2020

ncbi-genome-download was built to allow intelligent filtering of files, not to just go grab all of them, and only the large file downloads were run in parallel, which is why there’s this collect and execute split. If you don’t use the tool to download all the things at once, the memory consumption is very reasonable, and that’s what I’ve built it for. I’m happy to look at patches to restructure things, but it’s not trivial and if you don’t really need the filtering, a simpler script parsing the assembly_summary.txt file and then just downloading all those files might serve your needs better.
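
The "simpler script" suggested above could look roughly like the sketch below. It is not part of ncbi-genome-download and is only one possible approach. It assumes a locally saved copy of the GenBank bacteria assembly_summary.txt, that the file is tab-separated with a commented header line naming an ftp_path column, and that each assembly directory contains a file named <directory name>_genomic.fna.gz. The function names iter_fasta_urls and download_all are hypothetical. Because the file is read line by line and each genome is streamed straight to disk, memory use should stay roughly constant regardless of how many assemblies are listed.

```python
import os
import shutil
import urllib.request
from typing import Iterable, Iterator


def iter_fasta_urls(lines: Iterable[str]) -> Iterator[str]:
    """Yield one genomic FASTA URL per assembly, reading one line at a time."""
    ftp_col = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            # Assumption: the column header is a comment line; find ftp_path by name.
            fields = line.lstrip("# ").split("\t")
            if "ftp_path" in fields:
                ftp_col = fields.index("ftp_path")
            continue
        if ftp_col is None or not line:
            continue
        fields = line.split("\t")
        if len(fields) <= ftp_col:
            continue
        path = fields[ftp_col].rstrip("/")
        if path in ("", "na"):
            # Some entries have no download location.
            continue
        if path.startswith("ftp://"):
            # Assumption: the same path is also served over HTTPS.
            path = "https://" + path[len("ftp://"):]
        name = path.rsplit("/", 1)[-1]
        yield f"{path}/{name}_genomic.fna.gz"


def download_all(summary_path: str, out_dir: str) -> None:
    """Download every FASTA file listed in the summary, one file at a time."""
    os.makedirs(out_dir, exist_ok=True)
    with open(summary_path, encoding="utf-8") as handle:
        for url in iter_fasta_urls(handle):
            target = os.path.join(out_dir, url.rsplit("/", 1)[-1])
            if os.path.exists(target):
                continue  # skip files that are already present
            # Stream the response to disk instead of holding it in memory.
            with urllib.request.urlopen(url) as resp, open(target, "wb") as out:
                shutil.copyfileobj(resp, out)
```

Calling download_all("assembly_summary.txt", "genomes") would then fetch the files sequentially. This sketch has no retries, checksum verification, or parallelism, so for 630k genomes those would need to be added separately.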