High memory consumption (50 GB) for genbank bacteria

I was using ncbi-genome-download to fetch all GenBank bacteria (currently 630k genomes). This seems to be a challenging task, and I am unsure whether it is even feasible this way. I noticed that it requires up to 50 GB of memory and wondered how that can be. I am currently debugging the tool to figure it out, but thought the cause might already be known to someone, which would save me some trouble. I am launching it like this:

ncbi-genome-download -F fasta -s genbank     \
    --human-readable        \
    --retries 2     \
    --parallel 4    \
    --no-cache     \
    --verbose     \
    -o genomes     \
    bacteria

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 9 (3 by maintainers)

Top GitHub Comments

2 reactions
kblin commented, May 18, 2020

I mean, I agree that the tool could be more careful with dumping things from memory again, but it was built for workflows like “get me all assemblies of genus Streptomyces that have an assembly level of ‘complete’ or ‘chromosome’ from GenBank”, and not “download all 630000 bacterial genomes on the server”.

1 reaction
kblin commented, May 18, 2020

ncbi-genome-download was built to allow intelligent filtering of files, not to just go grab all of them, and only the large file downloads were run in parallel, which is why there’s this collect and execute split. If you don’t use the tool to download all the things at once, the memory consumption is very reasonable, and that’s what I’ve built it for. I’m happy to look at patches to restructure things, but it’s not trivial, and if you don’t really need the filtering, a simpler script that parses the assembly_summary.txt file and then just downloads all those files might serve your needs better.
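For illustration, the simpler script suggested above could look roughly like the sketch below. It is not part of ncbi-genome-download. It assumes a local copy of assembly_summary.txt whose commented header line names an ftp_path column, and it assumes the genomic FASTA for each assembly sits at <ftp_path>/<last path component>_genomic.fna.gz. The function names iter_fasta_urls and download_all are hypothetical. The summary is read one line at a time and each file is fetched as soon as its URL is known, so memory use should not grow with the number of assemblies.

```python
import os
import urllib.request
from typing import Iterable, Iterator


def iter_fasta_urls(lines: Iterable[str]) -> Iterator[str]:
    """Lazily yield one genomic FASTA URL per assembly row.

    Assumption: the column header is a '#' comment line containing a
    tab-separated field called 'ftp_path'.
    """
    ftp_col = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Locate the ftp_path column by name instead of hardcoding an index.
            fields = line.lstrip("#").strip().split("\t")
            if "ftp_path" in fields:
                ftp_col = fields.index("ftp_path")
            continue
        if ftp_col is None:
            raise ValueError("no header line naming an ftp_path column")
        fields = line.split("\t")
        if len(fields) <= ftp_col or fields[ftp_col] in ("", "na"):
            continue  # row without a usable download location
        path = fields[ftp_col].rstrip("/")
        base = path.rsplit("/", 1)[-1]
        # Assumed naming convention for the genomic FASTA file.
        yield f"{path}/{base}_genomic.fna.gz"


def download_all(summary_path: str, out_dir: str) -> None:
    """Download every FASTA listed in the summary, one file at a time."""
    os.makedirs(out_dir, exist_ok=True)
    with open(summary_path, encoding="utf-8") as handle:
        for url in iter_fasta_urls(handle):
            target = os.path.join(out_dir, url.rsplit("/", 1)[-1])
            if not os.path.exists(target):  # skip files from earlier runs
                urllib.request.urlretrieve(url, target)
```

Calling download_all("assembly_summary.txt", "genomes") would then fetch the files sequentially. Retries, checksum verification and parallel downloads are deliberately left out of this sketch.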


