some samples running easily, others never finishing with dedup
Hi folks,
I’m using umi_tools 1.0.0 on two cohorts of miRNA BAM files.
Set 1 has about 40 million reads per BAM, with the UMI stored in the RX tag in a format like “CAGC-CCAC”.
Set 2 has about 10 million reads per BAM, with slightly longer UMIs in the RX tag, e.g. “AACCTC-AAATTG”.
All dedup commands look like one of the following (I’ve tried both and gotten similar results):
umi_tools dedup -I ${1} --extract-umi-method=tag --umi-tag=RX -S ${1}.umi_tools_100_deduplicated.bam --output-stats=${1}.umi_tools_100_deduplicated.stats
umi_tools dedup -I ${1} --extract-umi-method=tag --umi-tag=RX --read-length -S ${1}.umi_tools_100_deduplicated_read_length.bam --output-stats=${1}.umi_tools_100_deduplicated_read_length.stats
My Set 1 commands dependably finish in less than a day. About half of the Set 2 jobs are killed on my cluster after their RAM usage exceeds 355 GB.
Do you have any suggestions, or things I could look into, to get this running well on all my samples?
Thanks, Richard
Issue Analytics
- State:
- Created 4 years ago
- Comments:6 (3 by maintainers)
Top GitHub Comments
See https://umi-tools.readthedocs.io/en/latest/faq.html for advice on speeding up/memory usage.
The running time/memory is far more dependent on the length of the UMI and the level of duplication than it is on the total number of reads.
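A back-of-the-envelope illustration of the UMI-length point (my own arithmetic, not UMI-tools code, and it ignores the “-” separator): the space of possible UMI sequences grows as 4^length, so the 12 nt UMIs in Set 2 admit vastly more distinct values per position than the 8 nt UMIs in Set 1, which means larger UMI networks to resolve and more memory:

```shell
# Possible UMI sequences grow as 4^length (A/C/G/T per base).
set1=$(( 4 ** 8 ))   # 8 nt UMI, e.g. CAGC-CCAC  -> 65536
set2=$(( 4 ** 12 ))  # 12 nt UMI, e.g. AACCTC-AAATTG -> 16777216
echo "Set 1 UMI space: ${set1}"
echo "Set 2 UMI space: ${set2}"
```

So even with a quarter of the reads, Set 2 positions can accumulate far more distinct UMIs, which is consistent with its much higher memory use.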
The biggest improvement you can make here is to skip generating the stats. Stats generation is by far the largest time and memory hog, since it randomly samples reads from the file to compute a null distribution.
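Following that advice, a re-run of the problematic samples would use the same command as in the question with only the --output-stats flag dropped (a sketch; ${1} is the input BAM, as in the original commands):

```shell
# Build the dedup command from the question, minus --output-stats
# (the main time/memory hog per the maintainers).
bam="${1:-sample.bam}"
cmd="umi_tools dedup -I ${bam} --extract-umi-method=tag --umi-tag=RX -S ${bam}.umi_tools_100_deduplicated.bam"
echo "${cmd}"
# Run it with: eval "${cmd}"
```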
Hi Richard,
I hope you eventually managed to find a satisfactory way through this.
We are currently in the process of applying for funding to make a real improvement in the efficiency of UMI-tools. If you are still interested in the tool, I wondered if you might be able to support the application by writing a letter saying how useful it would be for you if UMI-tools ran faster and used less memory?