
some samples running easily, others never finishing with dedup


Hi folks,

I’m using umi_tools 1.0.0 on two cohorts of miRNA BAM files.

Set 1 has about 40 million reads per BAM, with the UMI stored in the RX tag in a format like “CAGC-CCAC”.

Set 2 has about 10 million reads per BAM, with slightly longer UMIs in the RX tag, e.g. “AACCTC-AAATTG”.

All dedup commands look like one of the following (I’ve tried both and gotten similar results):

umi_tools dedup -I ${1} --extract-umi-method=tag --umi-tag=RX -S ${1}.umi_tools_100_deduplicated.bam --output-stats=${1}.umi_tools_100_deduplicated.stats

umi_tools dedup -I ${1} --extract-umi-method=tag --umi-tag=RX --read-length -S ${1}.umi_tools_100_deduplicated_read_length.bam --output-stats=${1}.umi_tools_100_deduplicated_read_length.stats

My Set 1 commands reliably finish in under a day. About half of the Set 2 datasets are killed on my cluster once their RAM usage exceeds 355 GB.

Do you have any suggestions, or things I could look into, to get this running well on all my samples?

Thanks, Richard

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

IanSudbery commented, Jun 6, 2019

See [link] for advice on speeding up dedup and reducing memory usage.

Running time and memory usage depend far more on the length of the UMI and the level of duplication than on the total number of reads.
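As a toy illustration of the length effect (my sketch, not from the thread): with the default edit-distance threshold of 1, every UMI of length L has 3·L sequences within one substitution, and longer UMIs also allow many more distinct UMIs per position, so the candidate networks the tool must build grow with UMI length. The example UMIs below match the two sets described above, hyphens removed:

```shell
# Toy arithmetic, not a umi_tools invocation: a UMI of length L has
# 3*L sequences within Hamming distance 1 (each of L positions can
# change to 3 other bases), so candidate networks grow with UMI length.
umi_set1="CAGCCCAC"       # Set 1 style UMI, 8 nt (hyphen removed)
umi_set2="AACCTCAAATTG"   # Set 2 style UMI, 12 nt (hyphen removed)
echo "Set 1 neighbours per UMI: $((3 * ${#umi_set1}))"   # 24
echo "Set 2 neighbours per UMI: $((3 * ${#umi_set2}))"   # 36
```

On top of that, pairwise edit-distance comparisons scale with the square of the number of unique UMIs at a position, which is why a modest increase in UMI length can have an outsized cost.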

The biggest improvement you can make here is to not generate the stats. Stats generation is by far the biggest time and memory hog, because it randomly samples reads from the file to compute a null distribution.
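Applied to the first command in the question, that advice amounts to dropping `--output-stats` (all other flags unchanged; the `${1}` placeholder and output filename are as in the original):

```shell
# Same dedup call as the first command above, minus --output-stats,
# which the maintainer identifies as the main time/memory cost.
umi_tools dedup \
    -I "${1}" \
    --extract-umi-method=tag \
    --umi-tag=RX \
    -S "${1}.umi_tools_100_deduplicated.bam"
```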

IanSudbery commented, Nov 4, 2019

Hi Richard,

I hope you eventually managed to find a satisfactory way through this.

We are currently in the process of applying for funding to make a real improvement in the efficiency of UMI-tools. If you are still interested in the tool, I wondered if you might be able to support the application by writing a letter saying how useful it would be for you if UMI-tools ran faster and used less memory?
