Overzealous filtering of reads during CallDuplexConsensusReads
Hi,
I’m using GroupReadsByUmi followed by CallDuplexConsensusReads on high-depth, high-quality paired-end reads. GroupReadsByUmi produces a 25 GB BAM file, but when I then run CallDuplexConsensusReads, the output is only about 500 MB. I was expecting a similarly sized output file.
Here are the commands I run:
java -Xmx64g -jar fgbio.jar GroupReadsByUmi --strategy=paired --input=my_sample_mapped.bam --output=my_sample_groupedUMI.bam --raw-tag=RX --assign-tag=MI --min-map-q=10 --edits=1
java -Xmx64g -jar fgbio.jar CallDuplexConsensusReads --input=my_sample_groupedUMI.bam --output=my_sample_ds_consensus_unaligned.bam --error-rate-pre-umi=45 --error-rate-post-umi=30 --min-input-base-quality=10 --threads=12
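For intuition about why the output can be much smaller than the input even before any filtering: CallDuplexConsensusReads emits at most one consensus read pair per molecule, so deep UMI families collapse many raw read pairs into one. A minimal sketch (the MI values here are made up for illustration; in a real BAM they are assigned by GroupReadsByUmi):

```python
from collections import Counter

def consensus_shrinkage(mi_per_read_pair):
    """Given one MI value per raw read pair, return
    (raw read pairs, maximum possible consensus read pairs).
    Consensus calling collapses each molecule (each distinct MI)
    into at most one consensus read pair."""
    family_sizes = Counter(mi_per_read_pair)
    return sum(family_sizes.values()), len(family_sizes)

# Hypothetical families: molecule "1" seen 4x, "2" seen 2x, "3" once.
raw, consensus = consensus_shrinkage(["1", "1", "1", "1", "2", "2", "3"])
print(raw, consensus)  # 7 3
```

With high sequencing depth, family sizes are large, so a big size reduction is expected even when nothing is discarded; the question is whether the observed 50x reduction is all collapsing or partly filtering.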
Am I misunderstanding the output of CallDuplexConsensusReads? Is there a next step I need to do, such as merging CallDuplexConsensusReads’ output with some other BAM file, in order to get a BAM with all duplex consensus reads?
Issue Analytics
- Created 4 years ago
- Comments: 5 (3 by maintainers)
Top GitHub Comments
Yeah, that’s your problem. Setting min reads to 0, or perhaps --min-reads 1 0 0, would work. But beware that the vast majority of your output will then be consensus reads formed from single reads, which is largely the same as having the raw reads themselves.

The other possibility here is that the --min-reads defaults are causing a lot of reads to be discarded. By default, at least one read from each strand is required to form a duplex consensus. If you have a lot of molecules with reads from only one of the two original strands, then a lot of your data will be discarded/filtered and will not make it into a consensus.
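To make the strand requirement concrete, here is a small sketch of a --min-reads-style filter. This is not fgbio code; it only assumes MI values of the form "7/A" / "7/B" (molecule ID plus top/bottom-strand suffix), which is how GroupReadsByUmi with --strategy=paired labels the two strands of a molecule:

```python
from collections import defaultdict

def passing_molecules(mi_tags, min_total=1, min_a=1, min_b=1):
    """Count reads per strand for each molecule and keep the molecules that
    satisfy a --min-reads-style filter: a minimum total read count plus a
    minimum count from each of the two original strands (A and B)."""
    counts = defaultdict(lambda: {"A": 0, "B": 0})
    for tag in mi_tags:
        molecule, strand = tag.split("/")
        counts[molecule][strand] += 1
    kept = [m for m, c in counts.items()
            if c["A"] + c["B"] >= min_total
            and c["A"] >= min_a and c["B"] >= min_b]
    return sorted(kept)

# Hypothetical families: molecule 1 has reads from both strands,
# molecules 2 and 3 each have reads from only one strand.
tags = ["1/A", "1/A", "1/B", "2/A", "2/A", "3/B"]
print(passing_molecules(tags))            # ['1']  (both strands required)
print(passing_molecules(tags, 1, 0, 0))   # ['1', '2', '3']  (single-strand OK)
```

In this toy example the default-like filter keeps only one of three molecules, which mirrors how a library dominated by single-strand families can shrink dramatically at the consensus step.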