Overzealous filtering of reads during CallDuplexConsensusReads
Hi,
I’m using GroupReadsByUmi followed by CallDuplexConsensusReads on high-depth, high-quality paired-end reads. GroupReadsByUmi produces a 25 GB BAM file, but when I then run CallDuplexConsensusReads, the output is only about 500 MB. I was expecting a similarly sized output file.
Here are the commands I run:
java -Xmx64g -jar fgbio.jar GroupReadsByUmi --strategy=paired --input=my_sample_mapped.bam --output=my_sample_groupedUMI.bam --raw-tag=RX --assign-tag=MI --min-map-q=10 --edits=1
java -Xmx64g -jar fgbio.jar CallDuplexConsensusReads --input=my_sample_groupedUMI.bam --output=my_sample_ds_consensus_unaligned.bam --error-rate-pre-umi=45 --error-rate-post-umi=30 --min-input-base-quality=10 --threads=12
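For intuition about why the output can be much smaller than the input even before any filtering: CallDuplexConsensusReads emits at most one consensus read pair per molecule, so deep UMI families collapse many raw read pairs into one. A minimal sketch (the MI values here are made up for illustration; in a real BAM they are assigned by GroupReadsByUmi):

```python
from collections import Counter

def consensus_shrinkage(mi_per_read_pair):
    """Given one MI value per raw read pair, return
    (raw read pairs, maximum possible consensus read pairs).
    Consensus calling collapses each molecule (each distinct MI)
    into at most one consensus read pair."""
    family_sizes = Counter(mi_per_read_pair)
    return sum(family_sizes.values()), len(family_sizes)

# Hypothetical families: molecule "1" seen 4x, "2" seen 2x, "3" once.
raw, consensus = consensus_shrinkage(["1", "1", "1", "1", "2", "2", "3"])
print(raw, consensus)  # 7 3
```

With high sequencing depth, family sizes are large, so a big size reduction is expected even when nothing is discarded; the question is whether the observed 50x reduction is all collapsing or partly filtering.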
Am I misunderstanding the output of CallDuplexConsensusReads? Is there a next step I need to do, such as merging CallDuplexConsensusReads’ output with some other BAM file, in order to get a BAM with all duplex consensus reads?
Issue Analytics
- Created 4 years ago
- Comments: 5 (3 by maintainers)
Top GitHub Comments
Yeah, that’s your problem. Setting min reads to 0, or perhaps --min-reads 1 0 0, would work. But beware that the vast majority of your output will then be consensus reads formed from single reads, which is largely the same as having the raw reads themselves.

The other possibility here is that the --min-reads defaults are causing a lot of reads to be discarded. By default, at least one read from each strand is required to form a duplex consensus. If you have a lot of molecules with reads from only one of the two original strands, then a lot of your data will be discarded/filtered and will not make it into a consensus.
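To make the strand requirement concrete, here is a small sketch of a --min-reads-style filter. This is not fgbio code; it only assumes MI values of the form "7/A" / "7/B" (molecule ID plus top/bottom-strand suffix), which is how GroupReadsByUmi with --strategy=paired labels the two strands of a molecule:

```python
from collections import defaultdict

def passing_molecules(mi_tags, min_total=1, min_a=1, min_b=1):
    """Count reads per strand for each molecule and keep the molecules that
    satisfy a --min-reads-style filter: a minimum total read count plus a
    minimum count from each of the two original strands (A and B)."""
    counts = defaultdict(lambda: {"A": 0, "B": 0})
    for tag in mi_tags:
        molecule, strand = tag.split("/")
        counts[molecule][strand] += 1
    kept = [m for m, c in counts.items()
            if c["A"] + c["B"] >= min_total
            and c["A"] >= min_a and c["B"] >= min_b]
    return sorted(kept)

# Hypothetical families: molecule 1 has reads from both strands,
# molecules 2 and 3 each have reads from only one strand.
tags = ["1/A", "1/A", "1/B", "2/A", "2/A", "3/B"]
print(passing_molecules(tags))            # ['1']  (both strands required)
print(passing_molecules(tags, 1, 0, 0))   # ['1', '2', '3']  (single-strand OK)
```

In this toy example the default-like filter keeps only one of three molecules, which mirrors how a library dominated by single-strand families can shrink dramatically at the consensus step.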