Upstream deletions and CollectVariantCallingMetrics do not play nice right now.
The current VCF spec allows for a * allele (no brackets):
“The ‘*’ allele is reserved to indicate that the allele is missing due to a upstream deletion.”
CollectVariantCallingMetrics treats this as a third (size 1!!) allele so that in the case of
1 10347 . TAAACCCTA T 100 . AC=2 GT 0/1 0/1
1 10350 . A C,* 100 . AC=3 GT 1/2 0/2
both the 0/2 and 1/2 genotypes in the second line are counted towards TOTAL_MULTIALLELIC_SNPS (for the detailed metrics). Also, neither of these genotypes is counted towards TOTAL_SNPS (as that only captures biallelic SNPs). So upstream deletions are “hurting” both the monomorphic samples (as they get an inflated TOTAL_MULTIALLELIC_SNPS) and the polymorphic samples (as they get a deflated TOTAL_SNPS count).
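To make the miscount concrete, here is a minimal sketch of a counter that treats `*` as an ordinary length-1 allele. It is a simplified model written for this example, not Picard's actual code; the names `naive_site_class` and `naive_counts` and the class labels are hypothetical.

```python
from collections import Counter

# The two records from the example above: (chrom, pos, ref, alts, genotypes)
RECORDS = [
    ("1", 10347, "TAAACCCTA", ["T"], ["0/1", "0/1"]),
    ("1", 10350, "A", ["C", "*"], ["1/2", "0/2"]),
]


def naive_site_class(ref, alts):
    """Classify a site while treating '*' as an ordinary length-1 allele."""
    if len(ref) == 1 and all(len(alt) == 1 for alt in alts):
        return "multiallelic_snp" if len(alts) > 1 else "biallelic_snp"
    return "indel"


def naive_counts(records, sample_index):
    """Count variant genotypes per site class for one sample."""
    counts = Counter()
    for _chrom, _pos, ref, alts, genotypes in records:
        allele_indexes = genotypes[sample_index].split("/")
        # Any non-reference, non-missing allele index makes the genotype variant.
        if any(i not in ("0", ".") for i in allele_indexes):
            counts[naive_site_class(ref, alts)] += 1
    return counts


for sample in (0, 1):
    print(sample, dict(naive_counts(RECORDS, sample)))
```

In this model both samples get one multiallelic SNP and no biallelic SNP at position 10350, which mirrors the inflation and deflation described above.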
I propose changing this behavior so that an upstream deletion will count as the reference allele for the purpose of metrics.
I will also add a column or two to capture the number of upstream deletions, perhaps counting the 0/2 genotypes separately from the 1/2 genotypes.
Does this sound reasonable to folks?
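The proposal can be sketched roughly as follows, assuming single-base sites only. This is an illustration of the idea, not an implementation against HtsJdk or Picard; `summarize_genotype` and the `with_alt` / `ref_only` labels are hypothetical names standing in for whatever columns end up being added.

```python
SPAN_DEL = "*"  # VCF symbol for "allele missing due to an upstream deletion"


def summarize_genotype(ref, alts, genotype):
    """Return (site_class, counts_as_variant, upstream_deletion_kind).

    '*' is ignored when classifying the site and is treated like the
    reference allele when deciding whether the genotype is variant.
    """
    real_alts = [alt for alt in alts if alt != SPAN_DEL]
    indexes = [int(i) for i in genotype.replace("|", "/").split("/") if i != "."]
    called = [ref if i == 0 else alts[i - 1] for i in indexes]

    has_real_alt = any(allele in real_alts for allele in called)
    has_span_del = SPAN_DEL in called

    if not real_alts:
        site_class = "no_variant"
    elif len(real_alts) == 1:
        site_class = "biallelic_snp"
    else:
        site_class = "multiallelic_snp"

    if not has_span_del:
        deletion_kind = None
    elif has_real_alt:
        deletion_kind = "with_alt"   # e.g. the 1/2 genotype in the example
    else:
        deletion_kind = "ref_only"   # e.g. the 0/2 genotype in the example
    return site_class, has_real_alt, deletion_kind


print(summarize_genotype("A", ["C", "*"], "1/2"))
print(summarize_genotype("A", ["C", "*"], "0/2"))
```

Under this sketch the site at 10350 is a biallelic A>C SNP: the 1/2 sample counts towards the SNP total, the 0/2 sample does not count as variant, and each is tallied in its own upstream-deletion bucket.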
Issue Analytics
- Created 7 years ago
- Comments:22 (15 by maintainers)
Top GitHub Comments
still a thing! on my “todo” list too!
On Wed, Jan 18, 2017 at 8:23 PM, Geraldine Van der Auwera <notifications@github.com> wrote:
This needs to be put on hold until I modify VariantContext in HtsJdk…