BALSAMIC's merged somatic mutation VCF header: discussion of INFO/FORMAT tags
Hi!
BALSAMIC’s merged SNV and small indel VCF is finalized, and it will be the last piece of release 3.0.0. Before releasing, I wanted to give a heads-up on the annotations and tags, and to get feedback on them.
The header and four example variants after VEP annotation are pasted below (I removed the VEP annotation from the variants to keep the lines shorter, but you get the idea). The samples are also named NORMAL and TUMOR to make things clearer.
##fileformat=VCFv4.1
##fileDate=20190807
##source=VCFmerge
##source_version=0.0.1
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=AD,Number=R,Type=Float,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=1,Type=Float,Description="Allele fraction of the event">
##FORMAT=<ID=DP,Number=1,Type=Float,Description="Read depth in the sample">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Combined depth across samples recalculated from reported AD">
##INFO=<ID=VARCALLER,Number=.,Type=String,Description="Variant caller called this variant separated by comma">
##INFO=<ID=VARCALLER_FILTER,Number=.,Type=String,Description="Variant caller filters assigned to this variant separated by comma">
##INFO=<ID=VARCALLER_DP,Number=.,Type=String,Description="Variant caller depth assigned to this variant separated by comma">
##INFO=<ID=VARCALLER_COUNT,Number=1,Type=Integer,Description="Number of variant callers called this variant">
##INFO=<ID=VARCALLER_QUAL,Number=.,Type=String,Description="Variant quality assigned to this variant by variant callers separated by comma">
##INFO=<ID=VARCALLER_NORMAL_GT,Number=.,Type=String,Description="Genotype for NORMAL sample assigned by variant callers.">
##INFO=<ID=VARCALLER_TUMOR_GT,Number=.,Type=String,Description="Genotype for TUMOR sample assigned by variant callers.">
##INFO=<ID=TYPE,Number=1,Type=String,Description="Variant type assigned by bcftools 1.9. snp, mnp, indel, other ">
##INFO=<ID=GC,Number=1,Type=Float,Description="GC content around the variant">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##contig=<ID=1>
##contig=<ID=2>
##contig=<ID=3>
##contig=<ID=4>
##contig=<ID=5>
##contig=<ID=6>
##contig=<ID=7>
##contig=<ID=8>
##contig=<ID=9>
##contig=<ID=10>
##contig=<ID=11>
##contig=<ID=12>
##contig=<ID=13>
##contig=<ID=14>
##contig=<ID=15>
##contig=<ID=16>
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##contig=<ID=20>
##contig=<ID=21>
##contig=<ID=22>
##contig=<ID=Y>
##contig=<ID=X>
##VEP="v94" time="2019-08-06 11:01:13" cache="vep_cache" ensembl-io=94.8d53275 ensembl=94.5c08d90 ensembl-funcgen=94.08b0c13 ensembl-variation=94.066b102 1000genomes="phase3" COSMIC="81" ClinVar="201706" ESP="20141103" HGMD-PUBLIC="20164" assembly="GRCh37.p13" dbSNP="150" gencode="GENCODE 19" genebuild="2011-04" gnomAD="170228" polyphen="2.2.2" refseq="01_2015" regbuild="1.0" sift="sift5.2.2"
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|REFSEQ_MATCH|SOURCE|GIVEN_REF|USED_REF|BAM_EDIT|GENE_PHENO|SIFT|PolyPhen|DOMAINS|miRNA|HGVS_OFFSET|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|AA_AF|EA_AF|gnomAD_AF|gnomAD_AFR_AF|gnomAD_AMR_AF|gnomAD_ASJ_AF|gnomAD_EAS_AF|gnomAD_FIN_AF|gnomAD_NFE_AF|gnomAD_OTH_AF|gnomAD_SAS_AF|MAX_AF|MAX_AF_POPS|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
1 36932852 . C T . PASS TYPE=SNP;VARCALLER=mutect2;DP=2932;VARCALLER_FILTER=PASS;VARCALLER_DP=2932;VARCALLER_QUAL=.;VARCALLER_COUNT=1;VARCALLER_NORMAL_GT=0/0;VARCALLER_TUMOR_GT=0/1 DP:AD:AF:GT 1384:1376,8:1392:./. 1548:1538,10:1558:./.
1 215848824 . C T 377 PASS TYPE=SNP;VARCALLER=mutect2,vardict,strelka;DP=5589;VARCALLER_FILTER=clustered_events|multi_event_alt_allele_in_normal,PASS,PASS;VARCALLER_DP=2081,5589,5549;VARCALLER_QUAL=.,377,.;VARCALLER_COUNT=3;VARCALLER_NORMAL_GT=0/0,0/0,.;VARCALLER_TUMOR_GT=0/1,0/1,. DP:AD:AF:GT 2542:2541,1:2543:./. 3047:1664,1383:4430:./.
1 216371793 . A G 373 PASS TYPE=SNP;VARCALLER=vardict,strelka;DP=4784;VARCALLER_FILTER=PASS,LowEVS;VARCALLER_DP=4784,4746;VARCALLER_QUAL=373,.;VARCALLER_COUNT=2;VARCALLER_NORMAL_GT=0/1,.;VARCALLER_TUMOR_GT=0/1,. DP:AD:AF:GT 2188:1127,1061:3249:./. 2596:1418,1178:3774:./.
2 145156768 . G C 192 . TYPE=SNP;VARCALLER=mutect2,vardict,strelka;DP=5546;VARCALLER_FILTER=clustered_events|multi_event_alt_allele_in_normal,p8|P0.01Likely,LowEVS;VARCALLER_DP=1837,5546,5522;VARCALLER_QUAL=.,192,.;VARCALLER_COUNT=3;VARCALLER_NORMAL_GT=0/0,0/1,.;VARCALLER_TUMOR_GT=0/1,0/1,. DP:AD:AF:GT 2531:2512,19:2550:./. 3015:2977,38:3053:./.
This is the output of a package I am working on, called VCFmerge, which merges VCFs for SNPs/SNVs and small indels. It uses a number of models and statistics, built on several bioinformatics tools and Python packages, to prepare a final VCF file from multiple variant callers (it is essentially a wrapper). VCFmerge supports any standard VCF from any variant caller, as well as Strelka.
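As a quick sanity check on the example records above: the header defines INFO/DP as the combined depth across samples recalculated from the reported AD. Below is a minimal Python sketch of that check. The function names `depth_from_ad` and `info_depth` are hypothetical and not part of VCFmerge, and the sketch assumes whitespace-separated columns and whole-number AD values, as in the pasted records.

```python
def depth_from_ad(format_keys, sample_columns):
    """Sum the AD counts over all alleles and all samples."""
    ad_index = format_keys.split(":").index("AD")
    total = 0
    for sample in sample_columns:
        ad = sample.split(":")[ad_index]
        # AD is declared as Float in the header, but the example
        # records only carry whole numbers, so int() is used here.
        total += sum(int(count) for count in ad.split(","))
    return total


def info_depth(info):
    """Return the INFO/DP value (exact key match, so VARCALLER_DP is skipped)."""
    for item in info.split(";"):
        key, _, value = item.partition("=")
        if key == "DP":
            return int(value)
    raise KeyError("DP")


# First example record from the issue, columns separated by whitespace.
record = (
    "1 36932852 . C T . PASS "
    "TYPE=SNP;VARCALLER=mutect2;DP=2932;VARCALLER_FILTER=PASS;"
    "VARCALLER_DP=2932;VARCALLER_QUAL=.;VARCALLER_COUNT=1;"
    "VARCALLER_NORMAL_GT=0/0;VARCALLER_TUMOR_GT=0/1 "
    "DP:AD:AF:GT 1384:1376,8:1392:./. 1548:1538,10:1558:./."
)
columns = record.split()
info, format_keys, samples = columns[7], columns[8], columns[9:]

recomputed = depth_from_ad(format_keys, samples)
print(recomputed, info_depth(info), recomputed == info_depth(info))
```

The same check can be run on the other three example records by swapping in their text.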
All the INFO and FORMAT tags are computed from the input BAM files, and anything caller-specific is removed (MQ and GC are also recalculated). Some of the lost information is kept within the INFO/VARCALLER* tags, which I think can be used for display in Scout at the variant level.
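To show how the VARCALLER* tags could be unpacked for a per-caller display, here is a minimal Python sketch. It rests on two assumptions read off the example records, not confirmed by the header: the i-th value of each tag belongs to the i-th caller listed in VARCALLER, and multiple filters from a single caller are joined with `|`. The helper names `parse_info` and `per_caller` are hypothetical.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-style keys map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def per_caller(info):
    """Group the comma-separated VARCALLER_* values by caller name."""
    fields = parse_info(info)
    callers = fields["VARCALLER"].split(",")
    tags = {
        "filter": "VARCALLER_FILTER",
        "depth": "VARCALLER_DP",
        "qual": "VARCALLER_QUAL",
        "normal_gt": "VARCALLER_NORMAL_GT",
        "tumor_gt": "VARCALLER_TUMOR_GT",
    }
    calls = {caller: {} for caller in callers}
    for name, tag in tags.items():
        values = fields[tag].split(",")
        if len(values) != len(callers):
            raise ValueError(
                f"{tag} has {len(values)} values for {len(callers)} callers"
            )
        # Assumption: values are listed in the same order as VARCALLER.
        for caller, value in zip(callers, values):
            calls[caller][name] = None if value == "." else value
    # Assumption: several filters from one caller are joined with "|".
    for call in calls.values():
        call["filter"] = call["filter"].split("|") if call["filter"] else []
    return calls


# INFO column of the fourth example record from the issue.
info = (
    "TYPE=SNP;VARCALLER=mutect2,vardict,strelka;DP=5546;"
    "VARCALLER_FILTER=clustered_events|multi_event_alt_allele_in_normal,"
    "p8|P0.01Likely,LowEVS;VARCALLER_DP=1837,5546,5522;"
    "VARCALLER_QUAL=.,192,.;VARCALLER_COUNT=3;"
    "VARCALLER_NORMAL_GT=0/0,0/1,.;VARCALLER_TUMOR_GT=0/1,0/1,."
)
calls = per_caller(info)
for caller, call in calls.items():
    print(caller, call)
```

Missing values (`.`) are mapped to `None` so that a viewer can tell "not reported by this caller" apart from a real value.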
What are your thoughts on this? The new BALSAMIC release is well past due, so I would appreciate quick comments.
PS: this is dummy data, and these variants are not real somatic mutations.
Top GitHub Comments
Of course. I know where the issue is; when it goes into prod, I’ll reopen 😃
FYI, vcfmerge was a Python package I was working on for ensemble calling of somatic mutations based on posterior calls. But it was very hard to speed up…