Mixed ploidy genotype_call_dataset
Overview
This is a proposal to support mixed-ploidy data within a single genotype_call_dataset.
It aims to have minimal impact on existing code and to avoid causing issues when implementing diploid-only functionality in the future.
Criticism is very welcome!
Within the VCF specification, mixed-ploidy data can be defined by using a mixture of genotype string (GT) lengths. For example, the following encodes one diploid (SAMPLE2) and two tetraploid (SAMPLE1, SAMPLE3) individuals:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 SAMPLE2 SAMPLE3
1 1000 . C G 60 PASS NS=3;DP=30 GT 0/0/0/0 0/0 0/0/0/1
1 2000 . T A 60 PASS NS=3;DP=30 GT 0/0/0/0 0/1 0/1/./.
Implementation
I think that the simplest approach to supporting mixed ploidy in a genotype_call_dataset is to introduce a second sentinel value, e.g. -2, which indicates a non-allele.
The above VCF would translate to the array:
import dask.array as da

# shape (variants, samples, ploidy); -1 = missing allele, -2 = non-allele
call_genotype = da.array([
    [[0, 0, 0, 0], [0, 0, -2, -2], [0, 0, 0, 1]],
    [[0, 0, 0, 0], [0, 1, -2, -2], [0, 1, -1, -1]],
])
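To make the translation from GT strings concrete, here is a minimal sketch of the padding step in plain Python. The helper name encode_gt and the constants MISSING and NON_ALLELE are hypothetical and not part of any existing API; the sketch only assumes the -1/-2 convention proposed above.

```python
MISSING = -1      # missing allele, written "." in the VCF
NON_ALLELE = -2   # padding for calls below the maximum ploidy


def encode_gt(gt, max_ploidy):
    """Encode one GT string (e.g. "0/1") as a list of length max_ploidy."""
    # Treat phased ("|") and unphased ("/") separators alike.
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) > max_ploidy:
        raise ValueError(f"genotype {gt!r} exceeds max_ploidy={max_ploidy}")
    codes = [MISSING if a == "." else int(a) for a in alleles]
    # Pad short genotypes with the non-allele sentinel.
    return codes + [NON_ALLELE] * (max_ploidy - len(codes))


# GT columns of the two VCF records shown above.
rows = [
    ["0/0/0/0", "0/0", "0/0/0/1"],
    ["0/0/0/0", "0/1", "0/1/./."],
]
call_genotype = [[encode_gt(gt, 4) for gt in row] for row in rows]
```

The resulting nested list has the same values as the array above and could be passed to an array constructor.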
A convenience mask could also be created, similar to the one indicating missing values:
call_genotype_mask = call_genotype == -1
call_genotype_non_allele = call_genotype < -1
We can also include the ploidy of each genotype call:
call_genotype_ploidy = (~call_genotype_non_allele).sum(axis=-1) # shape (variants, samples)
We can likewise include the ploidy of each sample or variant, using -1 to indicate mixed/inconsistent ploidy:
sample_ploidy_fixed = (call_genotype_ploidy[0,:] == call_genotype_ploidy).all(axis=0)
sample_ploidy = xr.where(sample_ploidy_fixed, call_genotype_ploidy[0,:], -1) # shape (samples,)
variant_ploidy_fixed = (call_genotype_ploidy[:,0] == call_genotype_ploidy.T).all(axis=0)
variant_ploidy = xr.where(variant_ploidy_fixed, call_genotype_ploidy[:,0], -1) # shape (variants,)
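As a worked check of the expressions above, the following sketch evaluates them on the example array. It uses NumPy in place of dask/xarray (an assumption made only to keep the example self-contained); the array logic is otherwise the same.

```python
import numpy as np

call_genotype = np.array([
    [[0, 0, 0, 0], [0, 0, -2, -2], [0, 0, 0, 1]],
    [[0, 0, 0, 0], [0, 1, -2, -2], [0, 1, -1, -1]],
])

call_genotype_mask = call_genotype == -1        # missing alleles
call_genotype_non_allele = call_genotype < -1   # padding, not alleles

# Missing alleles still count towards ploidy; only non-alleles are excluded.
call_genotype_ploidy = (~call_genotype_non_allele).sum(axis=-1)
# [[4 2 4]
#  [4 2 4]]

# Ploidy per sample: fixed if every variant agrees with the first variant.
sample_ploidy_fixed = (call_genotype_ploidy[0, :] == call_genotype_ploidy).all(axis=0)
sample_ploidy = np.where(sample_ploidy_fixed, call_genotype_ploidy[0, :], -1)
# [4 2 4]

# Ploidy per variant: fixed if every sample agrees with the first sample.
variant_ploidy_fixed = (call_genotype_ploidy[:, 0] == call_genotype_ploidy.T).all(axis=0)
variant_ploidy = np.where(variant_ploidy_fixed, call_genotype_ploidy[:, 0], -1)
# [-1 -1]
```

Each sample has a consistent ploidy across variants, while both variants mix diploid and tetraploid calls and are therefore reported as -1.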
Checking if a dataset has consistent ploidy
The most difficult aspect of this proposal is ensuring that mixed-ploidy datasets do not get used in a function that only supports diploid (or fixed ploidy) data.
Simply checking the size of the ploidy dimension would no longer be reliable because it would only indicate the maximum ploidy.
Calculating the actual ploidy of a (potentially mixed ploidy) dataset will ultimately require checking all genotype calls in the dataset for a value <= -2.
A workaround for this issue is to trust the user to specify if their data is mixed-ploidy or not during dataset creation by adding a mixed_ploidy argument (default=False) to create_genotype_call_dataset (or reader functions).
If this argument is False then sample_ploidy etc. can simply return the size of the ploidy dimension.
If it is True then sample_ploidy etc. are calculated as above.
The value of the mixed_ploidy argument would be stored as a dataset attribute to enable a quick check for fixed-ploidy only functions.
Functions that require ploidy to be fixed in the sample or variant dimension can check sample_ploidy or variant_ploidy respectively, which will be quick if mixed_ploidy=False.
The alternative to letting the user specify mixed_ploidy is to check the call_genotype array for values <= -2.
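The two options can be combined in a guard that fixed-ploidy functions call first. This is only a sketch under stated assumptions: require_fixed_ploidy is a hypothetical helper, attrs stands in for the dataset attributes as a plain dict, and NumPy replaces dask for brevity.

```python
import numpy as np


def require_fixed_ploidy(call_genotype, attrs):
    """Return the ploidy if it is fixed across the dataset, else raise."""
    max_ploidy = call_genotype.shape[-1]
    # Fast path: the dataset was declared fixed-ploidy at creation, so the
    # size of the ploidy dimension is trusted without scanning the data.
    if not attrs.get("mixed_ploidy", False):
        return max_ploidy
    # Slow path: scan every call for the non-allele sentinel (<= -2).
    if (call_genotype <= -2).any():
        raise ValueError("mixed-ploidy data is not supported by this function")
    return max_ploidy


diploid = np.array([[[0, 1], [1, 1]]])
mixed = np.array([[[0, 0, 0, 0], [0, 1, -2, -2]]])

fixed_result = require_fixed_ploidy(diploid, {"mixed_ploidy": False})  # 2
try:
    require_fixed_ploidy(mixed, {"mixed_ploidy": True})
    mixed_rejected = False
except ValueError:
    mixed_rejected = True
```

The fast path costs nothing, but it is only as reliable as the flag: if the user wrongly passes mixed_ploidy=False, padded calls go undetected.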
Implications for existing code
This proposal would add a mixed_ploidy argument and attribute to create_genotype_call_dataset with a default value of False.
It would also insert the following dask arrays into a genotype_call_dataset:
- call_genotype_non_allele with shape (variants, samples, ploidy)
- call_genotype_ploidy with shape (variants, samples)
- sample_ploidy with shape (samples,)
- variant_ploidy with shape (variants,)
Existing code that checks for the ploidy of a dataset would need to check if the dataset is mixed-ploidy via the new attribute.
Some functions including display_genotypes would need to be adapted to handle mixed ploidy data.

cyvcf2 v-0.20.8 now supports mixed ploidy in Genotype.array(), also using -2 to indicate non-alleles.
I have a very basic implementation of vcf_to_zarr_sequential that supports mixed ploidy. This requires the user to specify a maximum ploidy and will truncate genotypes if they exceed this maximum. I think this is the most practical way to go, probably throwing a warning if a genotype is truncated.
It looks like this should probably wait on #258.
That sounds great @timothymillar. I’ve just submitted #289, so the VCF migration should hopefully be done before too long.
BTW I had to make a small change to a test due to the mixed ploidy change, which shows that your change has been picked up!