Mixed ploidy genotype_call_dataset
Overview
This is a proposal to support mixed-ploidy data within a single genotype_call_dataset.
It aims to have minimal impact on existing code and to avoid causing issues when implementing diploid-only functionality in the future.
Criticism is very welcome!
Within the VCF specification, mixed-ploidy data can be defined by using a mixture of genotype string (GT) lengths. For example, the following encodes one diploid and two tetraploid individuals:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 SAMPLE2 SAMPLE3
1 1000 . C G 60 PASS NS=3;DP=30 GT 0/0/0/0 0/0 0/0/0/1
1 2000 . T A 60 PASS NS=3;DP=30 GT 0/0/0/0 0/1 0/1/./.
Implementation
I think that the simplest approach to supporting mixed ploidy in a genotype_call_dataset is to introduce a second sentinel value, e.g. -2, which indicates a non-allele.
The above VCF would translate to the array:
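import dask.array as da
import xarray as xr  # used for xr.where further down

# shape (variants, samples, ploidy): -1 marks a missing allele, -2 a non-allele
call_genotype = da.array([
    [[0, 0, 0, 0], [0, 0, -2, -2], [0, 0, 0, 1]],
    [[0, 0, 0, 0], [0, 1, -2, -2], [0, 1, -1, -1]],
])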
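As a rough sketch of how GT strings might be converted to this padded encoding, the helpers below map missing alleles to -1 and pad shorter genotypes with -2. The names encode_gt and encode_calls are hypothetical and not part of any existing API, and plain NumPy is used in place of dask to keep the example self-contained.

```python
import re

import numpy as np

MISSING = -1     # allele exists but is unknown ('.')
NON_ALLELE = -2  # padding beyond the ploidy of the call


def encode_gt(gt, max_ploidy):
    """Encode one GT string as a list of length max_ploidy (hypothetical helper)."""
    # Alleles may be separated by '/' (unphased) or '|' (phased).
    alleles = [MISSING if a == "." else int(a) for a in re.split(r"[/|]", gt)]
    if len(alleles) > max_ploidy:
        raise ValueError(f"genotype {gt!r} exceeds max_ploidy={max_ploidy}")
    return alleles + [NON_ALLELE] * (max_ploidy - len(alleles))


def encode_calls(rows, max_ploidy):
    """Encode a (variants x samples) nested list of GT strings as an integer array."""
    return np.array([[encode_gt(gt, max_ploidy) for gt in row] for row in rows])


# GT columns of the two VCF records shown above.
gt_strings = [
    ["0/0/0/0", "0/0", "0/0/0/1"],
    ["0/0/0/0", "0/1", "0/1/./."],
]
call_genotype = encode_calls(gt_strings, max_ploidy=4)
print(call_genotype.shape)  # (2, 3, 4)
```

This sketch raises an error when a genotype exceeds max_ploidy. That is only one possible policy.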
A convenience mask could also be created, similar to the one indicating missing values:
call_genotype_mask = call_genotype == -1
call_genotype_non_allele = call_genotype < -1
We can also include the ploidy of each genotype call:
call_genotype_ploidy = (~call_genotype_non_allele).sum(axis=-1) # shape (variants, samples)
We can likewise derive the ploidy of each sample or variant, using -1 to indicate mixed/inconsistent ploidy:
sample_ploidy_fixed = (call_genotype_ploidy[0,:] == call_genotype_ploidy).all(axis=0)
sample_ploidy = xr.where(sample_ploidy_fixed, call_genotype_ploidy[0,:], -1) # shape (samples,)
variant_ploidy_fixed = (call_genotype_ploidy[:,0] == call_genotype_ploidy.T).all(axis=0)
variant_ploidy = xr.where(variant_ploidy_fixed, call_genotype_ploidy[:,0], -1) # shape (variants,)
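To make the resulting values concrete, here is the same calculation run on the example array. This is a sketch that substitutes NumPy (np.where) for dask and xr.where so that it runs without those dependencies. The logic is otherwise unchanged.

```python
import numpy as np

call_genotype = np.array([
    [[0, 0, 0, 0], [0, 0, -2, -2], [0, 0, 0, 1]],
    [[0, 0, 0, 0], [0, 1, -2, -2], [0, 1, -1, -1]],
])

# Non-alleles are padding; missing alleles (-1) still count towards ploidy.
call_genotype_non_allele = call_genotype < -1
call_genotype_ploidy = (~call_genotype_non_allele).sum(axis=-1)

# A sample has fixed ploidy if every variant matches its first call.
sample_ploidy_fixed = (call_genotype_ploidy[0, :] == call_genotype_ploidy).all(axis=0)
sample_ploidy = np.where(sample_ploidy_fixed, call_genotype_ploidy[0, :], -1)

# A variant has fixed ploidy if every sample matches the first sample.
variant_ploidy_fixed = (call_genotype_ploidy[:, 0] == call_genotype_ploidy.T).all(axis=0)
variant_ploidy = np.where(variant_ploidy_fixed, call_genotype_ploidy[:, 0], -1)

print(call_genotype_ploidy.tolist())  # [[4, 2, 4], [4, 2, 4]]
print(sample_ploidy.tolist())         # [4, 2, 4]
print(variant_ploidy.tolist())        # [-1, -1]
```

Each sample keeps the same ploidy across both variants, so sample_ploidy is well defined, while both variants mix diploid and tetraploid calls and are therefore reported as -1.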
Checking if a dataset has consistent ploidy
The most difficult aspect of this proposal is ensuring that mixed-ploidy datasets do not get used in a function that only supports diploid (or fixed ploidy) data.
Simply checking the size of the ploidy dimension would no longer be reliable because it would only indicate the maximum ploidy.
Calculating the actual ploidy of a (potentially mixed-ploidy) dataset will ultimately require checking all genotype calls in the dataset for a value <= -2.
A workaround for this issue is to trust the user to specify whether their data is mixed-ploidy during dataset creation by adding a mixed_ploidy argument (default=False) to create_genotype_call_dataset (or reader functions).
If this argument is False then sample_ploidy etc. can simply return the size of the ploidy dimension.
If it is True then sample_ploidy etc. are calculated as above.
The value of the mixed_ploidy argument would be stored as a dataset attribute to enable a quick check for fixed-ploidy only functions.
Functions that require ploidy to be fixed in the sample or variant dimension can check sample_ploidy or variant_ploidy respectively, which will be quick if mixed_ploidy=False.
The alternative to letting the user specify mixed_ploidy is to check the call_genotype array for values <= -2.
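The two options can be sketched as follows. This is an illustration only: require_fixed_ploidy and has_non_alleles are hypothetical names, a plain dict stands in for the dataset attributes, and NumPy stands in for dask.

```python
import numpy as np


def has_non_alleles(call_genotype):
    """Full scan: True if any call contains a non-allele sentinel (<= -2)."""
    return bool((np.asarray(call_genotype) <= -2).any())


def require_fixed_ploidy(attrs, call_genotype=None):
    """Raise if the data is flagged as, or optionally found to be, mixed-ploidy.

    `attrs` stands in for the dataset attributes and `mixed_ploidy` is the
    attribute proposed above. Passing `call_genotype` enables the slower scan.
    """
    # Cheap check: trust the attribute set at dataset creation.
    if attrs.get("mixed_ploidy", False):
        raise ValueError("this function requires fixed-ploidy data")
    # Expensive check: inspect every genotype call.
    if call_genotype is not None and has_non_alleles(call_genotype):
        raise ValueError("mixed_ploidy is False but non-allele values were found")


diploid = np.array([[[0, 1], [1, 1]]])
mixed = np.array([[[0, 1, -2, -2], [0, 0, 0, 1]]])

require_fixed_ploidy({"mixed_ploidy": False}, diploid)  # passes silently
print(has_non_alleles(mixed))  # True
```

The attribute check costs nothing but relies on the user, whereas the scan is reliable but touches the whole array, which matters for large lazily evaluated datasets.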
Implications for existing code
This proposal would add a mixed_ploidy argument and attribute to create_genotype_call_dataset with a default value of False.
It would also insert the following dask arrays into a genotype_call_dataset:
- call_genotype_non_allele with shape (variants, samples, ploidy)
- call_genotype_ploidy with shape (variants, samples)
- sample_ploidy with shape (samples,)
- variant_ploidy with shape (variants,)
Existing code that checks for the ploidy of a dataset would need to check if the dataset is mixed-ploidy via the new attribute.
Some functions, including display_genotypes, would need to be adapted to handle mixed-ploidy data.
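As an illustration of the kind of adaptation involved, a display routine could drop non-alleles when rendering a call. The helper below is a hypothetical sketch and does not reflect how display_genotypes is actually implemented.

```python
def format_call(alleles, phased=False):
    """Render one genotype call, skipping non-alleles (hypothetical helper)."""
    sep = "|" if phased else "/"
    # Values <= -2 are padding and are dropped; -1 is shown as '.'.
    return sep.join("." if a == -1 else str(a) for a in alleles if a > -2)


print(format_call([0, 0, -2, -2]))  # 0/0
print(format_call([0, 1, -1, -1]))  # 0/1/./.
```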
Top GitHub Comments
cyvcf2 v-0.20.8 now supports mixed ploidy in Genotype.array(), also using -2 to indicate non-alleles.
I have a very basic implementation of vcf_to_zarr_sequential that supports mixed ploidy. This requires the user to specify a maximum ploidy and will truncate genotypes if they exceed this maximum. I think this is the most practical way to go, probably throwing a warning if a genotype is truncated.
It looks like this should probably wait on #258.
That sounds great @timothymillar. I’ve just submitted #289, so the VCF migration should hopefully be done before too long.
BTW I had to make a small change to a test due to the mixed ploidy change, which shows that your change has been picked up!