Mixed ploidy genotype_call_dataset


Overview

This is a proposal to support mixed-ploidy data within a single genotype_call_dataset. It aims to have minimal impact on existing code and to avoid causing issues when implementing diploid-only functionality in the future. Criticism is very welcome!

Within the VCF specification, mixed-ploidy data can be defined by using a mixture of genotype string (GT) lengths. For example the following encodes one diploid and two tetraploid individuals:

#CHROM    POS    ID    REF    ALT   QUAL    FILTER    INFO         FORMAT    SAMPLE1    SAMPLE2    SAMPLE3
1         1000   .     C      G     60      PASS      NS=3;DP=30   GT        0/0/0/0    0/0        0/0/0/1
1         2000   .     T      A     60      PASS      NS=3;DP=30   GT        0/0/0/0    0/1        0/1/./.

Implementation

I think that the simplest approach to supporting mixed ploidy in a genotype_call_dataset is to introduce a second sentinel value, e.g. -2, which indicates a non-allele. The above VCF would translate to the array:

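import dask.array as da
import xarray as xr  # used below for xr.where

# shape (variants, samples, ploidy); -1 = missing allele, -2 = non-allele
call_genotype = da.array([
    [[0, 0, 0, 0], [0, 0, -2, -2], [0, 0, 0, 1]],
    [[0, 0, 0, 0], [0, 1, -2, -2], [0, 1, -1, -1]]
])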
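
As a minimal sketch of how the GT strings above could map onto this padded encoding, the hypothetical helper `encode_gt` below (an illustration, not part of sgkit or any VCF library) splits a genotype string, uses -1 for missing alleles, and pads with -2 up to an assumed size of the ploidy dimension:

```python
import re

MISSING = -1      # allele is missing ('.')
NON_ALLELE = -2   # padding for genotypes below the maximum ploidy


def encode_gt(gt, max_ploidy):
    """Encode a VCF GT string as a fixed-length list of allele indices."""
    # GT alleles are separated by '/' (unphased) or '|' (phased)
    alleles = [MISSING if a == "." else int(a) for a in re.split(r"[/|]", gt)]
    if len(alleles) > max_ploidy:
        raise ValueError(f"genotype {gt!r} exceeds max_ploidy={max_ploidy}")
    return alleles + [NON_ALLELE] * (max_ploidy - len(alleles))


# Second variant (POS 2000) from the VCF example above
row = [encode_gt(gt, 4) for gt in ["0/0/0/0", "0/1", "0/1/./."]]
print(row)  # [[0, 0, 0, 0], [0, 1, -2, -2], [0, 1, -1, -1]]
```

Here the maximum ploidy is supplied by the caller rather than inferred, so the output shape is known before any genotypes are read.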

A convenience mask could also be created, similar to the one indicating missing values:

call_genotype_mask = call_genotype == -1
call_genotype_non_allele = call_genotype < -1

We can also include the ploidy of each genotype call:

call_genotype_ploidy = (~call_genotype_non_allele).sum(axis=-1)  # shape (variants, samples)

We can also compute the ploidy of each sample or variant, using -1 to indicate mixed/inconsistent ploidy:

sample_ploidy_fixed = (call_genotype_ploidy[0,:] == call_genotype_ploidy).all(axis=0)
sample_ploidy = xr.where(sample_ploidy_fixed, call_genotype_ploidy[0,:], -1)  # shape (samples,)

variant_ploidy_fixed = (call_genotype_ploidy[:,0] == call_genotype_ploidy.T).all(axis=0)
variant_ploidy = xr.where(variant_ploidy_fixed, call_genotype_ploidy[:,0], -1)  # shape (variants,)
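
To make the expected values concrete, here is the same calculation run eagerly on the two-variant example. This is only a sketch: it swaps in NumPy (`np.array`, `np.where`) for dask and xarray so that results can be printed directly.

```python
import numpy as np

call_genotype = np.array([
    [[0, 0, 0, 0], [0, 0, -2, -2], [0, 0, 0, 1]],
    [[0, 0, 0, 0], [0, 1, -2, -2], [0, 1, -1, -1]],
])

call_genotype_non_allele = call_genotype < -1
# Missing alleles (-1) still count towards ploidy; only -2 padding is excluded
call_genotype_ploidy = (~call_genotype_non_allele).sum(axis=-1)

sample_ploidy_fixed = (call_genotype_ploidy[0, :] == call_genotype_ploidy).all(axis=0)
sample_ploidy = np.where(sample_ploidy_fixed, call_genotype_ploidy[0, :], -1)

variant_ploidy_fixed = (call_genotype_ploidy[:, 0] == call_genotype_ploidy.T).all(axis=0)
variant_ploidy = np.where(variant_ploidy_fixed, call_genotype_ploidy[:, 0], -1)

print(call_genotype_ploidy.tolist())  # [[4, 2, 4], [4, 2, 4]]
print(sample_ploidy.tolist())         # [4, 2, 4]
print(variant_ploidy.tolist())        # [-1, -1]
```

Each sample has a consistent ploidy across variants, but both variants mix diploid and tetraploid calls, so variant_ploidy is -1 for both.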

Checking if a dataset has consistent ploidy

The most difficult aspect of this proposal is ensuring that mixed-ploidy datasets are not passed to functions that only support diploid (or fixed-ploidy) data. Simply checking the size of the ploidy dimension would no longer be reliable, because it only indicates the maximum ploidy. Calculating the actual ploidy of a (potentially mixed-ploidy) dataset ultimately requires checking every genotype call in the dataset for a value <= -2.

A workaround for this issue is to trust the user to specify whether their data is mixed-ploidy during dataset creation, by adding a mixed_ploidy argument (default=False) to create_genotype_call_dataset (or reader functions). If this argument is False, then sample_ploidy etc. can simply return the size of the ploidy dimension. If it is True, then sample_ploidy etc. are calculated as above. The value of the mixed_ploidy argument would be stored as a dataset attribute to enable a quick check in fixed-ploidy-only functions. Functions that require ploidy to be fixed in the sample or variant dimension can check sample_ploidy or variant_ploidy respectively, which will be quick if mixed_ploidy=False.

The alternative to letting the user specify mixed_ploidy is to check the call_genotype array for values <= -2.
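
The two options could be sketched roughly as follows. Both function names are hypothetical, `attrs` is a plain dict standing in for the dataset attributes, and the "mixed_ploidy" key is the attribute proposed here rather than something that already exists. NumPy is used in place of dask for brevity.

```python
import numpy as np


def has_non_alleles(call_genotype):
    """Full scan: True if any call contains a non-allele (value <= -2)."""
    return bool((np.asarray(call_genotype) <= -2).any())


def require_fixed_ploidy(attrs, call_genotype=None):
    """Raise if the data is flagged as, or found by scanning to be, mixed-ploidy."""
    # Cheap check: trust the user-supplied flag stored at dataset creation
    if attrs.get("mixed_ploidy", False):
        raise ValueError("mixed-ploidy data is not supported by this function")
    # Optional, slower check that does not rely on the flag being correct
    if call_genotype is not None and has_non_alleles(call_genotype):
        raise ValueError("call_genotype contains non-alleles (<= -2)")


require_fixed_ploidy({"mixed_ploidy": False})  # passes silently
```

The flag check costs nothing, while the scan touches every genotype call, which is the trade-off described above.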

Implications for existing code

This proposal would add a mixed_ploidy argument and attribute to create_genotype_call_dataset with a default value of False. It would also insert the following dask arrays into a genotype_call_dataset:

  • call_genotype_non_allele with shape (variants, samples, ploidy)
  • call_genotype_ploidy with shape (variants, samples)
  • sample_ploidy with shape (samples,)
  • variant_ploidy with shape (variants,)

Existing code that checks the ploidy of a dataset would need to check whether the dataset is mixed-ploidy via the new attribute. Some functions, including display_genotypes, would need to be adapted to handle mixed-ploidy data.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 13

Top GitHub Comments

3 reactions
timothymillar commented, Sep 30, 2020

cyvcf2 v-0.20.8 now supports mixed ploidy in Genotype.array() also using -2 to indicate non-alleles.

I have a very basic implementation of vcf_to_zarr_sequential that supports mixed ploidy. This requires the user to specify a maximum ploidy and will truncate genotypes if they exceed this maximum. I think this is the most practical way to go, probably throwing a warning if a genotype is truncated.
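
The pad-or-truncate behaviour described above might look something like the sketch below. `fit_to_max_ploidy` is a hypothetical helper for illustration only and is not taken from the actual vcf_to_zarr_sequential implementation.

```python
import warnings

NON_ALLELE = -2  # padding value for genotypes below the maximum ploidy


def fit_to_max_ploidy(alleles, max_ploidy):
    """Pad with -2, or truncate with a warning, to exactly max_ploidy alleles."""
    if len(alleles) > max_ploidy:
        warnings.warn(
            f"genotype with {len(alleles)} alleles truncated to {max_ploidy}",
            stacklevel=2,
        )
        return list(alleles[:max_ploidy])
    return list(alleles) + [NON_ALLELE] * (max_ploidy - len(alleles))


print(fit_to_max_ploidy([0, 1], 4))  # [0, 1, -2, -2]
```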

It looks like this should probably wait on #258

1 reaction
tomwhite commented, Oct 1, 2020

That sounds great @timothymillar. I’ve just submitted #289, so the VCF migration should hopefully be done before too long.

BTW I had to make a small change to a test due to the mixed ploidy change, which shows that your change has been picked up!
