CodonOptimize mode 'harmonized' only optimizes first 1/3 of the sequence
See original GitHub issueUsing the CodonOptimize ‘harmonized’ functionality, I do not get a codon distribution similar to the one I specified. It seems that the second 2/3s of a sequence are never optimized.
A minimal example:
from dnachisel import *
problem = DnaOptimizationProblem(
sequence='GACGACGACAAAAAAAAAAAAAAAAAA',
constraints=[EnforceTranslation()],
objectives=[CodonOptimize(species='b_subtilis', mode='harmonized')]
)
problem.resolve_constraints()
problem.optimize()
print('SEQUENCE:', problem.sequence)
frequencies, positions = biotools.biotools.codons_frequencies_and_positions(problem.sequence)
print(*frequencies.items(), sep='\n')
This sequence encodes DDDKKKKKK and gives as output:
SEQUENCE: GACGATGATAAAAAAAAAAAAAAAAAA
('V', {'total': 0, 'GTA': 0.0, 'GTT': 0.0, 'GTC': 0.0, 'GTG': 0.0})
('L', {'CTT': 0.0, 'TTA': 0.0, 'CTC': 0.0, 'CTG': 0.0, 'CTA': 0.0, 'total': 0, 'TTG': 0.0})
('Q', {'CAG': 0.0, 'total': 0, 'CAA': 0.0})
('G', {'total': 0, 'GGC': 0.0, 'GGA': 0.0, 'GGT': 0.0, 'GGG': 0.0})
('P', {'CCA': 0.0, 'total': 0, 'CCT': 0.0, 'CCG': 0.0, 'CCC': 0.0})
('A', {'total': 0, 'GCG': 0.0, 'GCC': 0.0, 'GCT': 0.0, 'GCA': 0.0})
('T', {'total': 0, 'ACT': 0.0, 'ACA': 0.0, 'ACG': 0.0, 'ACC': 0.0})
('C', {'total': 0, 'TGT': 0.0, 'TGC': 0.0})
('S', {'AGC': 0.0, 'AGT': 0.0, 'TCT': 0.0, 'total': 0, 'TCG': 0.0, 'TCC': 0.0, 'TCA': 0.0})
('N', {'AAC': 0.0, 'total': 0, 'AAT': 0.0})
('H', {'CAC': 0.0, 'total': 0, 'CAT': 0.0})
('I', {'total': 0, 'ATA': 0.0, 'ATT': 0.0, 'ATC': 0.0})
('D', {'GAT': 0.6666666666666666, 'total': 3, 'GAC': 0.3333333333333333})
('M', {'total': 0, 'ATG': 0.0})
('F', {'TTT': 0.0, 'total': 0, 'TTC': 0.0})
('R', {'CGG': 0.0, 'AGG': 0.0, 'CGC': 0.0, 'CGA': 0.0, 'total': 0, 'AGA': 0.0, 'CGT': 0.0})
('K', {'total': 6, 'AAG': 0.0, 'AAA': 1.0})
('E', {'total': 0, 'GAG': 0.0, 'GAA': 0.0})
('*', {'total': 0, 'TAA': 0.0, 'TGA': 0.0, 'TAG': 0.0})
('Y', {'total': 0, 'TAC': 0.0, 'TAT': 0.0})
('W', {'total': 0, 'TGG': 0.0})
It does well for Aspartic Acid (D), as this has a GAT 0.64 / GAC 0.36 ratio, but for K with a AAA 0.7 / AAG 0.3 ratio, it does nothing at all. This is position dependent, as a sequence encoding KKKDDDDDD does the opposite.
While trying to debug, I printed all variants that were used during exhaustive search for this sequence (line 454 of DnaOptimizationProblem.py) , which are:
GACGACGACAAAAAAAAAAAAAAAAAA
GATGACGACAAAAAAAAAAAAAAAAAA
GATGACGACAAAAAAAAAAAAAAAAAA
GATGACGATAAAAAAAAAAAAAAAAAA
GATGACGATAAAAAAAAAAAAAAAAAA
GATGATGATAAAAAAAAAAAAAAAAAA
This again points to the fact that the later positions are never included in the variants to be analysed.
I’m using the latest version, dnachisel 1.1.
It would be great if you could find out where the problem is and come up with a quick fix, so I can use your library for my thesis. Thank you!
Issue Analytics
- State:
- Created 4 years ago
- Comments:9 (6 by maintainers)
Top GitHub Comments
Regarding the naming, I am thinking of changing the terms in the future to “rank_harmonize” (match codon ranks in original host and target) and “frequence_harmonize” (the current algorithm). I’ll make the necessary version bumps and warnings.
Ok this is fixed on Github and PyPI. I have also changed the code so that now you have the convenient
compare_frequencies
method to check the final result:Output:
Note that for this particular optimization there is a risk that DnaChisel introduces a bit of codon bias, because it solves left-to-right (which may cause spatial bias) and because
reverse_translate
always uses the same codons. to be clear, I am not certain this is the case. In any case I have added an option to randomize the codons when reverse-translating the protein sequence:reverse_translate(sequence, randomize_codons=True)
.Let me know if that works for you!