Add support for multiple sequence alignment, homology modeling, and deep structural prediction
Introduction
DeepChem’s support for protein-based deep learning is still at an early stage. In this issue I propose a framework for adding support to DeepChem for more systematic machine learning with proteins. In particular, I propose the addition of new featurizers which compute multiple sequence alignments and mutual-information/contact-prediction features. I also propose the addition of a new homology modeling class which allows the construction of homology models with a DeepChem API.
These proposed modifications are inspired by @miaecle’s work on https://github.com/miaecle/PNet. By bringing some of PNet’s infrastructure into DeepChem, we can leverage DeepChem’s infrastructure to build larger-scale deep models for protein structure prediction. We should also be able to build protein structure prediction models in DeepChem, as is possible in PNet (https://github.com/miaecle/PNet/blob/master/examples/atrous_conv_example.py).
Featurizers
I propose the addition/modification of four featurization classes to DeepChem:
- dc.feat.MultipleSequenceAlignmentFeaturizer: This featurizer would compute multiple sequence alignments for a given sequence. The implementation should closely follow that of PNet https://github.com/miaecle/PNet/blob/master/pnet/feat/MSA.py#L30. Note that this depends on HHblits under the hood.
- dc.feat.OneHotFeaturizer: Generalize this class so that it works with protein sequences and not only small molecules. (We might as well generalize this to work with arbitrary sequences while we’re at it.)
- dc.feat.MutualInformationFeaturizer: This featurizer would compute mutual information for a given sequence. The implementation should closely follow that of PNet https://github.com/miaecle/PNet/blob/master/pnet/feat/twoD_features.py#L43.
- dc.feat.MeanContactPotentialFeaturizer: This featurizer would compute mean contact potential for a given sequence. The implementation should closely follow that of PNet https://github.com/miaecle/PNet/blob/master/pnet/feat/twoD_features.py#L43.
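To make the one-hot and mutual-information featurizers above more concrete, here is a minimal NumPy sketch. It is not PNet's or DeepChem's implementation: the function names `one_hot_sequence` and `mutual_information`, the 21-character alphabet (20 amino acids plus a gap), and the choice to skip sequence reweighting and pseudocounts are all assumptions made for illustration. The alignment is taken as a plain list of equal-length strings, such as one might parse from HHblits output.

```python
import numpy as np

# Assumed alphabet: 20 standard amino acids plus "-" for alignment gaps.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def one_hot_sequence(sequence, alphabet=ALPHABET):
    """Encode a sequence as a (length, len(alphabet)) one-hot matrix."""
    index = {char: i for i, char in enumerate(alphabet)}
    encoded = np.zeros((len(sequence), len(alphabet)), dtype=np.float32)
    for position, char in enumerate(sequence):
        # Raises KeyError for characters outside the alphabet.
        encoded[position, index[char]] = 1.0
    return encoded


def mutual_information(msa, alphabet=ALPHABET):
    """Pairwise mutual information (in nats) between alignment columns.

    msa: list of aligned sequences, all of the same length L.
    Returns a symmetric (L, L) matrix. No sequence reweighting or
    pseudocounts are applied (a simplification for this sketch).
    """
    # Stack one-hot encodings: shape (n_sequences, L, A).
    onehot = np.stack([one_hot_sequence(s, alphabet) for s in msa])
    n_seqs = onehot.shape[0]
    # Per-column residue frequencies, shape (L, A).
    single = onehot.sum(axis=0) / n_seqs
    # Joint residue frequencies for every column pair, shape (L, L, A, A).
    joint = np.einsum("nia,njb->ijab", onehot, onehot) / n_seqs
    # Product of the marginals, same shape as `joint`.
    expected = np.einsum("ia,jb->ijab", single, single)
    # Sum p(a, b) * log(p(a, b) / (p(a) p(b))) over observed pairs only.
    observed = joint > 0
    ratio = np.ones_like(joint)
    ratio[observed] = joint[observed] / expected[observed]
    return (joint * np.log(ratio)).sum(axis=(2, 3))
```

For example, `mutual_information(["AC", "AC", "DE", "DE"])` gives an off-diagonal value close to `log(2)`, since the two columns vary together perfectly, while an alignment with a fully conserved column gives zero for every pair involving that column.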
MoleculeNet Datasets
I propose that we add 4 new datasets to MoleculeNet following the PNet implementation:
- CATH: https://github.com/miaecle/PNet/blob/master/pnet/utils/sequence_utils.py#L73
- PDB50: https://github.com/miaecle/PNet/blob/master/pnet/utils/sequence_utils.py#L80
- CAMEO: https://github.com/miaecle/PNet/blob/master/pnet/utils/sequence_utils.py#L105
- CASP: (With sub datasets CASP5-CASP12) https://github.com/miaecle/PNet/blob/master/pnet/utils/sequence_utils.py#L55
Metrics
I propose we add the following metric:
- dc.metrics.top_k_accuracy: https://github.com/miaecle/PNet/blob/master/pnet/utils/metrics.py#L176
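As a rough illustration of what such a metric might compute, the sketch below assumes the usual contact-prediction meaning of top-k accuracy: the fraction of true contacts among the k highest-scoring residue pairs. The signature, the `min_separation` parameter, and its default value are assumptions for this example; PNet's implementation at the link above may differ.

```python
import numpy as np


def top_k_accuracy(y_true, y_score, k, min_separation=6):
    """Fraction of true contacts among the k highest-scoring residue pairs.

    y_true: (L, L) binary contact map.
    y_score: (L, L) predicted contact scores.
    min_separation: only pairs (i, j) with j - i >= min_separation are
        ranked (hypothetical default, chosen for illustration).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    # Upper-triangle pairs that satisfy the sequence-separation cutoff.
    rows, cols = np.triu_indices(y_true.shape[0], k=min_separation)
    if len(rows) == 0:
        raise ValueError("no residue pairs satisfy min_separation")
    k = min(k, len(rows))
    # Positions of the k largest scores among the candidate pairs.
    top = np.argsort(y_score[rows, cols])[::-1][:k]
    return float(y_true[rows[top], cols[top]].mean())
```

In contact prediction, k is often tied to the sequence length L (for instance L, L/2, or L/5), so a caller might pass `k=len(sequence) // 5`.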
Homology Modelling
At present the most popular tool for homology modelling is the Modeller package. This package is well tested and widely used, but has the significant limitation that it is free only for academic non-profit users and is not open source. Given this limitation, I propose that it would be useful to create the capability to generate homology models in DeepChem. For this purpose I propose the addition of a new class to DeepChem:
- dc.dock.HomologyModeller: Under the hood, this class will use hhmakemodel.py from hhsuite and MultipleSequenceAlignmentFeaturizer to construct homology models for a given sequence.
Models
I’m not yet sure what models we should add pre-built, but here is an example of PNet’s models that could serve as inspiration: https://github.com/miaecle/PNet/blob/master/pnet/models/conv_net_contact_map.py
License Considerations
An important consideration here is that hh-suite is GPL licensed. After some good discussion with @peastman and looking through the GPL FAQ, I believe that we can create MIT licensed code in DeepChem that links to the GPL hhsuite. See these FAQ pointers from @peastman:
- https://www.gnu.org/licenses/gpl-faq.html#WhatIsCompatible
- https://www.gnu.org/licenses/gpl-faq.html#WhatDoesCompatMean
- https://www.gnu.org/licenses/gpl-faq.html#LinkingWithGPL
Any particular DeepChem installation which also has hh-suite installed would become a GPL program, but the DeepChem source code itself could remain MIT licensed, since the MIT license is GPL compatible.
Development Plan
For now this is a preliminary design that I’m posting for feedback and comments. @miaecle @peastman I’d love your comments on this design in particular! If this makes sense, we can then plan the development roadmap for getting these features into DeepChem itself.
Top GitHub Comments
This is an interesting paper that uses a relatively simple approach to structure prediction. Geometric deep learning of RNA structure. Some notable features of it are
In this paper, they use it to predict RNA structure. They train it on 18 molecules whose structures were all determined before 2007. Based on that, it achieves state-of-the-art performance on a blind challenge predicting the structures of much larger RNA molecules.
That paper is really cool! It might be good to incorporate more infrastructure for predicting RNA structures into DeepChem.