Add support for multiple sequence alignment, homology modeling, and deep structural prediction

Introduction

DeepChem’s support for protein-based deep learning is still at an early stage. In this issue I propose a framework for more systematic machine learning with proteins in DeepChem. In particular, I propose adding new featurizers that compute multiple sequence alignment and mutual-information/multiple contact prediction features. I also propose adding a new homology modeling class that allows homology models to be constructed through a DeepChem API.
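To make the mutual-information feature concrete, here is a minimal sketch of how such a featurizer might score pairs of alignment columns. It is not PNet's or DeepChem's implementation: the function names `column_mutual_information` and `mutual_information_matrix` are hypothetical, the alignment is assumed to be a list of equal-length strings, and gaps are treated as an ordinary symbol with no sequence weighting or background correction.

```python
import math
from collections import Counter
from typing import List, Sequence


def column_mutual_information(msa: Sequence[str], i: int, j: int) -> float:
    """Mutual information (in nats) between columns i and j of an aligned MSA."""
    n = len(msa)
    col_i = [seq[i] for seq in msa]
    col_j = [seq[j] for seq in msa]
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in count_ij.items():
        # p(a, b) * log(p(a, b) / (p(a) * p(b))), written in terms of raw counts
        mi += (count / n) * math.log(count * n / (count_i[a] * count_j[b]))
    return mi


def mutual_information_matrix(msa: Sequence[str]) -> List[List[float]]:
    """Symmetric L x L matrix of pairwise column mutual information."""
    if not msa:
        raise ValueError("alignment is empty")
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    matrix = [[0.0] * length for _ in range(length)]
    for i in range(length):
        for j in range(i, length):
            value = column_mutual_information(msa, i, j)
            matrix[i][j] = value
            matrix[j][i] = value
    return matrix
```

Columns that always vary together score high, while independent columns score zero, which is why this kind of matrix is a common input for contact prediction. A real featurizer would likely need sequence reweighting and a correction for small alignments.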

These proposed modifications are inspired by @miaecle’s work on https://github.com/miaecle/PNet. By bringing some of PNet’s infrastructure into DeepChem, we can leverage DeepChem’s infrastructure to build larger-scale deep models for protein structure prediction. We should also be able to build protein structure prediction models in DeepChem, as is possible in PNet (https://github.com/miaecle/PNet/blob/master/examples/atrous_conv_example.py).

Featurizers

I propose the addition/modification of four featurization classes to DeepChem:

MoleculeNet Datasets

I propose that we add 4 new datasets to MoleculeNet following the PNet implementation:

Metrics

I propose we add the following metric:

Homology Modelling

At present the most popular tool for homology modelling is the Modeller package. This package is well tested and widely used, but it has a significant limitation: it is free only for academic non-profit users and isn’t open source. Given this limitation, I think it would be useful to be able to generate homology models in DeepChem. For this purpose I propose the addition of a new class to DeepChem:

  • dc.dock.HomologyModeller: Under the hood, this class will use hhmakemodel.py from hhsuite and MultipleSequenceAlignmentFeaturizer to construct homology models for a given sequence.
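As a rough illustration of the shape such a class could take, here is a sketch that chains an alignment step and a model-building step. Everything in it is an assumption for discussion: the `model` method, the `AlignStep`/`BuildStep` callables, and the file names are hypothetical, and the two steps are stand-ins that only write placeholder files. In a real implementation the alignment step would wrap MultipleSequenceAlignmentFeaturizer and the build step would invoke hhmakemodel.py, whose command-line arguments are deliberately not shown here.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

# Hypothetical step signatures: each step receives a working directory
# and returns the path of the file it produced.
AlignStep = Callable[[str, Path], Path]   # (sequence, work_dir) -> alignment file
BuildStep = Callable[[Path, Path], Path]  # (alignment file, work_dir) -> model file


@dataclass
class HomologyModeller:
    """Sketch of the proposed class; not an existing DeepChem API."""
    align: AlignStep
    build: BuildStep

    def model(self, sequence: str, work_dir: Path) -> Path:
        sequence = sequence.strip().upper()
        if not sequence.isalpha():
            raise ValueError("sequence must be a non-empty string of letters")
        work_dir.mkdir(parents=True, exist_ok=True)
        alignment_file = self.align(sequence, work_dir)
        return self.build(alignment_file, work_dir)


def fake_align(sequence: str, work_dir: Path) -> Path:
    # Stand-in for the MSA step: writes the query alone as a one-sequence alignment.
    path = work_dir / "query.a3m"
    path.write_text(f">query\n{sequence}\n")
    return path


def fake_build(alignment_file: Path, work_dir: Path) -> Path:
    # Stand-in for the hhmakemodel.py step: writes a placeholder model file.
    path = work_dir / "model.pdb"
    path.write_text(f"REMARK built from {alignment_file.name}\n")
    return path


def dry_run() -> str:
    """Run the pipeline end to end with the stand-in steps."""
    with tempfile.TemporaryDirectory() as tmp:
        modeller = HomologyModeller(align=fake_align, build=fake_build)
        return modeller.model("mkvl", Path(tmp)).read_text()
```

Passing the two steps in as callables keeps the external hhsuite dependency at the edge of the class, so the pipeline logic can be tested without hhsuite installed.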

Models

I’m not yet sure what models we should add pre-built, but here is an example of PNet’s models that could serve as inspiration: https://github.com/miaecle/PNet/blob/master/pnet/models/conv_net_contact_map.py

License Considerations

An important consideration here is that hh-suite is GPL licensed. After some good discussion with @peastman and after looking through the GPL FAQs, I believe that we can create MIT licensed code in DeepChem that links to the GPL hhsuite. See these FAQ pointers from @peastman:

Any particular DeepChem installation which also has hh-suite installed would become a GPL program, but the DeepChem source code itself could remain MIT licensed, since the MIT license is GPL compatible.

Development Plan

For now this is a preliminary design I’m posting for feedback and comments from folks. @miaecle @peastman I’d love your comments on this design in particular! If this makes sense, we can then plan on the development roadmap for getting these features into DeepChem itself.

Issue Analytics

  • State: open
  • Created 3 years ago
  • Reactions: 3
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

peastman commented, Aug 27, 2021 (1 reaction)

This is an interesting paper that uses a relatively simple approach to structure prediction: "Geometric deep learning of RNA structure". Some notable features of it are:

  • It uses only atomic coordinates, not evolutionary information
  • It’s not specific to any particular type of molecule
  • It requires very little training data

In this paper, they use it to predict RNA structure. They train it on 18 molecules whose structures were all determined before 2007. Based on that, it achieves state-of-the-art performance on a blind challenge predicting the structures of much larger RNA molecules.

rbharath commented, Sep 7, 2021 (0 reactions)

That paper is really cool! It might be good to incorporate more infrastructure for predicting RNA structures into DeepChem.


