Support for DGL Modeling
Hi DeepChem team,
This is Mufei from the DGL team. I’ve also spent some time developing DGL-LifeSci, a DGL-based package for working with graphs in chemistry and biology. It seems that DeepChem has started supporting DGL-based modeling with PyTorch (e.g. #2089 by @nd-02110114), which is rather exciting! I’ve had some chats with @rbharath before about contributing to this effort. Below are some observations and proposals, and I’d like to know your thoughts.
Compared with pure PyTorch-based modeling, DGL-based modeling additionally requires:
- Converting graph data into DGL’s data structure `DGLGraph` and storing node/edge features in `DGLGraph.ndata` and `DGLGraph.edata`
- Using APIs like `DGLGraph.update_all` for invoking message passing over graphs in NN modules
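To make the second requirement concrete, here is a minimal pure-Python sketch of what one round of message passing computes: every node sums the features of the nodes that point to it. It is an illustration only, not DGL’s implementation; the function name `sum_message_passing` and the toy graph are made up for this example.

```python
from typing import Dict, List, Tuple


def sum_message_passing(
    num_nodes: int,
    edges: List[Tuple[int, int]],
    node_feats: Dict[int, List[float]],
) -> Dict[int, List[float]]:
    """One round of copy-source / sum-reduce message passing.

    Hypothetical helper for illustration. In DGL this roughly corresponds to
    g.update_all(dgl.function.copy_u('h', 'm'), dgl.function.sum('m', 'h')).
    """
    dim = len(next(iter(node_feats.values())))
    # In this sketch, nodes that receive no message end up with a zero vector.
    out = {v: [0.0] * dim for v in range(num_nodes)}
    for src, dst in edges:
        # The message on edge (src -> dst) is a copy of the source feature.
        for i, x in enumerate(node_feats[src]):
            out[dst][i] += x
    return out


# Toy directed graph: 0 -> 1, 0 -> 2, 1 -> 2
edges = [(0, 1), (0, 2), (1, 2)]
feats = {0: [1.0, 0.0], 1: [0.0, 2.0], 2: [3.0, 3.0]}
updated = sum_message_passing(3, edges, feats)
print(updated)
```

A framework such as DGL performs the same aggregation in batched tensor form, which is why node and edge features have to live inside the graph object.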
For the first point, DeepChem employs `GraphData` as an intermediate graph representation across different frameworks (DGL, PyTorch Geometric, etc.). It allows graph creation from a COO format along with pre-processed features and exposes a `to_dgl_graph` API for converting a `GraphData` instance into a `DGLGraph` instance. For the second point, DeepChem implements DGL-based PyTorch models under `deepchem/deepchem/models/torch_models`.
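As a rough illustration of the COO format mentioned above, the sketch below stores a graph as two parallel lists of source and destination node indices plus per-node features. `CooGraph` is a hypothetical stand-in written for this example; it is not DeepChem’s `GraphData` class and does not reproduce its API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CooGraph:
    """Hypothetical minimal COO graph container (not DeepChem's GraphData)."""

    node_features: List[List[float]]  # one feature vector per node
    src: List[int]                    # source node index of each edge
    dst: List[int]                    # destination node index of each edge

    def __post_init__(self) -> None:
        # Edge i runs from src[i] to dst[i], so both lists must align.
        if len(self.src) != len(self.dst):
            raise ValueError("src and dst must have the same length")
        n = self.num_nodes
        if any(not 0 <= i < n for i in self.src + self.dst):
            raise ValueError("edge index out of range")

    @property
    def num_nodes(self) -> int:
        return len(self.node_features)

    @property
    def num_edges(self) -> int:
        return len(self.src)

    def neighbors(self, node: int) -> List[int]:
        """Destination nodes of all edges leaving `node`."""
        return [d for s, d in zip(self.src, self.dst) if s == node]


# Heavy atoms of ethanol, C-C-O; each bond is stored in both directions.
graph = CooGraph(
    node_features=[[6.0], [6.0], [8.0]],  # atomic numbers as toy features
    src=[0, 1, 1, 2],
    dst=[1, 0, 2, 1],
)
print(graph.num_nodes, graph.num_edges, graph.neighbors(1))
```

With a representation like this, a converter to another framework only needs to pass the two index lists and the feature arrays along.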
For better support of DGL-based modeling and, more generally, graph-based modeling, there are several possible points:
- Support for APIs like `from_dgl_graph` and `from_pyg_graph`, which can be helpful for users already familiar with DGL/PyG.
- A simple interface for custom datasets. This can be something like a variant of CSVLoader, which directly constructs a graph dataset from files of a standard data format like CSV. It allows users to specify the type of graph to construct as well as the way to featurize their nodes/edges. DGL-LifeSci’s MoleculeCSVDataset can be an example for this.
- Functions for constructing standard types of graphs (molecular graphs, complete graphs, KNN graphs, distance-based graphs) from raw data like SMILES strings. While graph creation from a COO format maximizes flexibility, it can be convenient to have such functions for users who only want to try existing modeling approaches on their own datasets. I think the design of Protein Graph Library follows a similar idea.
- Support for heterogeneous graphs, i.e. graphs of typed nodes and edges. I have the impression that DeepChem is mainly for molecular property prediction and, to a lesser extent, protein-ligand binding affinity prediction, so maybe this is less of an issue. However, this can still be helpful even for molecular property prediction when one wants to combine information from different graph structures. Molecule Attention Transformer is an example of this.
- Examples for model training and evaluation on MoleculeNet
- Additionally, DGL-LifeSci has implemented some graph neural networks here and I’d like to know if you are open to directly import them in DeepChem.
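The graph-construction helpers proposed in the list above could look roughly like the following. This is a pure-Python sketch under the assumption that the input is a list of 3D coordinates and the output is a COO-style edge list; the function names `complete_edges`, `distance_edges`, and `knn_edges` are hypothetical and are not existing DeepChem or DGL-LifeSci functions.

```python
import math
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]
Point = Sequence[float]


def complete_edges(num_nodes: int) -> List[Edge]:
    """All directed edges between distinct nodes."""
    return [(i, j) for i in range(num_nodes) for j in range(num_nodes) if i != j]


def distance_edges(coords: Sequence[Point], cutoff: float) -> List[Edge]:
    """Directed edges between distinct nodes closer than `cutoff`."""
    n = len(coords)
    return [
        (i, j)
        for i in range(n)
        for j in range(n)
        if i != j and math.dist(coords[i], coords[j]) < cutoff
    ]


def knn_edges(coords: Sequence[Point], k: int) -> List[Edge]:
    """Edge (j, i) from each of the k nearest neighbors j of node i into i."""
    edges: List[Edge] = []
    for i, p in enumerate(coords):
        others = [j for j in range(len(coords)) if j != i]
        # Sort candidates by distance to node i; ties keep index order.
        others.sort(key=lambda j: math.dist(p, coords[j]))
        edges.extend((j, i) for j in others[:k])
    return edges


# Four points on a line at x = 0, 1, 2, 10
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
print(len(complete_edges(4)))
print(distance_edges(coords, cutoff=1.5))
print(knn_edges(coords, k=1))
```

The quadratic loops are fine for small molecules; larger structures such as proteins would likely call for a spatial index instead.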
Issue Analytics
- Created 3 years ago
- Reactions:2
- Comments:12 (12 by maintainers)
Top GitHub Comments
Great to see you on here @mufeili! I’m excited to see us improve DGL support/integration moving forward 😃
+1 to this! @nd-02110114 has kindly taken the lead on our refactoring to use `GraphData` as a common substrate. Once that’s merged in, we should be able to support more of DGL’s APIs.
I think our current dataloader/featurizer pipeline would support these use cases right? Taking a look at `MoleculeCSVDataset`, I think the analog would be our `InMemoryLoader` https://deepchem.readthedocs.io/en/latest/dataloaders.html#inmemoryloader which allows for loading of data from Pandas dataframes as well. Is there any useful functionality that we’re missing here?
+1 to this as well! Props to @nd-02110114 for taking the lead here as well 😃
We’ve definitely talked about getting Molecule Attention Transformer support in DeepChem! I’d love to see MAT and similar models supported 😃
For more heterogeneous graphs, I’m not sure. DeepChem’s focus is on scientific deep learning applications. Is there a good scientific use case for heterogeneous graphs beyond MAT? As @nd-02110114 noted above, this may be out of scope for us if there isn’t a clear use case.
This is definitely one of our priorities! @nd-02110114 is actively working on this already, and I’d love to see more MoleculeNet benchmarks put up for all DGL models.
We’d be very open to directly importing DGL-LifeSci’s models in DeepChem! It might also be useful to write small wrapper classes wrapping DGL models in `TorchModel` for ease of benchmarking on MoleculeNet or interoperating with the rest of the DeepChem API. In general though, I’d love to see closer integration with DGL-LifeSci’s models. We’re both working on similar problems, and it is better to join forces and bring more value to our community.
Thanks for your comments! These comments are really helpful for us.
The following comments are my personal opinion.
I agree. Before doing this, we need to refactor DeepChem’s existing graph models to use the `GraphData` class. I will work on this refactoring this month. I think the priority is high.
I think these features can be achieved by refactoring `GraphData` and the featurizer that is implemented for molecules. This makes our graph model support more general, so I will definitely try to implement them. (I like the design of Protein Graph Library, so I will imitate its API design.) I think the priority is intermediate.
To be honest, I also think this feature is basically out of scope. However, I’m interested in Molecule Attention Transformer. If we have time, we will try to support it. I think the priority is low.
This is a work in progress and the highest-priority task. Currently, I’m checking whether the model works well with a GPU or a large dataset. After finishing, I will add more details to the DeepChem docs.
I think we are open to this. However, it is currently impossible to use DGL-LifeSci models without modification. What do you think, @rbharath?