FEP Module
I propose that we add a new FEP (free energy perturbation) module, as `dc.fep`.
Introduction
Free energy perturbation has become an increasingly powerful technique in modern drug discovery. Starting with the publication of techniques like LOMAP and Schrodinger’s FEP+, and the release of open source tools like Yank, free energy techniques have matured into powerful tools for estimating the binding free energy of molecules to proteins. The basic idea of FEP is that it’s possible to estimate the free energy change of small perturbations to a system by using the Zwanzig FEP identity (GitHub doesn’t support LaTeX, so this will be a little messy):

E_A[exp(-beta * Delta(U))] = exp(-beta * Delta(F))
Here U is an energy function. We assume we have two states A and B. Think of A as the initial state, with energy function U_A, and B as the ending state, with energy function U_B. The difference is Delta(U) = U_B - U_A. The expectation E_A is over the distribution specified by the density

p_A = C * exp(-beta * U_A)
In practice, it’s possible to perform a molecular dynamics simulation to compute this expectation. This allows the change in free energy Delta(F) = F_B - F_A to be estimated from a simulation in state A. For this simulation to converge reasonably, though, A and B should overlap considerably (since we are sampling from A to estimate the density of B). This overlap requirement has traditionally made sampling difficult. There are techniques like MBAR (see https://github.com/choderalab/pymbar, https://github.com/alchemistry/alchemlyb) that help perform these calculations, but convergence can still be slow.
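As a minimal sketch of the identity above (this is illustrative code, not existing DeepChem API; the function and argument names are assumptions):

```python
import numpy as np

def zwanzig_delta_f(samples_a, u_a, u_b, beta=1.0):
    """Estimate Delta F = F_B - F_A from configurations sampled in state A.

    samples_a : array of configurations drawn from state A (e.g. MD snapshots)
    u_a, u_b  : vectorized energy functions for states A and B
    beta      : inverse temperature 1 / (kT)
    """
    delta_u = u_b(samples_a) - u_a(samples_a)
    # Zwanzig identity: Delta F = -(1/beta) * log E_A[exp(-beta * Delta U)]
    return -np.log(np.mean(np.exp(-beta * delta_u))) / beta
```

Note that the exponential average is dominated by the rare configurations of A that are also probable under B, which is exactly the overlap problem described above.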
DeepMind recently came out with a fascinating paper, Targeted Free Energy Estimation Via Learned Mappings, that proposes a technique to help with this problem. The idea is to use a normalizing flow that transforms state A into a state A' that has higher overlap with B. A normalizing flow is a type of deep network that evolves probability distributions. The DeepMind paper trains a normalizing flow for a simple solute-solvent system: training data is generated by an MD simulation and used to train the flow. Results on this simple system show that the normalizing flow appears to considerably speed up the convergence of the estimate.
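The targeted estimator can be sketched as follows. This is a hedged illustration of the paper’s idea, not DeepMind or DeepChem code; the `flow` callable (returning the mapped configuration plus the log-determinant of its Jacobian) is an assumed interface:

```python
import numpy as np

def targeted_delta_f(samples_a, u_a, u_b, flow, beta=1.0):
    """Zwanzig estimate through a learned map M: A -> A' (higher overlap with B).

    flow : callable mapping configurations x -> (M(x), log|det J_M(x)|)
    """
    mapped_x, log_det_jac = flow(samples_a)
    # Generalized work for the mapped proposal:
    #   Phi(x) = U_B(M(x)) - U_A(x) - (1/beta) * log|det J_M(x)|
    phi = u_b(mapped_x) - u_a(samples_a) - log_det_jac / beta
    # Delta F = -(1/beta) * log E_A[exp(-beta * Phi)]
    return -np.log(np.mean(np.exp(-beta * phi))) / beta
```

With the identity map (log-det zero) this reduces to the plain Zwanzig estimator; if the flow maps p_A exactly onto p_B, the variance vanishes and even a single sample gives the exact Delta F.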
Proposed Changes
I propose we should add support for deep FEP models in DeepChem. Doing this would require the following steps:
- Adding support for normalizing flows: Normalizing flows behave a little differently from our supervised/metalearning/RL models since they evolve distributions. This would require some new infrastructure. Luckily, there are a number of reference normalizing flow implementations out there with permissive licenses (https://github.com/tonyduan/normalizing-flows), so we could likely leverage this code to build out some infrastructure. This code should probably go in `dc.models`, although if the models are different enough, we might need to make a `dc.normflow` submodule like we have for `dc.rl` and `dc.metalearning`.
- We need to add additional metrics suitable for normalizing flow training and evaluation to `dc.metrics`.
- We need to add new loss functions, LBAR (Learned Bennett Acceptance Ratio) and LFEP (Learned Free Energy Perturbation), as defined in the DeepMind paper.
- To make deep FEP models more broadly useful, we need to generate larger and more interesting training datasets for them. These datasets should eventually find their way into `dc.molnet`. They will likely have to be generated by MD simulation of many protein/ligand systems. This will be a large effort, and ideally we should find a way to tap already existing databases; Folding@Home has been running a lot of free energy calculations, so they might already have a suitable dataset.
- To apply FEP to systems of practical interest, we need a number of additional tools. When applying FEP to lead optimization problems, a series of related compounds is usually constructed. I propose that we add utilities to do this automatically, using CReM (https://github.com/DrrDom/crem), a library that constructs chemically reasonable mutations of a starting compound. The general pattern here is a `PerturbationGenerator` abstract class that generates perturbed versions of an initial state. The CReM class would probably be a `MoleculePerturbationGenerator` concrete subclass. This should likely live in the `dc.fep` module.
- When applying FEP to regions of interest, it’s often crucial to select the region for simulation correctly so unneeded work isn’t performed. We have some similar utilities for automatic binding pocket detection in `dc.docking`, but we might need to add more targeted tools for selecting the region of interest.
- We need a way of running FEP on new systems of interest. One way to do this might be to add a new `FEPEngine` class. Under the hood, this class should rely on Yank and OpenMM to the degree possible. This would live in the `dc.fep` module.
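As a rough sketch of the proposed generator interface (the class names and the use of CReM’s `mutate_mol` function are assumptions for illustration, not existing DeepChem API):

```python
from abc import ABC, abstractmethod
from typing import Iterator

class PerturbationGenerator(ABC):
    """Abstract interface that yields perturbed versions of an initial state."""

    @abstractmethod
    def generate(self, initial_state) -> Iterator:
        """Yield perturbed states derived from initial_state."""

class MoleculePerturbationGenerator(PerturbationGenerator):
    """Uses CReM to construct chemically reasonable mutations of a molecule."""

    def __init__(self, fragment_db: str):
        # Path to a CReM replacement-fragment database (see the CReM docs).
        self.fragment_db = fragment_db

    def generate(self, initial_state) -> Iterator:
        # Lazy import so the base interface has no hard CReM dependency.
        # crem.crem.mutate_mol takes an RDKit Mol and yields SMILES strings
        # of chemically reasonable mutations.
        from crem.crem import mutate_mol
        yield from mutate_mol(initial_state, db_name=self.fragment_db)
```

Other backends (for example, matched molecular pair libraries) could then slot in as additional `PerturbationGenerator` subclasses without changing downstream FEP code.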
Scope
Would these changes be in scope for DeepChem? One natural question is whether this would be a better contribution to Yank or to a separate library. Eventually, as normalizing flow techniques mature, they will likely find their way into libraries like Yank that focus on free energy perturbation. But for the moment, normalizing flow techniques are very new. The DeepMind paper focuses on a toy system, and considerable research and development will have to be done before these techniques are suitable for broader applications. This will involve a lot of model building, dataset gathering, benchmarking, etc. As a scientific deep learning library, DeepChem is well suited to facilitate these types of activities. I believe DeepMind has not open sourced their implementation, so creating a high quality reference implementation will help accelerate research in this field and get these techniques closer to practical applicability.
As a second question, it’s reasonable to ask whether this should be its own library instead of a part of DeepChem. The major advantage of building it within DeepChem is that it’s easy to leverage and extend the work we’ve put into the build/documentation/tooling around DeepChem. Bootstrapping a new library would be considerable work for a very experimental technique.
Another thing to note is that I’m in the middle of overhauling DeepChem’s support for structure-based drug discovery at the moment. We have the new `dc.docking` module, and I’m working on extending atomic convolutions. There should be natural synergy between these efforts and `dc.fep` that should be mutually beneficial.
Implementation
I’m willing to take the lead on implementing this feature, but since this is a large set of new features, any help other folks are interested in providing would be much appreciated. Also, everything I’ve laid out in this issue is just a first design sketch. Feedback and comments are very welcome!
Issue Analytics
- Created: 3 years ago
- Reactions: 6
- Comments: 9 (9 by maintainers)
Top GitHub Comments

I think the only thing we need would be a custom implementation of a circular spline bijector, which could either inherit from `tfp.bijectors.RationalQuadraticSpline` or use its code as a starting point and be modified to respect PBC.

One thought here is that we might want to implement a `dc.dock`-like module for programmatic FEP (calling, say, Yank or a new normalizing flow engine under the hood). I don’t yet have a sense of how complicated this would be, though.