PDBbind scaffold test set appears to be truncated
See original GitHub issue

deepchem/contrib/atomicconv/acnn/refined/get_acnn_refined.sh
import deepchem as dc

test = dc.data.DiskDataset("datasets/scaffold_test")
print(len(test.ids))
This test set has only 708 entries (the others have ~740), and the affinities stop exactly at pK 8 (there are no high-affinity compounds).
It seems unlikely that the high-affinity compounds are missing because of scaffold clustering (cut off at precisely pK 8.00), as opposed to the list being truncated somewhere. It is quite suboptimal for more than a third of the affinity range to be absent from the test set.
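The truncation claim above can be checked directly against the split's labels. Below is a minimal sketch, assuming the dataset's `y` values are pK binding affinities; the helper `check_affinity_coverage` and the synthetic label vector are hypothetical illustrations, not part of DeepChem:

```python
import numpy as np

def check_affinity_coverage(y, cutoff=8.0):
    """Report how much of the affinity range sits above a pK cutoff.

    y is a 1-D array of pK binding affinities; a healthy split should
    contain at least some compounds above the cutoff.
    """
    y = np.asarray(y, dtype=float)
    n_above = int((y > cutoff).sum())
    return {
        "n_total": int(y.size),
        "n_above_cutoff": n_above,
        "max_pK": float(y.max()),
        "truncated": n_above == 0,
    }

# Illustrative only: 708 labels that stop exactly at pK 8,
# mimicking the truncated scaffold test set described above.
labels = np.linspace(2.0, 8.0, 708)
report = check_affinity_coverage(labels)
```

Against the real split this would be `check_affinity_coverage(test.y)` after loading the `DiskDataset` as shown above, assuming the dataset directory exists locally.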
Issue Analytics
- State: Closed
- Created: 6 years ago
- Comments: 7 (3 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
What split makes the most sense depends on what sort of generalization error you are trying to measure. If you are interested in how well you will generalize to new targets, you should split by targets (which is what we do, with a significant difference in sequence identity). If you are interested in how well you generalize to new chemotypes, a scaffold split makes sense (although this is a bit tricky; e.g. compounds with different scaffolds may still have the same “warheads”).
I don’t find the time split or the non-core/core split particularly attractive, but to each their own. They also aren’t particularly amenable to cross-validation or bootstrapping.
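A target-based split of the kind described in the comment above can be sketched in a few lines: assign whole target groups to folds so that no target appears in both train and test, which also makes the split amenable to cross-validation. This is a hypothetical illustration (the grouping key, balancing heuristic, and example data are made up), not DeepChem's implementation, and it ignores the sequence-identity filtering the comment mentions:

```python
from collections import defaultdict

def split_by_target(complex_ids, target_of, n_folds=3):
    """Assign whole target groups to folds so no target spans folds.

    complex_ids: iterable of complex identifiers.
    target_of:   dict mapping each complex id to a target identifier.
    Returns a list of n_folds lists of complex ids.
    """
    groups = defaultdict(list)
    for cid in complex_ids:
        groups[target_of[cid]].append(cid)

    # Greedily place the largest target groups first to keep fold
    # sizes roughly balanced.
    folds = [[] for _ in range(n_folds)]
    for _, members in sorted(groups.items(), key=lambda kv: -len(kv[1])):
        min(folds, key=len).extend(members)
    return folds

# Illustrative data: five complexes over three hypothetical targets.
targets = {"1abc": "kinaseA", "2abc": "kinaseA",
           "3xyz": "proteaseB", "4xyz": "proteaseB", "5pqr": "gpcrC"}
folds = split_by_target(list(targets), targets, n_folds=3)
```

Each fold can then serve as a held-out test set in turn, giving a leave-targets-out estimate of generalization to new targets.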
Closing this old discussion. Feel free to re-open if there are new points to consider.