pdbbind scaffold test set appears to be truncated

deepchem/contrib/atomicconv/acnn/refined/get_acnn_refined.sh

     import deepchem as dc
     test = dc.data.DiskDataset("datasets/scaffold_test")
     print(len(test.ids))  # number of entries in the scaffold test split

This test set has only 708 entries (the others have ~740), and its affinities stop at exactly 8 pK, so it contains no high-affinity compounds.

It seems unlikely that high-affinity compounds (at precisely the 8.00 cutoff) are missing because of scaffold clustering rather than because the list got truncated somewhere. It’s pretty suboptimal for more than a third of the affinity range to be missing from the test set.
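One way to confirm this kind of truncation is to compare label summaries across splits. The sketch below is a minimal, pure-Python illustration: `summarize_affinities` is a hypothetical helper, the label lists are made-up values rather than PDBbind data, and the 8.0 cutoff is simply the value reported above.

```python
from statistics import mean


def summarize_affinities(labels, cutoff=8.0):
    """Summarize pK labels and count how many lie strictly above `cutoff`.

    `labels` is any iterable of numbers. Hypothetical helper, not part of DeepChem.
    """
    values = [float(v) for v in labels]
    if not values:
        raise ValueError("labels is empty")
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "n_above_cutoff": sum(1 for v in values if v > cutoff),
    }


# Made-up labels for illustration only (not real PDBbind affinities).
train_like = [4.2, 6.5, 8.0, 9.3, 11.1]
test_like = [4.0, 5.5, 7.1, 8.0]

print(summarize_affinities(train_like))
print(summarize_affinities(test_like))
```

With a loaded `DiskDataset`, the labels could presumably be passed as `test.y.ravel()`, assuming the stored labels are raw pK values rather than normalized ones. A split whose maximum sits exactly on the cutoff with nothing above it, while the other splits extend well past it, would point to truncation rather than clustering.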

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

dkoes commented, Nov 16, 2017

What split makes the most sense depends on what sort of generalization error you are trying to measure. If you are interested in how well you will generalize to new targets, you should split by targets (which is what we do, with a significant difference in sequence identity). If you are interested in how well you generalize to new chemotypes, a scaffold split makes sense (although this is a bit tricky; e.g. compounds with different scaffolds may still have the same “warheads”).
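Both kinds of split described here keep every member of a group (a target, or a scaffold) on the same side of the train/test boundary. A minimal sketch of that idea follows; `group_split`, the record layout, and the example identifiers are all hypothetical, and this is not DeepChem's `ScaffoldSplitter` or the splitting procedure used by either commenter.

```python
import random
from collections import defaultdict


def group_split(records, group_of, test_fraction=0.2, seed=0):
    """Split records so that no group appears in both train and test.

    `group_of` maps a record to its group key (e.g. a target family or a
    scaffold string). Whole groups are assigned to the test set until it
    reaches roughly `test_fraction` of the records.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[group_of(rec)].append(rec)

    keys = sorted(groups)                 # deterministic order before shuffling
    random.Random(seed).shuffle(keys)

    target = test_fraction * len(records)
    train, test = [], []
    for key in keys:
        bucket = test if len(test) < target else train
        bucket.extend(groups[key])
    return train, test


# Hypothetical records: (complex id, target family, scaffold label).
records = [
    ("c1", "kinase", "scafA"),
    ("c2", "kinase", "scafB"),
    ("c3", "protease", "scafA"),
    ("c4", "protease", "scafC"),
    ("c5", "gpcr", "scafB"),
    ("c6", "gpcr", "scafC"),
]

# Split by target to estimate generalization to new targets...
train_t, test_t = group_split(records, lambda r: r[1], test_fraction=0.34)
# ...or by scaffold to estimate generalization to new chemotypes.
train_s, test_s = group_split(records, lambda r: r[2], test_fraction=0.34)
print(len(train_t), len(test_t), len(train_s), len(test_s))
```

Because whole groups move together, the realized test fraction only approximates `test_fraction`, and, as noted above, compounds in different scaffold groups can still share features such as warheads.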

I don’t find the time split or the non-core/core split particularly attractive, but to each their own. They also aren’t particularly amenable to cross-validation or bootstrapping.

rbharath commented, Jan 18, 2020

Closing this old discussion. Feel free to re-open if there are new points to consider.
