PDBbind scaffold test set appears to be truncated
See original GitHub issue

deepchem/contrib/atomicconv/acnn/refined/get_acnn_refined.sh
import deepchem as dc

test = dc.data.DiskDataset("datasets/scaffold_test")
print(len(test.ids))
This test set has only 708 entries (the others have ~740), and the affinities stop exactly at pK 8 (there are no high-affinity compounds).
It seems unlikely that the high-affinity compounds are missing because of scaffold clustering (cut off at precisely pK 8.00), as opposed to the list being truncated somewhere. It is quite suboptimal for more than a third of the affinity range to be absent from the test set.
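The truncation claim above can be checked directly against the split's labels. Below is a minimal sketch, assuming the dataset's `y` values are pK binding affinities; the helper `check_affinity_coverage` and the synthetic label vector are hypothetical illustrations, not part of DeepChem:

```python
import numpy as np

def check_affinity_coverage(y, cutoff=8.0):
    """Report how much of the affinity range sits above a pK cutoff.

    y is a 1-D array of pK binding affinities; a healthy split should
    contain at least some compounds above the cutoff.
    """
    y = np.asarray(y, dtype=float)
    n_above = int((y > cutoff).sum())
    return {
        "n_total": int(y.size),
        "n_above_cutoff": n_above,
        "max_pK": float(y.max()),
        "truncated": n_above == 0,
    }

# Illustrative only: 708 labels that stop exactly at pK 8,
# mimicking the truncated scaffold test set described above.
labels = np.linspace(2.0, 8.0, 708)
report = check_affinity_coverage(labels)
```

Against the real split this would be `check_affinity_coverage(test.y)` after loading the `DiskDataset` as shown above, assuming the dataset directory exists locally.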
Issue Analytics
- State: Closed
- Created: 6 years ago
- Comments: 7 (3 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
What split makes the most sense depends on what sort of generalization error you are trying to measure. If you are interested in how well you will generalize to new targets, you should split by targets (which is what we do, with a significant difference in sequence identity). If you are interested in how well you generalize to new chemotypes, a scaffold split makes sense (although this is a bit tricky; e.g. compounds with different scaffolds may still have the same “warheads”).
I don’t find the time split or the non-core/core split particularly attractive, but to each their own. They also aren’t particularly amenable to cross-validation or bootstrapping.
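A target-based split of the kind described in the comment above can be sketched in a few lines: assign whole target groups to folds so that no target appears in both train and test, which also makes the split amenable to cross-validation. This is a hypothetical illustration (the grouping key, balancing heuristic, and example data are made up), not DeepChem's implementation, and it ignores the sequence-identity filtering the comment mentions:

```python
from collections import defaultdict

def split_by_target(complex_ids, target_of, n_folds=3):
    """Assign whole target groups to folds so no target spans folds.

    complex_ids: iterable of complex identifiers.
    target_of:   dict mapping each complex id to a target identifier.
    Returns a list of n_folds lists of complex ids.
    """
    groups = defaultdict(list)
    for cid in complex_ids:
        groups[target_of[cid]].append(cid)

    # Greedily place the largest target groups first to keep fold
    # sizes roughly balanced.
    folds = [[] for _ in range(n_folds)]
    for _, members in sorted(groups.items(), key=lambda kv: -len(kv[1])):
        min(folds, key=len).extend(members)
    return folds

# Illustrative data: five complexes over three hypothetical targets.
targets = {"1abc": "kinaseA", "2abc": "kinaseA",
           "3xyz": "proteaseB", "4xyz": "proteaseB", "5pqr": "gpcrC"}
folds = split_by_target(list(targets), targets, n_folds=3)
```

Each fold can then serve as a held-out test set in turn, giving a leave-targets-out estimate of generalization to new targets.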
Closing this old discussion. Feel free to re-open if there are new points to consider.