Unify API for loader functions
The molnet loader functions are currently divided into two groups with inconsistent APIs for specifying featurizers and splitters.
Most of them have arguments featurizer and split to specify them. These arguments take one of a set of hardcoded strings, like featurizer='ECFP' or split='random'. There are a few problems here.
- The list of accepted values is undocumented.
- There’s also no way to discover them programmatically.
- There’s no way to specify a splitter or featurizer that isn’t on the list.
- The list of allowed options varies widely between datasets. If there’s a coherent set of rules behind them, I don’t know what it is.
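To make these problems concrete, here is a minimal sketch of the hardcoded-string pattern. It is not DeepChem code: load_example and the entries other than 'ECFP' and 'random' are hypothetical names chosen for illustration.

```python
# Hypothetical stand-in for a string-only loader; not taken from DeepChem.
def load_example(featurizer="ECFP", split="random"):
    # The accepted names live inside the function body, so a caller can
    # neither list them programmatically nor add a new one.
    featurizers = {"ECFP": "circular fingerprint", "GraphConv": "graph convolution"}
    splitters = {"random": "random split", "scaffold": "scaffold split"}
    if featurizer not in featurizers:
        raise ValueError(f"unknown featurizer: {featurizer!r}")
    if split not in splitters:
        raise ValueError(f"unknown split: {split!r}")
    return featurizers[featurizer], splitters[split]
```

A caller who wants a featurizer that is not in the local dictionary has no option short of editing the loader itself.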
Then there are the ones that use the template introduced in #1938. These work rather differently. The argument for specifying the splitter is called splitter instead of split. The arguments may take either the name of a class, the class itself, or an instance of the class. These functions have their own set of issues.
- The list of accepted values is again undocumented.
- It is possible to discover them programmatically (for example from dc.molnet.load_function.zinc15_datasets.zinc15_splitters), but the mechanism is itself undocumented.
- Much of the documentation is incorrect. For example, the documentation of the splitter argument refers to “allowed splitters”, when in fact the value should specify a single splitter. It also never mentions the possibility of passing a string or class.
- It isn’t clear to me that there’s really a lot of benefit from having so many options. featurizer='CircularFeaturizer' isn’t substantially clearer, shorter, or more convenient than featurizer=dc.feat.CircularFeaturizer().
We should come up with a single consistent API for all of them.
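For illustration, the three accepted forms in the template-based loaders (class name, class, or instance) could be resolved roughly as follows. This is a sketch under assumptions, not the actual template code: the Splitter classes, the SPLITTERS registry, and resolve_splitter are all hypothetical stand-ins.

```python
import inspect


class Splitter:
    """Hypothetical base class standing in for a real splitter."""


class RandomSplitter(Splitter):
    pass


class ScaffoldSplitter(Splitter):
    pass


# Hypothetical public registry. Exposing it at module level is one way to
# make the accepted names discoverable and easy to document.
SPLITTERS = {cls.__name__: cls for cls in (RandomSplitter, ScaffoldSplitter)}


def resolve_splitter(splitter):
    """Return a Splitter instance from a class name, a class, or an instance."""
    if isinstance(splitter, str):
        try:
            splitter = SPLITTERS[splitter]
        except KeyError:
            raise ValueError(
                f"unknown splitter {splitter!r}; choose from {sorted(SPLITTERS)}"
            ) from None
    if inspect.isclass(splitter):
        if not issubclass(splitter, Splitter):
            raise TypeError(f"{splitter!r} is not a Splitter subclass")
        return splitter()  # instantiate with default arguments
    if isinstance(splitter, Splitter):
        return splitter  # already an instance, use as-is
    raise TypeError(f"cannot interpret {splitter!r} as a splitter")
```

All three of resolve_splitter("RandomSplitter"), resolve_splitter(RandomSplitter), and resolve_splitter(RandomSplitter()) yield a RandomSplitter instance, which is the redundancy the last bullet questions.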
Top GitHub Comments
How about this design:
- featurizer. It can take either a Featurizer object or one of the special names.
- splitter. It can take either a Splitter object or one of the special names.
- split will be accepted as a deprecated synonym for splitter.

I can do 1-5. The creators and maintainers of particular datasets will need to do 6.
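The parts of this design that are spelled out above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the stand-in Featurizer and Splitter classes, the name tables, and load_example are hypothetical, and only the argument names featurizer, splitter, and split and the values 'ECFP' and 'random' come from the discussion.

```python
import warnings


class Featurizer:
    """Hypothetical stand-in for a featurizer base class."""


class ECFPFeaturizer(Featurizer):
    pass


class Splitter:
    """Hypothetical stand-in for a splitter base class."""


class RandomSplitter(Splitter):
    pass


# Hypothetical tables of "special names"; public so they can be documented
# and inspected by callers.
FEATURIZER_NAMES = {"ECFP": ECFPFeaturizer}
SPLITTER_NAMES = {"random": RandomSplitter}


def _resolve(value, names, base, kind):
    """Accept either an instance of `base` or one of the special names."""
    if isinstance(value, base):
        return value
    if isinstance(value, str) and value in names:
        return names[value]()
    raise ValueError(f"{kind} must be a {base.__name__} or one of {sorted(names)}")


def load_example(featurizer="ECFP", splitter="random", split=None):
    # `split` is kept only as a deprecated synonym for `splitter`.
    if split is not None:
        warnings.warn(
            "'split' is deprecated; use 'splitter' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        splitter = split
    return (
        _resolve(featurizer, FEATURIZER_NAMES, Featurizer, "featurizer"),
        _resolve(splitter, SPLITTER_NAMES, Splitter, "splitter"),
    )
```

With this shape, load_example(featurizer=ECFPFeaturizer()) and load_example(featurizer="ECFP") behave the same, and a custom Featurizer subclass works without being on any list.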
I’ll take a crack at overhauling load_pdbbind so I can include it in the new tutorial on predicting protein-ligand binding with the new interaction fingerprints.