Question regarding feature metadata
First of all, thank you for putting so much effort into sharing your great work. I really appreciate that!
I am considering using this benchmark to evaluate some neural network approaches, but I found that there is no distinction between integer features that were originally categorical (education, relationship) and natural integers (rating, score, age, etc.). The latter carry ordinal information, while categorical features often don’t. In neural nets this distinction is rather important. My question is: is it possible to retrieve the original feature dtypes? In other words, is it possible to distinguish between categorical and quantitative integers?
For instance, in the Irish dataset we have “Prestige_score:discrete” and “Type_school:discrete”. Both are integers, though “Type_school” is categorical while “Prestige_score” is quantitative.
I could make use of the original datasets as well if you have them.
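To make the distinction concrete, here is a minimal sketch using pandas. The two column names come from the Irish example above. The values and the hand-written list of categorical columns are assumptions for illustration only, and nothing here is read from PMLB itself.

```python
import pandas as pd

# Toy rows; the column names follow the Irish example, the values are made up.
df = pd.DataFrame({
    "Prestige_score": [28, 50, 37, 28],  # quantitative: order and distance are meaningful
    "Type_school": [0, 2, 1, 2],         # categorical: the integer codes are arbitrary labels
})

# Assumption: this list would ideally come from dataset metadata; here it is written by hand.
categorical = ["Type_school"]

# One-hot encode the categorical codes so that no ordering is implied.
encoded = pd.get_dummies(df, columns=categorical, dtype=int)

# Standardize the quantitative column, where magnitude does matter.
score = df["Prestige_score"]
encoded["Prestige_score"] = (score - score.mean()) / score.std()

print(encoded)
```

Without knowing which integer columns are categorical, both columns would have to be treated the same way, which is exactly the problem described in the question.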
Issue Analytics
- Created 3 years ago
- Comments: 5
Top GitHub Comments
@lacava @trang1618 Thank you for the answer! I took a look at the referenced branch, and I guess one should update the `metadata.yaml` to include that information, maybe something like the example below:
I can try to help retrieve that kind of information and submit a PR whenever PMLB 2.0 is stable enough.
A possible way to do this could be something like:
I know that `fetch_data` has the default `dropna=True`, but in case there are `np.nan` values present you can do something like this and then apply an encoder depending on the dtype. Maybe someone from the Epistasis team can vet this. Here the YAML-parsed metadata is pulled in via an external URL, but that might not be necessary if something similar gets incorporated into the package.
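The snippet originally posted with this comment is not reproduced here. As a rough sketch of the idea (not the commenter's actual code), the following casts columns according to a per-feature type mapping and then picks an encoder from the resulting dtype. The `feature_types` dictionary, its `"categorical"` label, and the helper `apply_feature_types` are hypothetical; the real `metadata.yaml` layout may differ. A small hand-made frame stands in for the output of `fetch_data` with `dropna=False`, so the example runs without network access.

```python
import numpy as np
import pandas as pd

# Hypothetical result of parsing metadata.yaml: feature name -> type.
# The "categorical" label is an assumption; the metadata quoted above only says "discrete".
feature_types = {"Prestige_score": "discrete", "Type_school": "categorical"}

# Stand-in for a frame loaded with dropna=False. Values are made up;
# the NaN entries force the integer columns to float.
df = pd.DataFrame({
    "Prestige_score": [28.0, np.nan, 37.0],
    "Type_school": [0.0, 2.0, np.nan],
    "target": [1, 0, 1],
})


def apply_feature_types(frame, types):
    """Cast columns according to the type mapping, keeping missing values."""
    out = frame.copy()
    for name, kind in types.items():
        # Nullable Int64 restores integer values while keeping NaN as <NA>.
        ints = out[name].astype("Int64")
        out[name] = ints.astype("category") if kind == "categorical" else ints
    return out


typed = apply_feature_types(df, feature_types)

# Choose the encoder from the dtype: one-hot for categories, leave the rest numeric.
categorical_columns = list(typed.select_dtypes(include="category").columns)
encoded = pd.get_dummies(typed, columns=categorical_columns, dtype=int)

print(typed.dtypes)
print(encoded)
```

Rows with a missing category get all-zero indicator columns here; whether to impute or add an explicit "missing" indicator instead is a modelling choice left open.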