Question regarding feature metadata

First of all, thank you for putting so much effort into sharing your great work. I really appreciate it!

I am considering using this benchmark to evaluate some neural network approaches, but I found that there is no distinction between integer features that were originally categorical (education, relationship) and naturally ordered integers (rating, score, age, etc.). The latter carry ordering information, while categorical features often don't. For neural nets this distinction is rather important. My question is: is it possible to retrieve the original feature dtypes? In other words, is it possible to distinguish between categorical and quantitative integers?

For instance, in the Irish dataset we have “Prestige_score:discrete” and “Type_school:discrete”. Both are integers, though “Type_school” is categorical while “Prestige_score” is quantitative.

I could make use of the original datasets as well if you have them.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5

Top GitHub Comments

2 reactions
jwehrmann commented, May 19, 2020

@lacava @trang1618 Thank you for the answer! I took a look at the referenced branch, and I guess one should update metadata.yaml to include that information, maybe something like the example below:

- name: age
  type: continuous
  description: null # optional but recommended, what the feature measures/indicates, unit
  code: null # optional, coding information, e.g., Control = 0, Case = 1
  transform: ~ # optional, any transformation performed on the feature, e.g., log scaled
  nature: ordinal 
- name: workclass
  type: discrete
  nature: categorical

I can try to help retrieve that kind of information and submit a PR whenever PMLB 2.0 is stable enough.
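To illustrate how a `nature` field like the one proposed above could be consumed, here is a minimal sketch. It assumes the metadata has already been parsed into a list of dicts (the shape a YAML parser would typically produce for the example). The `nature` key is only the proposal from this comment, not a confirmed part of PMLB's metadata, and `group_by_nature` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical parsed metadata mirroring the YAML example above.
# The "nature" key is the proposal from this thread, not an existing PMLB field.
features = [
    {"name": "age", "type": "continuous", "nature": "ordinal"},
    {"name": "workclass", "type": "discrete", "nature": "categorical"},
]


def group_by_nature(features):
    """Map each 'nature' value to the list of feature names that carry it."""
    groups = defaultdict(list)
    for feature in features:
        # Features without the proposed key are collected under "unknown".
        groups[feature.get("nature", "unknown")].append(feature["name"])
    return dict(groups)


groups = group_by_nature(features)
```

With a grouping like this, a model pipeline could send the `categorical` columns to an embedding or one-hot encoder and treat the `ordinal` columns as numeric inputs.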

1 reaction
flacle commented, Feb 16, 2022

A possible way to do this could be something like:

import urllib.request

import yaml  # PyYAML

# Check the metadata and apply a suitable pandas dtype that supports NaN:
#   categorical/binary -> category with strings of integers
#   ordinal            -> object with integers
#   continuous         -> keep float64, which supports NaN by default
# floatToStrCol and floatToIntCol are helpers that are not shown in this comment.
def applyDType(df, dfName):
  url = 'https://raw.githubusercontent.com/EpistasisLab/pmlb/master/datasets/'
  url = url + dfName + '/metadata.yaml'
  with urllib.request.urlopen(url) as dsmd:
    # safe_load needs no Loader argument, unlike plain yaml.load
    dsyl = yaml.safe_load(dsmd)['features']
  for c in df.columns:
    for f in dsyl:
      if c == f['name']:
        ft = f['type']
        if ft == 'categorical' or ft == 'binary':
          df[c] = df[c].astype('category')
          df[c] = floatToStrCol(df[c])
        elif ft == 'ordinal':
          df[c] = df[c].astype('object')
          df[c] = floatToIntCol(df[c])
  return df.copy(deep=True)

I know that fetch_data has the default dropna=True, but if np.nan values are present you can do something like this and then apply an encoder depending on the dtype. Maybe someone from the Epistasis team can vet this. Here the parsed YAML metadata is pulled in from an external URL, but that might not be necessary if something similar gets incorporated into the package.
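As a self-contained variation on the idea above, the sketch below applies NaN-tolerant dtypes from metadata that is already in memory, so it needs no network access. It swaps the unshown helpers for pandas' nullable `Int64` dtype, which is a different technique from the snippet in this comment. The type labels (`categorical`, `binary`, `ordinal`, `continuous`) follow that snippet and are assumed to match the metadata. `apply_dtypes` is a hypothetical name, the column names come from the Irish example in the question, and the values are made up.

```python
import numpy as np
import pandas as pd


def apply_dtypes(df, features):
    """Return a copy of df with NaN-tolerant dtypes chosen from feature metadata."""
    out = df.copy()
    types = {f["name"]: f["type"] for f in features}
    for col in out.columns:
        ftype = types.get(col)
        if ftype in ("categorical", "binary"):
            # Go through nullable Int64 so 1.0 becomes 1, then to category.
            out[col] = out[col].astype("Int64").astype("category")
        elif ftype == "ordinal":
            # Nullable integer dtype keeps missing values as <NA>.
            out[col] = out[col].astype("Int64")
        # Continuous (or unlisted) columns are left as they are.
    return out


# Made-up values; column names borrowed from the question's Irish example.
df = pd.DataFrame({
    "Type_school": [1.0, 2.0, np.nan],
    "Prestige_score": [28.0, np.nan, 51.0],
    "x": [0.5, 1.5, np.nan],
})
features = [
    {"name": "Type_school", "type": "categorical"},
    {"name": "Prestige_score", "type": "ordinal"},
    {"name": "x", "type": "continuous"},
]
out = apply_dtypes(df, features)
```

An encoder can then be chosen per column by inspecting `out.dtypes`, for example one-hot or embeddings for `category` columns and scaling for numeric ones.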
