Question regarding feature metadata

First of all, thank you for putting so much effort into sharing your great work. I really appreciate it!

I am considering using this benchmark to evaluate some neural network approaches, but I found that there is no distinction between integer features that were originally categorical (education, relationship) and features that are naturally integer-valued (rating, score, age, etc.). The latter carry ordering information, while categorical features often do not. For neural nets this distinction is rather important. My question is: is it possible to retrieve the original feature dtypes? In other words, is it possible to distinguish between categorical and quantitative integers?

For instance, in the Irish dataset we have “Prestige_score:discrete” and “Type_school:discrete”. Both are integers, though “Type_school” is categorical while “Prestige_score” is quantitative.

I could make use of the original datasets as well if you have them.
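To illustrate why the distinction matters for a neural net, here is a minimal sketch of how the two kinds of integer column would typically be prepared differently. The values are made up for illustration and are not taken from the Irish dataset; `one_hot` and `standardize` are hypothetical helper names.

```python
from statistics import mean, pstdev

# Illustrative values only; not taken from the Irish dataset.
type_school = [0, 2, 1, 2]          # categorical codes: no meaningful order
prestige_score = [28, 51, 37, 62]   # quantitative: order and distance matter


def one_hot(codes):
    """Encode integer category codes as one-hot vectors."""
    levels = sorted(set(codes))
    return [[1.0 if c == level else 0.0 for level in levels] for c in codes]


def standardize(values):
    """Rescale quantitative values to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


school_inputs = one_hot(type_school)            # one input unit per category
prestige_inputs = standardize(prestige_score)   # a single scaled input unit
```

Feeding `type_school` to the network as a raw integer would imply that school type 2 is "twice" school type 1, which is why the metadata needs to say which treatment applies.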

Top GitHub Comments

jwehrmann commented, May 19, 2020

@lacava @trang1618 Thank you for the answer! I took a look at the referred branch, and I guess one should update the metadata.yaml to include that information, maybe something like the example below:

- name: age
  type: continuous
  description: null # optional but recommended, what the feature measures/indicates, unit
  code: null # optional, coding information, e.g., Control = 0, Case = 1
  transform: ~ # optional, any transformation performed on the feature, e.g., log scaled
  nature: ordinal 
- name: workclass
  type: discrete
  nature: categorical

I can try to help retrieve that kind of information and submit a PR whenever PMLB 2.0 is stable enough.
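If the metadata carried such a field, a consumer could group features by it. The sketch below assumes the entries have already been parsed into a list of dicts (the shape a YAML parser would return for the example above); the `nature` key is the proposal from this comment, not an existing PMLB field, and `split_by_nature` is a hypothetical name.

```python
# Parsed form of the proposed entries above. The `nature` key is the
# commenter's proposal, not an existing PMLB metadata field.
features = [
    {"name": "age", "type": "continuous", "nature": "ordinal"},
    {"name": "workclass", "type": "discrete", "nature": "categorical"},
]


def split_by_nature(features):
    """Group feature names by `nature`; unlabeled features go under 'unknown'."""
    groups = {}
    for feature in features:
        nature = feature.get("nature", "unknown")
        groups.setdefault(nature, []).append(feature["name"])
    return groups


groups = split_by_nature(features)
```

The resulting groups can then drive the choice of encoder per column.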

flacle commented, Feb 16, 2022

A possible way to do this could be something like:

# Check the metadata and apply a suitable pandas dtype that supports NaN:
#   categorical/binary -> category with strings of integers
#   ordinal            -> object with integers
#   continuous         -> keep float64, which supports NaN by default
# floatToStrCol and floatToIntCol are helper functions not shown in this comment.
import urllib.request

import yaml


def applyDType(df, dfName):
  url = 'https://raw.githubusercontent.com/EpistasisLab/pmlb/master/datasets/'
  url = url + dfName + '/metadata.yaml'
  # safe_load avoids the explicit Loader argument that yaml.load requires in PyYAML 6+
  with urllib.request.urlopen(url) as dsmd:
    dsyl = yaml.safe_load(dsmd)['features']
  # index feature types by name instead of scanning the list for every column
  ftypes = {f['name']: f['type'] for f in dsyl}
  for c in df.columns:
    ft = ftypes.get(c)
    if ft in ('categorical', 'binary'):
      df[c] = df[c].astype('category')
      df[c] = floatToStrCol(df[c])
    elif ft == 'ordinal':
      df[c] = df[c].astype('object')
      df[c] = floatToIntCol(df[c])
  return df.copy(deep=True)

I know that fetch_data defaults to dropna=True, but if np.nan values are present you can do something like this and then apply an encoder depending on the dtype. Maybe someone from the Epistasis team can vet this. Here the parsed YAML metadata is pulled in from an external URL, but that might not be necessary if something similar gets incorporated into the package.
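The snippet above calls floatToStrCol and floatToIntCol without showing them. The following is only a guess at what such helpers might look like, based on the comments in the snippet ("strings of integers" for categories, "object with integers" for ordinals, NaN preserved); the actual implementations were not posted.

```python
import pandas as pd


def floatToIntCol(col):
    """Hypothetical helper: object column holding Python ints, NaN left as missing."""
    values = col.astype("float64")
    converted = [v if pd.isna(v) else int(v) for v in values]
    # dtype="object" stops pandas from upcasting the ints back to float64
    return pd.Series(converted, index=col.index, dtype="object")


def floatToStrCol(col):
    """Hypothetical helper: category column labeled with strings of integers."""
    values = col.astype("float64")
    converted = [v if pd.isna(v) else str(int(v)) for v in values]
    return pd.Series(converted, index=col.index, dtype="object").astype("category")


example = pd.Series([1.0, float("nan"), 3.0])
as_int = floatToIntCol(example)   # object dtype: 1, NaN, 3
as_str = floatToStrCol(example)   # category dtype with labels '1' and '3'
```

Both helpers cast to float64 first so they also accept a column that has already been converted to category or object, as happens in applyDType.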
