Move DatasetInfo from `dataset_infos.json` to the YAML tags in `README.md`

Currently there are two places to find metadata for datasets:

- `dataset_infos.json`, which contains, per dataset config:
  - description
  - citation
  - license
  - splits and sizes
  - checksums of the data files
  - feature types
  - and more
- YAML tags, which contain:
  - license
  - language
  - train-eval-index
  - and more

It would be nice to have a single place instead. We can rely on the YAML tags more than the JSON file for consistency with models. And it would all be indexed by our back-end directly, which is nice to have.

One way would be to move everything to the YAML tags except the checksums (there can be tens of thousands of them). The description and citation are already in the dataset card, so we probably don’t need them in the YAML tags as well; that would be redundant.

Here is an example for SQuAD:

download_size: 35142551
dataset_size: 89789763
version: 1.0.0
splits:
- name: train
  num_examples: 87599
  num_bytes: 79317110
- name: validation
  num_examples: 10570
  num_bytes: 10472653
features:
- name: id
  dtype: string
- name: title
  dtype: string
- name: context
  dtype: string
- name: question
  dtype: string
- name: answers
  struct:
  - name: text
    list:
      dtype: string
  - name: answer_start
    list:
      dtype: int32
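
As a quick sanity check on the numbers above, here is a minimal Python sketch. It assumes `dataset_size` is meant to be the sum of the per-split `num_bytes`, which holds for this example. The dict is only a hand-written mirror of the proposed YAML, and `split_bytes_total` and `sizes_are_consistent` are hypothetical helpers, not part of the `datasets` library.

```python
# Values copied from the SQuAD example above. The dict layout mirrors the
# proposed YAML by hand; it is not an API of the `datasets` library.
squad_info = {
    "download_size": 35142551,
    "dataset_size": 89789763,
    "version": "1.0.0",
    "splits": [
        {"name": "train", "num_examples": 87599, "num_bytes": 79317110},
        {"name": "validation", "num_examples": 10570, "num_bytes": 10472653},
    ],
}


def split_bytes_total(info: dict) -> int:
    """Sum `num_bytes` over all splits (hypothetical helper)."""
    return sum(split["num_bytes"] for split in info["splits"])


def sizes_are_consistent(info: dict) -> bool:
    """Assumption: `dataset_size` should equal the sum of the split sizes."""
    return info["dataset_size"] == split_bytes_total(info)


print(split_bytes_total(squad_info))     # 89789763
print(sizes_are_consistent(squad_info))  # True
```

A check like this is one thing a validator could run once the sizes live in the YAML tags.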

Since there is only one configuration for SQuAD, this structure is OK. For datasets with several configs, we can decide in a second step, but IMO it would be OK to have these fields per config using another syntax:

configs:
- config: unlabeled
  splits:
  - name: train
    num_examples: 10000
  features:
  - name: text
    dtype: string
- config: labeled
  splits:
  - name: train
    num_examples: 100
  features:
  - name: text
    dtype: string
  - name: label
    dtype: ClassLabel
    names:
    - negative
    - positive

So in the end you could specify a YAML tag either at the top level (for all configs) or per config in the `configs` field.
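
To make the "top level or per config" idea concrete, here is a rough Python sketch of how a reader of these tags might resolve the effective metadata for one config. It assumes that per-config values override top-level ones, which the proposal does not specify. The `license` value is a made-up placeholder, and `resolve_config` is a hypothetical helper, not an existing `datasets` function.

```python
from copy import deepcopy

# Hand-written mirror of the proposed YAML. Top-level tags apply to every
# config; entries under `configs` add (or, by assumption, override) fields.
metadata = {
    "license": "cc-by-4.0",  # placeholder value, not taken from the issue
    "configs": [
        {
            "config": "unlabeled",
            "splits": [{"name": "train", "num_examples": 10000}],
            "features": [{"name": "text", "dtype": "string"}],
        },
        {
            "config": "labeled",
            "splits": [{"name": "train", "num_examples": 100}],
            "features": [
                {"name": "text", "dtype": "string"},
                {
                    "name": "label",
                    "dtype": "ClassLabel",
                    "names": ["negative", "positive"],
                },
            ],
        },
    ],
}


def resolve_config(metadata: dict, config_name: str) -> dict:
    """Merge top-level tags with the tags of one config (hypothetical)."""
    # Start from everything that is shared by all configs.
    resolved = {
        key: deepcopy(value)
        for key, value in metadata.items()
        if key != "configs"
    }
    for entry in metadata.get("configs", []):
        if entry.get("config") == config_name:
            # Assumption: per-config values win over top-level ones.
            resolved.update(deepcopy(entry))
            return resolved
    raise KeyError(f"unknown config: {config_name!r}")


labeled = resolve_config(metadata, "labeled")
print(labeled["license"], labeled["splits"][0]["num_examples"])
```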

Alternatively, we could keep config-specific stuff in `dataset_infos.json` as it is today.

Not sure yet what’s the best approach here but cc @julien-c @mariosasko @albertvillanova @polinaeterna for feedback 😃

Top GitHub Comments

5 reactions

julien-c commented, Aug 26, 2022

Very supportive of this!

Nesting an array of configs inside `dataset_infos:` sounds good to me. One small tweak is that `config: default` can be optional for the default config (which can be the first one by convention).

We’ll be able to implement metadata validation on the Hub side so we ensure that those metadata are always in the right format (maybe for @coyotte508 ? cc @Pierrci). From a quick glance the features might be the harder part to validate here, any doc will be welcome.
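
To give a feel for why features are the tricky part, here is a minimal Python sketch of a recursive check over the `name` / `dtype` / `struct` / `list` layout used in the YAML examples of this issue. It is not the Hub's validation code: `SCALAR_DTYPES` is an assumed, non-exhaustive set, and all function names are hypothetical.

```python
# Assumed, non-exhaustive set of scalar dtypes accepted by this sketch.
SCALAR_DTYPES = {"string", "int32", "int64", "float32", "float64", "bool"}


def validate_feature(feature, path):
    """Return error messages for one unnamed feature (empty list if valid)."""
    if not isinstance(feature, dict):
        return [f"{path}: expected a mapping"]
    kinds = [key for key in ("dtype", "struct", "list") if key in feature]
    if len(kinds) != 1:
        return [f"{path}: expected exactly one of dtype/struct/list"]
    kind = kinds[0]
    errors = []
    if kind == "dtype":
        dtype = feature["dtype"]
        if dtype == "ClassLabel":
            names = feature.get("names")
            valid_names = (
                isinstance(names, list)
                and len(names) > 0
                and all(isinstance(name, str) for name in names)
            )
            if not valid_names:
                errors.append(f"{path}: ClassLabel needs a list of names")
        elif dtype not in SCALAR_DTYPES:
            errors.append(f"{path}: unknown dtype {dtype!r}")
    elif kind == "struct":
        fields = feature["struct"]
        if not isinstance(fields, list):
            errors.append(f"{path}: struct must be a list of named fields")
        else:
            for field in fields:
                errors.extend(validate_named_feature(field, path))
    else:  # kind == "list": recurse into the element type
        errors.extend(validate_feature(feature["list"], f"{path}[]"))
    return errors


def validate_named_feature(feature, path):
    """Validate a feature entry that must carry a string `name`."""
    if not isinstance(feature, dict) or not isinstance(feature.get("name"), str):
        return [f"{path}: every field needs a string name"]
    body = {key: value for key, value in feature.items() if key != "name"}
    return validate_feature(body, f"{path}.{feature['name']}")


def validate_features(features):
    """Validate the top-level `features` list."""
    errors = []
    for feature in features:
        errors.extend(validate_named_feature(feature, "features"))
    return errors


# The `answers` feature from the SQuAD example, written as Python data.
answers = {
    "name": "answers",
    "struct": [
        {"name": "text", "list": {"dtype": "string"}},
        {"name": "answer_start", "list": {"dtype": "int32"}},
    ],
}
print(validate_features([answers]))
print(validate_features([{"name": "label", "dtype": "ClassLabel"}]))
```

The recursion over `struct` and `list` is what makes this harder to express than flat tags such as `license`.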

Other high-level points:

- as we move from mostly academic datasets to all datasets (which include the data inside the repos), my intuition is that more and more datasets (Hub-stored) are going to be single-config
- similarly, fewer and fewer datasets will have a loading script, just the data + some metadata
- to lower the barrier to entry to contribution, in the long term users shouldn’t need to compute/update this data via a command line. It could be filled automatically on the Hub through a “bot” inside Discussions & Pull requests for instance.

2 reactions

julien-c commented, Aug 26, 2022

Note also that the default config is not named `default`, afaiu, but created from the repo name

in case of single-config you can even hide the config name from the UI IMO

I dug into features validation, see: the OpenAPI spec

in moon-landing we use Joi to validate metadata, so we would need to generate Joi code from the OpenAPI spec (or from somewhere else), but I guess that’s doable. Or we could just rewrite it manually, as it won’t change often