Move DatasetInfo from `dataset_infos.json` to the YAML tags in `README.md`
Currently there are two places to find metadata for datasets:
- `dataset_infos.json`, which contains, per dataset config:
  - description
  - citation
  - license
  - splits and sizes
  - checksums of the data files
  - feature types
  - and more
- YAML tags, which contain:
  - license
  - language
  - train-eval-index
  - and more
It would be nice to have a single place instead. We can rely on the YAML tags more than the JSON file for consistency with models. And it would all be indexed by our back-end directly, which is nice to have.
One way would be to move everything to the YAML tags except the checksums (there can be tens of thousands of them). The description and citation are already in the dataset card, so we probably don’t need them in the YAML tags; that would be redundant.
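A minimal sketch of that split, for illustration only. The helper `to_yaml_tags`, the `EXCLUDED_KEYS` set, and the sample values are hypothetical; the key names are assumed to match what `dataset_infos.json` stores per config.

```python
import json

# Keys assumed to stay out of the YAML tags, per the proposal:
# checksums are too numerous, and description/citation already
# live in the body of the dataset card.
EXCLUDED_KEYS = {"download_checksums", "description", "citation"}


def to_yaml_tags(config_info: dict) -> dict:
    """Return the subset of one config's info that would move to the YAML tags."""
    return {key: value for key, value in config_info.items() if key not in EXCLUDED_KEYS}


# Hypothetical, heavily trimmed stand-in for one entry of dataset_infos.json.
sample_info = {
    "description": "placeholder description",
    "citation": "placeholder citation",
    "download_size": 35142551,
    "dataset_size": 89789763,
    "download_checksums": {
        "https://example.com/train.json": {"num_bytes": 1, "checksum": "abc"},
    },
}

yaml_tags = to_yaml_tags(sample_info)
print(json.dumps(yaml_tags, indent=2))
```

Only the size fields survive here; the result could then be dumped into the YAML header of `README.md` with any YAML library.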
Here is an example for SQuAD:

    download_size: 35142551
    dataset_size: 89789763
    version: 1.0.0
    splits:
    - name: train
      num_examples: 87599
      num_bytes: 79317110
    - name: validation
      num_examples: 10570
      num_bytes: 10472653
    features:
    - name: id
      dtype: string
    - name: title
      dtype: string
    - name: context
      dtype: string
    - name: question
      dtype: string
    - name: answers
      struct:
      - name: text
        list:
          dtype: string
      - name: answer_start
        list:
          dtype: int32

Since there is only one configuration for SQuAD, this structure is OK. For datasets with several configs we can decide in a second step, but IMO it would be fine to have these fields per config using another syntax:

    configs:
    - config: unlabeled
      splits:
      - name: train
        num_examples: 10000
      features:
      - name: text
        dtype: string
    - config: labeled
      splits:
      - name: train
        num_examples: 100
      features:
      - name: text
        dtype: string
      - name: label
        dtype: ClassLabel
        names:
        - negative
        - positive

So in the end you could specify a YAML tag either at the top level (for all configs) or per config in the `configs` field.
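To make the top-level versus per-config idea concrete, here is a rough sketch of how a reader of the tags might resolve the effective metadata for one config. The function `resolve_config_tags` is hypothetical, and the rule that per-config values override top-level ones is an assumption, not something decided in this issue. The `license` value is added purely for illustration.

```python
def resolve_config_tags(tags: dict, config_name: str) -> dict:
    """Merge top-level tags with the entry for `config_name` under `configs`.

    Assumption: a per-config value wins over a top-level value with the same key.
    """
    top_level = {key: value for key, value in tags.items() if key != "configs"}
    for entry in tags.get("configs", []):
        if entry.get("config") == config_name:
            per_config = {key: value for key, value in entry.items() if key != "config"}
            return {**top_level, **per_config}
    raise KeyError(f"unknown config: {config_name!r}")


# Parsed form of YAML tags shaped like the multi-config example,
# plus a top-level `license` that applies to all configs.
parsed_tags = {
    "license": "cc-by-4.0",
    "configs": [
        {"config": "unlabeled", "splits": [{"name": "train", "num_examples": 10000}]},
        {"config": "labeled", "splits": [{"name": "train", "num_examples": 100}]},
    ],
}

labeled = resolve_config_tags(parsed_tags, "labeled")
print(labeled["license"], labeled["splits"][0]["num_examples"])
```

With this precedence, shared fields such as the license are written once, and only the fields that differ need to be repeated per config.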
Alternatively, we could keep config-specific stuff in `dataset_infos.json`, as it is today.
Not sure yet what’s the best approach here but cc @julien-c @mariosasko @albertvillanova @polinaeterna for feedback 😃
Top GitHub Comments
Very supportive of this!
Nesting an array of configs inside `dataset_infos:` sounds good to me. One small tweak is that `config: default` can be optional for the default config (which can be the first one by convention).
We’ll be able to implement metadata validation on the Hub side so we ensure that those metadata are always in the right format (maybe for @coyotte508? cc @Pierrci). From a quick glance the `features` might be the harder part to validate here, so any doc will be welcome.
Other high-level points:
- in case of single-config you can even hide the config name from the UI IMO
- in moon-landing we use Joi to validate metadata, so we would need to generate Joi code from the OpenAPI spec (or from somewhere else), but I guess that’s doable, or we could just rewrite it manually, as it won’t change often
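As a rough illustration of why `features` is the harder part to validate: the structure is recursive (`struct` and `list` can nest). The sketch below is in Python rather than Joi and is not the Hub's validation code. The function names and the `SCALAR_DTYPES` set are assumptions made for this example, and the dtype list is deliberately incomplete.

```python
# Assumed, incomplete set of scalar dtypes; the real list is longer.
SCALAR_DTYPES = {"string", "int32", "int64", "float32", "bool"}


def validate_feature_type(node, path: str) -> list:
    """Return error messages for one feature-type node (empty list if valid)."""
    if not isinstance(node, dict):
        return [f"{path}: expected a mapping"]
    if "dtype" in node:
        dtype = node["dtype"]
        if dtype == "ClassLabel":
            names = node.get("names")
            if not isinstance(names, list) or not names:
                return [f"{path}: ClassLabel needs a non-empty 'names' list"]
            return []
        if dtype not in SCALAR_DTYPES:
            return [f"{path}: unknown dtype {dtype!r}"]
        return []
    if "list" in node:
        # A list wraps exactly one inner feature type.
        return validate_feature_type(node["list"], f"{path}[]")
    if "struct" in node:
        # A struct is a list of named sub-features.
        return validate_features(node["struct"], path)
    return [f"{path}: expected one of 'dtype', 'list', 'struct'"]


def validate_features(features, prefix: str = "") -> list:
    """Validate a list of named features, recursing into nested types."""
    if not isinstance(features, list):
        return [f"{prefix or '<root>'}: expected a list of features"]
    errors = []
    for feature in features:
        name = feature.get("name") if isinstance(feature, dict) else None
        if not isinstance(name, str):
            errors.append(f"{prefix or '<root>'}: feature without a 'name'")
            continue
        path = f"{prefix}.{name}" if prefix else name
        errors.extend(validate_feature_type(feature, path))
    return errors


# The nested `answers` feature from the SQuAD example, as parsed YAML.
squad_answers = [
    {"name": "answers", "struct": [
        {"name": "text", "list": {"dtype": "string"}},
        {"name": "answer_start", "list": {"dtype": "int32"}},
    ]},
]
# A ClassLabel that forgot its `names`.
bad_label = [{"name": "label", "dtype": "ClassLabel"}]

print(validate_features(squad_answers))
print(validate_features(bad_label))
```

A schema written by hand in Joi would need the same kind of self-reference for `struct` and `list`, which is the part most likely to need documentation.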