
Move DatasetInfo from `datasets_infos.json` to the YAML tags in `README.md`


Currently there are two places to find metadata for datasets:

  • `dataset_infos.json`, which contains, per dataset config:
    • description
    • citation
    • license
    • splits and sizes
    • checksums of the data files
    • feature types
    • and more
  • YAML tags, which contain
    • license
    • language
    • train-eval-index
    • and more

It would be nice to have a single place instead. We can rely on the YAML tags more than the JSON file for consistency with models. And it would all be indexed by our back-end directly, which is nice to have.

One way would be to move everything to the YAML tags except the checksums (there can be tens of thousands of them). The description and citation are already in the dataset card, so we probably don’t need them in the YAML as well; it would be redundant.

Here is an example for SQuAD:


download_size: 35142551
dataset_size: 89789763
version: 1.0.0
splits:
- name: train
  num_examples: 87599
  num_bytes: 79317110
- name: validation
  num_examples: 10570
  num_bytes: 10472653
features:
- name: id
  dtype: string
- name: title
  dtype: string
- name: context
  dtype: string
- name: question
  dtype: string
- name: answers
  struct:
  - name: text
    list:
      dtype: string
  - name: answer_start
    list:
      dtype: int32
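
If this metadata lives in the README’s YAML front matter, a tool only needs to pull out the block between the leading `---` fences before handing it to a YAML parser. A minimal sketch (the helper name is hypothetical, not part of the `datasets` library):

```python
# Sketch: extracting the YAML front-matter block from a README.md.
# Illustrative helper only; the real tooling would live in `datasets`/the Hub.

def extract_front_matter(readme_text: str) -> str:
    """Return the raw YAML between the leading '---' fences, or '' if absent."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return ""
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return "\n".join(lines[1:i])
    return ""

readme = """---
download_size: 35142551
dataset_size: 89789763
version: 1.0.0
---
# SQuAD
"""
print(extract_front_matter(readme))
```

The extracted string can then be fed to any YAML loader; the splitting itself needs no YAML dependency.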

Since there is only one configuration for SQuAD, this structure is ok. For datasets with several configs we can decide in a second step, but IMO it would be ok to have these fields per config, using another syntax:

configs:
- config: unlabeled
  splits:
  - name: train
    num_examples: 10000
  features:
  - name: text
    dtype: string
- config: labeled
  splits:
  - name: train
    num_examples: 100
  features:
  - name: text
    dtype: string
  - name: label
    dtype: ClassLabel
    names:
    - negative
    - positive

So in the end you could specify a YAML tag either at the top level (for all configs) or per config in the `configs` field.
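
The top-level-vs-per-config idea could be resolved roughly like this (an illustrative Python helper; the merge semantics are an assumption, since the issue doesn’t pin them down):

```python
# Sketch: how top-level YAML tags could be merged with per-config
# overrides. Illustrative only; not a settled resolution rule.

def resolve_config(metadata: dict, config_name: str) -> dict:
    """Top-level keys apply to all configs; per-config entries override them."""
    top_level = {k: v for k, v in metadata.items() if k != "configs"}
    for cfg in metadata.get("configs", []):
        if cfg.get("config") == config_name:
            return {**top_level, **cfg}
    return top_level

metadata = {
    "license": "cc-by-4.0",  # top-level tag, applies to every config
    "configs": [
        {"config": "labeled",
         "splits": [{"name": "train", "num_examples": 100}]},
    ],
}
resolved = resolve_config(metadata, "labeled")
print(resolved["license"], len(resolved["splits"]))
```

Under this rule a config inherits every top-level tag unless it redefines it itself.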

Alternatively, we could keep config-specific stuff in `dataset_infos.json` as it is today.

Not sure yet what the best approach is here, but cc @julien-c @mariosasko @albertvillanova @polinaeterna for feedback 😃

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 7
  • Comments: 15 (15 by maintainers)

Top GitHub Comments

5 reactions · julien-c commented, Aug 26, 2022

Very supportive of this!

Nesting an array of configs inside `dataset_infos:` sounds good to me. One small tweak: `config: default` can be optional for the default config (which, by convention, can be the first one).

We’ll be able to implement metadata validation on the Hub side so we ensure that this metadata is always in the right format (maybe for @coyotte508? cc @Pierrci). From a quick glance, the features might be the hardest part to validate here; any documentation will be welcome.

Other high-level points:

  • as we move from mostly academic datasets to all datasets (which include the data inside the repos), my intuition is that more and more (Hub-stored) datasets are going to be single-config
  • similarly, fewer and fewer datasets will have a loading script: just the data + some metadata
  • to lower the barrier to entry to contribution, in the long term users shouldn’t need to compute/update this data via a command line. It could be filled automatically on the Hub through a “bot” inside Discussions & Pull requests for instance.
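
The structural validation mentioned above would run on the Hub side (with Joi, per the comment below); the following Python sketch only illustrates the kind of shape checks involved, and every rule in it is an assumption:

```python
# Sketch of structural checks a metadata validator might apply to the
# proposed YAML fields. Illustrative only; the real schema would be a
# Joi schema in moon-landing, not this code.

def validate_split(split: dict) -> list[str]:
    """Check one entry of the `splits` list."""
    errors = []
    if not isinstance(split.get("name"), str):
        errors.append("split.name must be a string")
    if not isinstance(split.get("num_examples"), int):
        errors.append("split.num_examples must be an integer")
    return errors

def validate_config(config: dict) -> list[str]:
    """Check one entry of the `configs` list."""
    errors = []
    for split in config.get("splits", []):
        errors.extend(validate_split(split))
    for feature in config.get("features", []):
        if "name" not in feature:
            errors.append("each feature needs a name")
    return errors

metadata = {
    "configs": [
        {"config": "unlabeled",
         "splits": [{"name": "train", "num_examples": 10000}],
         "features": [{"name": "text", "dtype": "string"}]},
    ]
}
for cfg in metadata["configs"]:
    print(cfg["config"], "->", validate_config(cfg) or "ok")
```

Collecting errors into a list rather than raising on the first one mirrors how schema validators typically report all problems at once.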

2 reactions · julien-c commented, Aug 26, 2022

Note also that the default config is not named `default`, AFAIU, but created from the repo name.

In the single-config case you can even hide the config name from the UI, IMO.

I dug into features validation; see the OpenAPI spec.

In moon-landing we use Joi to validate metadata, so we would need to generate the Joi code from the OpenAPI spec (or from somewhere else), but I guess that’s doable; or just rewrite it manually, as it won’t change often.
