Implement ability to define splits in metadata section of dataset card

Feature request

If you go here: https://huggingface.co/datasets/inria-soda/tabular-benchmark/tree/main you will see a bunch of folders that contain various CSV files. I’d like the dataset viewer to show these files instead of only one dataset, as it currently does, and I’d also like people to be able to load them as splits instead of loading through data_files. For example, GLUE has various splits on the viewer, but it’s overkill to ask people to implement a loading script, so it would be better to let them define these in the README file instead.

Also pinging @polinaeterna @lhoestq @adrinjalali

Issue Analytics

State:
Created 10 months ago
Reactions: 3
Comments: 7 (7 by maintainers)

Top GitHub Comments

3 reactions

polinaeterna commented, Nov 30, 2022

@merveenoyan ignore my comment above, I’m switching to this task now 😄

2 reactions

lhoestq commented, Nov 9, 2022

We can add a new metadata YAML field (say, “custom_configs_info”), so that we can provide something like:

Love it! Some other ideas to name the “custom_configs_info” field: “configs”, “parameters”, “config_args”, “configurations”

It might require changes in the interaction with the viewer on the Hub side to parse these configurations, as they are not default configurations (not in the BUILDER_CONFIGS list)

If we update the get_dataset_config_names() function in inspect.py in datasets, we should be fine, since that’s what the viewer is using

Overall, I would start from implementing the first solution since it’s related to what I’m doing now and is super useful for datasets in general. And then if we agree that having more flexibility in providing parameters to the viewer is required, I can implement the second one. Let me know what you think 😃

Actually I feel like the second solution includes the first use case you mentioned. If you implement the second solution, then users would just have to add a few lines of YAML and their directories would be considered configurations, no? Maybe there’s no need to implement two different logics to do the same thing
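
To make the "directories would be considered configurations" idea concrete, here is a minimal sketch in plain Python. It is not the datasets implementation and does not touch the library at all: the function name infer_configs, the entry keys "name" and "data_files", and the folder names in the demo are all hypothetical, chosen only for illustration. It treats each immediate subdirectory that holds CSV files as one configuration, which is roughly the information a few lines of YAML in the README would need to carry.

```python
from pathlib import Path
import tempfile


def infer_configs(root):
    """Return one hypothetical config entry per subdirectory containing CSV files.

    The keys "name" and "data_files" are illustrative assumptions, not a
    confirmed metadata schema.
    """
    configs = []
    # Only look at immediate subdirectories, in a stable (sorted) order.
    for sub in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        csv_names = sorted(f.name for f in sub.glob("*.csv"))
        if csv_names:  # skip folders without any CSV file
            configs.append(
                {
                    "name": sub.name,
                    "data_files": [f"{sub.name}/{n}" for n in csv_names],
                }
            )
    return configs


if __name__ == "__main__":
    # Build a throwaway directory layout (folder names are made up).
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        for folder, filename in [("task_a", "first.csv"), ("task_b", "second.csv")]:
            (repo / folder).mkdir()
            (repo / folder / filename).write_text("col\n1\n")
        for entry in infer_configs(repo):
            print(entry)
```

The list this returns could be written out under whichever field name is finally chosen (“configs” is one of the candidates above), so a single mechanism would cover both the automatic, directory-based case and the manually specified one.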
