Domain specific dataset discovery on the Hugging Face hub

Is your feature request related to a problem? Please describe.

The problem

The datasets hub currently has 8,239 datasets. These datasets span a wide range of different modalities and tasks (currently with a bias towards textual data).

There are various ways of identifying datasets that may be relevant for a particular use case:

searching
various filters

Currently, however, there isn’t an easy way to identify datasets belonging to a specific domain. For example, I want to browse machine learning datasets related to ‘social science’ or ‘climate change research’.

The ability to identify datasets relating to a specific domain has come up in discussions around the BigLAM datasets hackathon https://github.com/bigscience-workshop/lam/discussions/31#discussioncomment-3123610. As part of the hackathon, we’re currently collecting datasets related to Libraries, Archives and Museums and making them available via the hub. We currently do this under a Hugging Face organization (https://huggingface.co/biglam). However, going forward, I can see some of these datasets being migrated to sit under the organization that is the custodian of the dataset (for example, the national library the data originally came from). At that point, it becomes more difficult to quickly identify datasets from this domain without relying on search.

This is also related to some existing GitHub issues about metadata on the hub.

Describe the solution you’d like

Some possible solutions that may help with this:

Enable domain tags (from a controlled vocabulary)

This would add a metadata field to the YAML for the domain a dataset relates to.
Advantages:
- the list is controlled, allowing it to be more easily integrated into the datasets tag app (https://huggingface.co/space/huggingface/datasets-tagging)
- the controlled vocabulary could align with an existing controlled vocabulary
- this additional metadata can be used to perform filtering by domain
Disadvantages:
- choosing the best controlled vocab may be difficult
- there are many datasets that are likely to fit into the ‘machine learning’ domain (i.e. there is a long tail of datasets that aren’t in a more ‘generic’ machine learning domain)
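
As a rough illustration of the controlled-vocabulary idea, the sketch below checks a dataset's metadata against a fixed list of domains and filters datasets by domain. The `domain` field name, the vocabulary entries, and the dataset ids are all hypothetical; nothing here reflects an existing hub API.

```python
# Hypothetical controlled vocabulary; the real list would need to be agreed on.
DOMAIN_VOCABULARY = {"social-science", "climate-change-research", "glam"}


def validate_domains(metadata: dict) -> list[str]:
    """Return the domain tags in `metadata` that are not in the vocabulary."""
    domains = metadata.get("domain", [])  # assumed field name
    return [d for d in domains if d not in DOMAIN_VOCABULARY]


def filter_by_domain(datasets: dict[str, dict], domain: str) -> list[str]:
    """Return ids of datasets whose metadata lists `domain`, sorted for stability."""
    return sorted(
        dataset_id
        for dataset_id, metadata in datasets.items()
        if domain in metadata.get("domain", [])
    )


# Made-up dataset ids and metadata, standing in for parsed YAML headers.
datasets = {
    "org-a/newspapers": {"domain": ["glam"]},
    "org-b/survey": {"domain": ["social-science"]},
    "org-c/untagged": {},
}
```

Because the vocabulary is closed, a typo or an unknown domain can be rejected at tagging time, which is the reliability benefit; the cost is that a dataset like `org-c/untagged` stays invisible to the filter until someone assigns it a domain.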

Enable topic tags (user-generated)

Enable ‘free form’ topic tags for datasets and models. This would be closer to GitHub’s repository topics, which can be chosen from a controlled list (https://github.com/topics/) but can also be more user/org specific. This could also be useful for organizations managing their own models and datasets as the number they hold in their org grows. For example, they may create ‘topic tags’ for a specific project, so it’s clearer which datasets/models are related to that project.

Collections

This solution would likely be the biggest shift and may require significant changes in the hub frontend. Collections could work in several different ways but would include:

Users can curate particular datasets, models, spaces, etc., into a collection. For example, they may create a collection of ‘historic newspapers suitable for training language models’. These collections would not be mutually exclusive, i.e. a dataset can belong to zero, one or many collections. Collections can also potentially be nested under other collections.
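
To make the membership rules concrete, here is a minimal sketch of one way collections could be modelled. The `Collection` class, its fields, and the repo ids are hypothetical and only illustrate non-exclusive membership and nesting; they are not a proposal for the actual hub data model.

```python
from dataclasses import dataclass, field


@dataclass
class Collection:
    """A named group of hub items that may contain nested sub-collections."""

    name: str
    items: set[str] = field(default_factory=set)  # repo ids, e.g. "org/dataset"
    children: list["Collection"] = field(default_factory=list)

    def all_items(self) -> set[str]:
        """Items in this collection plus those in every nested collection."""
        found = set(self.items)
        for child in self.children:
            found |= child.all_items()
        return found


def collections_containing(item: str, collections: list[Collection]) -> list[str]:
    """Names of the given collections that contain `item`, directly or via nesting."""
    return [c.name for c in collections if item in c.all_items()]


# Made-up example: the same dataset sits in two collections, one of them nested.
newspapers = Collection("historic newspapers", items={"org-a/newspapers"})
glam = Collection("GLAM", items={"org-b/catalogue"}, children=[newspapers])
lm_ready = Collection("suitable for language models", items={"org-a/newspapers"})
```

Here `org-a/newspapers` belongs to both `glam` (through the nested `newspapers` collection) and `lm_ready`, while a dataset in no collection simply returns an empty list.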

This is fairly common on other data repositories. For example, the collections shown in the original screenshot (2022-07-18 at 11 50 44) all belong under a higher-level collection (https://bl.iro.bl.uk/collections/353c908d-b495-4413-b047-87236d2573e3?locale=en).

There are different models one could use for how these collections could be created:

only within an org
for any dataset/model
the owner of a dataset/model has to agree to it being added to a collection
a collection owner can have people suggest additions to their collection
other models…

These collections could be thematic, relate to particular training approaches, curate models with particular inference properties, etc. Whilst some of these features may duplicate current or future tag filters on the hub, they offer the advantage of being flexible and not having to predict upfront what users will want to do.

There is also potential for automating the creation of these collections based on existing metadata. For example, if we had a collection of ‘historic newspapers suitable for training language models’ that contained 30 datasets, we could create another collection, ‘historic newspaper language models’, that includes any model on the hub whose metadata says it used one or more of those 30 datasets.
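
The derivation described above could be sketched roughly as follows. This assumes model metadata has already been loaded into plain dictionaries with a `datasets` list; the function name, the metadata shape, and all ids are hypothetical.

```python
def derive_model_collection(
    dataset_collection: set[str],
    model_metadata: dict[str, dict],
) -> list[str]:
    """Return ids of models trained on at least one dataset in the collection."""
    return sorted(
        model_id
        for model_id, metadata in model_metadata.items()
        # A non-empty set intersection means the model used a collected dataset.
        if dataset_collection & set(metadata.get("datasets", []))
    )


# Made-up data: a two-dataset collection and three models with declared datasets.
historic_newspapers = {"org-a/newspapers-1800s", "org-b/newspapers-1900s"}
models = {
    "org-a/newspaper-lm": {"datasets": ["org-a/newspapers-1800s"]},
    "org-c/mixed-lm": {"datasets": ["org-b/newspapers-1900s", "org-c/web-text"]},
    "org-d/web-lm": {"datasets": ["org-c/web-text"]},
}
```

A derived collection like this is only as complete as the metadata: a model that omits its training datasets would be silently left out.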

There is also the option of exploring ML approaches to suggest models/datasets that may be relevant to a particular collection.

This approach is likely to be quite difficult to implement well and would require significant thought. There is also likely to be a benefit in doing quite a bit of upfront work in curating useful collections to demonstrate the benefits of collections.

Describe alternatives you’ve considered

It is possible to collate this information externally, i.e. one could link back to the relevant models/datasets from an external platform.

Additional context

I’m cc’ing others involved in the BigLAM hackathon who may also have thoughts @cakiki @clancyoftheoverflow @albertvillanova

Issue Analytics

State:
Created a year ago
Reactions:2
Comments:9 (9 by maintainers)

Top GitHub Comments

1 reaction

julien-c commented, Jul 19, 2022

Hi @davanstrien, if I remember correctly this was also discussed inside a hf.co Discussion. Would you be able to link it here too?

(where I suggested using tags: - foo - bar, IIRC.)

Thanks a ton!

1 reaction

albertvillanova commented, Jul 18, 2022

Thanks for opening this issue @davanstrien.

As we discussed last week, the tag approach would in principle be the simplest to implement: either the domain tag (with a closed vocabulary: more reliable but also more rigid) or the topic tag (with an open vocabulary: more flexible for user needs).
