Domain-specific dataset discovery on the Hugging Face hub
Is your feature request related to a problem? Please describe.
The problem
The datasets hub currently has 8,239 datasets. These datasets span a wide range of different modalities and tasks (currently with a bias towards textual data).
There are various ways of identifying datasets that may be relevant for a particular use case:
- searching
- various filters
Currently, however, there isn’t an easy way to identify datasets belonging to a specific domain. For example, I want to browse machine learning datasets related to ‘social science’ or ‘climate change research’.
The ability to identify datasets relating to a specific domain has come up in discussions around the BigLAM datasets hackathon https://github.com/bigscience-workshop/lam/discussions/31#discussioncomment-3123610. As part of the hackathon, we’re currently collecting datasets related to Libraries, Archives and Museums and making them available via the hub. We currently do this under a Hugging Face organization (https://huggingface.co/biglam). However, going forward, I can see some of these datasets being migrated to sit under the organization that is the custodian of the dataset (for example, the national library the data originally came from). At that point, it becomes more difficult to quickly identify datasets from this domain without relying on search.
This is also related to some existing issues on GitHub about metadata on the hub:
- https://github.com/huggingface/datasets/issues/3625
- https://github.com/huggingface/datasets/issues/3877
Describe the solution you’d like
Some possible solutions that may help with this:
Enable domain tags (from a controlled vocabulary)
- This would add a metadata field to the YAML for the domain a dataset relates to
- Advantages:
- the list is controlled, allowing it to be more easily integrated into the datasets tag app (https://huggingface.co/space/huggingface/datasets-tagging)
- the controlled vocabulary could align with an existing controlled vocabulary
- this additional metadata can be used to perform filtering by domain
- Disadvantages:
- choosing the best controlled vocab may be difficult
- there are many datasets that are likely to fit into the ‘machine learning’ domain (i.e. there is a long tail of datasets that aren’t in a more ‘generic’ machine learning domain)
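As a rough illustration of how a controlled domain field could support validation and filtering, here is a minimal sketch. It assumes a hypothetical `domain` key in each dataset card's YAML metadata; the vocabulary values, repository names, and function names are all made up for the example and are not part of any existing hub API.

```python
# Hypothetical controlled vocabulary; a real one would likely align with
# an existing external scheme rather than this made-up list.
DOMAIN_VOCABULARY = {
    "social-science",
    "climate-change-research",
    "libraries-archives-museums",
}


def invalid_domains(card_metadata):
    """Return the values of the (hypothetical) `domain` field that are not in the vocabulary."""
    return [d for d in card_metadata.get("domain", []) if d not in DOMAIN_VOCABULARY]


def filter_by_domain(cards, domain):
    """Return the ids of datasets whose `domain` field includes `domain`."""
    return [repo_id for repo_id, meta in cards.items() if domain in meta.get("domain", [])]


# Illustrative metadata, as it might look after parsing each card's YAML block.
cards = {
    "example-org/historic-newspapers": {"domain": ["libraries-archives-museums"]},
    "example-org/survey-responses": {"domain": ["social-science"]},
    "example-org/untagged": {},
}

print(filter_by_domain(cards, "social-science"))
print(invalid_domains({"domain": ["social-science", "astrology"]}))
```

Because the list is closed, a tagging tool can reject unknown values up front, which is where the extra reliability (and rigidity) of this option comes from.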
Enable topic tags (user-generated)
Enable ‘free form’ topic tags for datasets and models. This would be closer to GitHub’s repository topics, which can be chosen from a controlled list (https://github.com/topics/) but can also be more user/org specific. This could also help organizations manage their own models and datasets as the number they hold in their org grows. For example, they may create ‘topic tags’ for a specific project, so it’s clearer which datasets/models are related to that project.
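With free-form tags, most of the work shifts to grouping whatever users typed. The sketch below shows one possible way to do that: normalize each tag and build an index from topic to repositories. The normalization rule, repository names, and function names are assumptions for illustration only.

```python
from collections import defaultdict


def normalize_topic(tag):
    """Lowercase a free-form tag and join its words with hyphens so near-duplicates collapse."""
    return "-".join(tag.strip().lower().split())


def build_topic_index(repo_tags):
    """Map each normalized topic to the set of repositories that carry it."""
    index = defaultdict(set)
    for repo_id, tags in repo_tags.items():
        for tag in tags:
            index[normalize_topic(tag)].add(repo_id)
    return index


# Made-up repositories and tags, e.g. an org labelling everything from one project.
repo_tags = {
    "example-org/dataset-a": ["Project Alpha", "historic newspapers"],
    "example-org/model-a": ["project alpha"],
    "example-org/dataset-b": ["Historic  Newspapers"],
}

index = build_topic_index(repo_tags)
print(sorted(index["project-alpha"]))
```

Unlike the controlled list, nothing here prevents synonyms such as "newspapers" and "historic newspapers" from fragmenting a topic, which is the trade-off of an open vocabulary.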
Collections
This solution would likely be the biggest shift and may require significant changes in the hub frontend. Collections could work in several different ways but would include:
Users can curate particular datasets, models, spaces, etc., into a collection. For example, they may create a collection of ‘historic newspapers suitable for training language models’. These collections would not be mutually exclusive, i.e. a dataset can belong to zero, one or many collections. Collections can also potentially be nested under other collections.
This is fairly common on other data repositories; for example, the following collections:
all belong under a higher level collection (https://bl.iro.bl.uk/collections/353c908d-b495-4413-b047-87236d2573e3?locale=en).
There are different models one could use for how these collections could be created:
- only within an org
- for any dataset/model
- the owner of a dataset/model has to agree to be added to a collection
- a collection owner can have people suggest additions to their collection
- other models…
These collections could be thematic, related to particular training approaches, curate models with particular inference properties etc. Whilst some of these features may duplicate current/or future tag filters on the hub, they offer the advantage of being flexible and not having to predict what users will want to do upfront.
There is also potential for automating the creation of these collections based on existing metadata. For example, one could collect the models trained on a collection of datasets: if we had a collection of ‘historic newspapers suitable for training language models’ containing 30 datasets, we could create another collection, ‘historic newspaper language models’, that includes any model on the hub whose metadata says it used one or more of those 30 datasets.
There is also the option of exploring ML approaches to suggest models/datasets that may be relevant to a particular collection.
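The derived-collection idea can be sketched in a few lines. This example assumes each model's metadata exposes the datasets it was trained on as a `datasets` list; the collection contents, model names, and the `derive_model_collection` helper are hypothetical.

```python
def derive_model_collection(dataset_collection, model_metadata):
    """Return the models whose metadata lists at least one dataset from the collection."""
    wanted = set(dataset_collection)
    return sorted(
        model_id
        for model_id, meta in model_metadata.items()
        if wanted & set(meta.get("datasets", []))
    )


# A made-up, hand-curated dataset collection.
historic_newspapers = [
    "example-org/newspapers-1800s",
    "example-org/newspapers-1900s",
]

# Made-up model metadata, as it might look after parsing each model card.
model_metadata = {
    "example-org/newspaper-lm": {"datasets": ["example-org/newspapers-1800s"]},
    "example-org/generic-lm": {"datasets": ["example-org/web-text"]},
    "example-org/no-metadata": {},
}

print(derive_model_collection(historic_newspapers, model_metadata))
```

A collection built this way is only as complete as the metadata: models that do not declare their training datasets would be silently missed.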
This approach is likely to be quite difficult to implement well and would require significant thought. There is also likely to be a benefit in doing quite a bit of upfront work in curating useful collections to demonstrate the benefits of collections.
Describe alternatives you’ve considered
It is possible to collate this information externally, i.e. one could link back to the relevant models/datasets from an external platform.
Additional context
I’m cc’ing others involved in the BigLAM hackathon who may also have thoughts @cakiki @clancyoftheoverflow @albertvillanova
Top GitHub Comments
Hi @davanstrien, if I remember correctly this was also discussed inside a hf.co Discussion. Would you be able to link it here too? (I suggested using tags: - foo - bar there, IIRC.) Thanks a ton!
Thanks for opening this issue @davanstrien.
As we discussed last week, the tag approach would in principle be the simplest to implement, either as a domain tag (closed vocabulary: more reliable but also more rigid) or as a topic tag (open vocabulary: more flexible for user needs).