Datasets
Description
In discussing with folks at the CCRMA MIR workshop, it seems like there’s a legitimate need for an easy way to access data for development / debugging purposes. I don’t think it makes sense to ship more than one or two tracks in the main librosa package itself, but I’m open to the idea of having some kind of dataset module akin to those found in sklearn, nltk, etc.
I suspect most people will agree that this would be useful in the abstract, but there are a lot of details to sort out before such a thing could be reasonably implemented. A few points to kick off the discussion:
- Should this even be a librosa function, e.g., librosa.data.install('gtzan')? Or should it live in an entirely different project (librosa_data)?
- Where should data be hosted? I could imagine things getting rather large. Would figshare be reasonable?
- Where should datasets be indexed? Internally to librosa, or should it have a remote index hosted on the web so that we can add new datasets out of cycle with librosa releases?
- What format should data be stored in? I’m thinking ogg or flac for audio. What about annotations? I’d obviously lean toward jams, but there are some floating issues around collection management that I think should be sorted out before we go down that road. This also adds considerable (but, I believe, in the long run reasonable) complexity overhead to getting directly at the raw data (via jams objects).
- How are collections accessed once installed? Unlike standard UCI machine learning datasets, MIR data does not have a nice, fixed-dimensional tabular form in general, so we can’t just have a pair of X/Y arrays. This issue might resolve itself if we go ahead with jams after implementing some collection-management utilities, but it’s worth considering in advance.
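To make the last two questions concrete, here is a minimal sketch of what a track-oriented collection could look like instead of a pair of X/Y arrays. Everything in it is an assumption for illustration only: the index layout, the `Collection` and `Track` names, and the `toy` dataset are hypothetical and are not part of librosa or any existing package. In a real setup the index could be fetched from a remote location so that datasets can be added out of cycle with releases; here it is inlined as JSON.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator, Optional

# Hypothetical index format (an assumption, not an existing standard).
# It maps dataset name -> track id -> relative file paths.
EXAMPLE_INDEX = json.loads("""
{
  "toy": {
    "tracks": {
      "t01": {"audio": "audio/t01.flac", "annotation": "annotations/t01.jams"},
      "t02": {"audio": "audio/t02.flac", "annotation": null}
    }
  }
}
""")


@dataclass(frozen=True)
class Track:
    """One item of a collection: an audio file plus an optional annotation."""
    track_id: str
    audio_path: Path
    annotation_path: Optional[Path]  # not every track has to be annotated


class Collection:
    """Track-oriented view of an installed dataset (hypothetical design)."""

    def __init__(self, name: str, data_home: Path, index: dict):
        self.name = name
        self.root = Path(data_home) / name
        self._tracks = index[name]["tracks"]

    def __len__(self) -> int:
        return len(self._tracks)

    def __iter__(self) -> Iterator[Track]:
        # Iterate in sorted id order so results are reproducible.
        for track_id in sorted(self._tracks):
            yield self.track(track_id)

    def track(self, track_id: str) -> Track:
        entry = self._tracks[track_id]
        annotation = entry.get("annotation")
        return Track(
            track_id=track_id,
            audio_path=self.root / entry["audio"],
            annotation_path=self.root / annotation if annotation else None,
        )


if __name__ == "__main__":
    collection = Collection("toy", Path("data_home"), EXAMPLE_INDEX)
    for item in collection:
        print(item.track_id, item.audio_path, item.annotation_path)
```

The sketch only resolves paths; it does not download or decode anything. Under this design, each `audio_path` could be handed to `librosa.load` and each `annotation_path` to whichever annotation reader is chosen, which keeps variable-length audio and annotations out of any fixed-dimensional array.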
Issue Analytics
- State:
- Created 5 years ago
- Reactions:7
- Comments:16 (11 by maintainers)
Top GitHub Comments
update: colleagues have started the mirdata project. There is no official release yet, but it seems mature enough already to be considered a solid candidate for us to close this issue, and defer to them for our data-intensive debugging/development/demonstration/education/reproducibility purposes.
https://github.com/mir-dataset-loaders/mirdata
note that mirdata has virtually zero dependencies, if we leave aside those needed for compatibility (six, future) and testing (pytest, testcontainers). It’s distributed on PyPI and the source code has a BSD-3 license. So although I don’t see the need to make mirdata a dependency of librosa, I would be totally in favor of extensively using mirdata in our demos and tutorials for how to use librosa beyond a single audio track (good old Kevin MacLeod …)
closed by https://github.com/mir-dataset-loaders/mirdata/pull/150
many thanks to @rabitt, @drubinstein, @magdalenafuentes, @andreasjansson, @keunwoochoi, and everyone who contributed dataset loaders.