Datasets

Description

In discussing with folks at the CCRMA MIR workshop, it seems like there’s a legitimate need for an easy way to access data for development / debugging purposes. I don’t think it makes sense to ship more than one or two tracks in the main librosa package itself, but I’m open to the idea of having some kind of dataset module akin to those found in sklearn, nltk, etc.

I suspect most people will agree that this would be useful in the abstract, but there are a lot of details to sort out before such a thing could be reasonably implemented. A few points to kick off the discussion:

Should this even be a librosa function? e.g., librosa.data.install('gtzan')? Or should it live in an entirely different project (librosa_data)?
Where should data be hosted? I could imagine things getting rather large. Would figshare be reasonable?
Where should datasets be indexed? Internally to librosa, or should it have a remote index hosted on the web so that we can add new datasets out of cycle with librosa releases?
What format should data be stored in? I’m thinking ogg or flac for audio. What about annotations? I’d obviously lean toward jams, but there are some floating issues around collection management that I think should be sorted out before we go down that road. This also adds considerable (but, I believe, reasonable in the long run) complexity overhead to getting directly at the raw data (via jams objects).
How are collections accessed once installed? Unlike standard UCI machine learning datasets, MIR data does not have a nice, fixed-dimensional tabular form in general, so we can’t just have a pair of X/Y arrays. This issue might resolve itself if we go ahead with jams after implementing some collection-management utilities, but it’s worth considering in advance.
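To make the indexing and access questions above a bit more concrete, here is a minimal sketch of one possible shape. Everything in it is hypothetical: the index layout, the `Track` and `Collection` classes, and `load_collection` are illustrative names, not part of librosa or any existing package. The index is inlined as a JSON string here, but the same structure could just as well be fetched from a remote location so that datasets can be added out of cycle with releases.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Iterator, Optional

# Hypothetical index. The file names below are made up for illustration.
INDEX_JSON = """
{
  "gtzan": {
    "version": "1.0",
    "tracks": {
      "blues.00000": {"audio": "audio/blues.00000.flac",
                      "annotation": "annotations/blues.00000.jams"},
      "jazz.00001": {"audio": "audio/jazz.00001.flac",
                     "annotation": null}
    }
  }
}
"""


@dataclass(frozen=True)
class Track:
    """One item of a collection: an audio file plus an optional annotation."""
    track_id: str
    audio_path: Path
    annotation_path: Optional[Path]


class Collection:
    """Iterable of Track records, instead of a fixed-size X/Y array pair."""

    def __init__(self, name: str, entry: dict, data_home: str) -> None:
        self.name = name
        self.version = entry["version"]
        root = Path(data_home) / name
        self._tracks: Dict[str, Track] = {}
        for track_id, files in entry["tracks"].items():
            annotation = files.get("annotation")
            self._tracks[track_id] = Track(
                track_id=track_id,
                audio_path=root / files["audio"],
                annotation_path=root / annotation if annotation else None,
            )

    def __len__(self) -> int:
        return len(self._tracks)

    def __iter__(self) -> Iterator[Track]:
        return iter(self._tracks.values())

    def __getitem__(self, track_id: str) -> Track:
        return self._tracks[track_id]


def load_collection(name: str, data_home: str,
                    index_json: str = INDEX_JSON) -> Collection:
    """Look up a dataset by name in the index and build its Collection."""
    index = json.loads(index_json)
    if name not in index:
        raise KeyError(f"unknown dataset: {name!r}")
    return Collection(name, index[name], data_home)


# Usage: iterate over tracks; nothing is read from disk at this point.
gtzan = load_collection("gtzan", data_home="librosa_data")
for track in gtzan:
    print(track.track_id, track.audio_path, track.annotation_path)
```

Keeping each track as a record of paths sidesteps the lack of a tabular form: audio and annotations are only decoded when a caller asks for them, and a track without annotations is still representable.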

(tag @justinsalamon @stefan-balke @jongwook)

Issue Analytics

Created 5 years ago
Reactions: 7
Comments: 16 (11 by maintainers)

Top GitHub Comments

2 reactions

lostanlen commented, Apr 29, 2019

Update: colleagues have started the mirdata project. There is no official release yet, but it already seems mature enough to be considered a solid candidate for us to close this issue, and defer to them for our data-intensive debugging/development/demonstration/education/reproducibility purposes.

https://github.com/mir-dataset-loaders/mirdata

Note that mirdata has virtually zero dependencies, if we leave aside those needed for compatibility (six, future) and testing (pytest, testcontainers). It’s distributed on PyPI and the source code has a BSD-3 license. So although I don’t see the need to make mirdata a dependency of librosa, I would be totally in favor of extensively using mirdata in our demos and tutorials for how to use librosa beyond a single audio track (good old Kevin MacLeod …)
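As a rough sketch of what such a tutorial snippet could look like: the function below assumes a recent mirdata release that exposes `mirdata.initialize` and ships a `gtzan_genre` loader, which may differ from the API available when this comment was written. The function name `summarize_genres` and its parameters are made up for illustration. The imports are done inside the function so that merely defining it does not require either package to be installed.

```python
def summarize_genres(data_home=None, max_tracks=5):
    """Return (track_id, genre, duration in seconds) for a few GTZAN tracks.

    Assumes mirdata and librosa are installed; names here are illustrative.
    """
    import librosa
    import mirdata

    dataset = mirdata.initialize("gtzan_genre", data_home=data_home)
    dataset.download()   # fetches audio and annotations (a large download)
    dataset.validate()   # compares local files against the index checksums

    results = []
    for track_id in dataset.track_ids[:max_tracks]:
        track = dataset.track(track_id)
        # sr=None keeps the file's native sampling rate.
        y, sr = librosa.load(track.audio_path, sr=None)
        duration = librosa.get_duration(y=y, sr=sr)
        results.append((track_id, track.genre, duration))
    return results
```

Calling `summarize_genres(max_tracks=3)` would trigger the download on first use, so it is best run once with a `data_home` that persists between sessions.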

1 reaction

lostanlen commented, Nov 13, 2019

Closed by https://github.com/mir-dataset-loaders/mirdata/pull/150. Many thanks to @rabitt, @drubinstein, @magdalenafuentes, @andreasjansson, @keunwoochoi, and everyone who contributed dataset loaders.
