Adding an optional token to the dataset fetcher code to allow optional fetching from private repositories

Describe the new feature or enhancement

The dataset fetching code in mne/datasets/utils.py and mne/utils/fetching.py is actually very general. I was hoping to leverage it without copy/pasting the code, so that I can benefit from possible upstream bug fixes and performance improvements (if they ever occur).

However, in some cases I would like to unit test against private data that I have stored on GitHub, which requires an API token with the HTTP request. Some of that data would eventually be made public, say after a publication, but in the meantime it is nice to build it into a CI for a private research project.

Is it possible to add an optional “token” into the dataset fetcher? This would also enable MNE to leverage private repos. In addition, it would lessen the code dependency for anyone trying to implement a data fetcher without copying every single function from MNE.

Describe your proposed implementation

Add an optional token=None kwarg to the following functions:

_download
_fetch_file
_get_http

Then one can easily add optional tokens in _data_path, depending on which dataset is being fetched. This would also enable any “mne” package, like mne-bids, mne-connectivity, etc., to leverage private GitHub repo data that might get passed in via GitHub Actions.
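To make the proposal concrete, here is a minimal sketch of what threading an optional token through to the HTTP layer could look like. It uses only the standard library and is not MNE's actual code: `build_request` and `fetch_bytes` are hypothetical names, and the `token <value>` header scheme is an assumption that holds for GitHub but may differ for other hosts.

```python
import urllib.request


def build_request(url, token=None):
    """Build a request, attaching an auth header only when a token is given."""
    request = urllib.request.Request(url)
    if token is not None:
        # Assumption: the server accepts "Authorization: token <value>".
        # Other services may expect "Bearer <value>" or a custom header.
        request.add_header("Authorization", f"token {token}")
    return request


def fetch_bytes(url, token=None, timeout=30.0):
    """Download ``url`` and return its raw bytes, forwarding the token."""
    request = build_request(url, token=token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Because `token` defaults to `None`, existing callers that fetch public data would not need to change.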

Describe possible alternatives

If we further refactor things so that key, urls, archive_names, folder_origs, folder_names, and md5_hashes are passed into _data_path rather than set inside _data_path, then to create an MNE fetcher one simply needs to define a data_path that passes these to _data_path. The downstream package then has a fully functional mne_downstream_package.testing.data_path() that fetches its own datasets for testing without having to rely on MNE-Python for data fetching.
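A rough sketch of this alternative, under stated assumptions: `generic_data_path` is a hypothetical stand-in for a parameterized `_data_path`, the URL and hash in the wrapper are placeholders, and archive extraction and config-key handling are left out for brevity.

```python
import hashlib
import urllib.request
from pathlib import Path


def _md5(path):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fid:
        for chunk in iter(lambda: fid.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def generic_data_path(url, archive_name, md5_hash, root, token=None):
    """Download ``url`` into ``root`` unless a verified copy already exists."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    target = root / archive_name
    if not (target.exists() and _md5(target) == md5_hash):
        request = urllib.request.Request(url)
        if token is not None:
            # Assumption: the host accepts a "token <value>" header.
            request.add_header("Authorization", f"token {token}")
        with urllib.request.urlopen(request) as resp, open(target, "wb") as out:
            out.write(resp.read())
        if _md5(target) != md5_hash:
            target.unlink()  # do not keep a corrupted download around
            raise RuntimeError(f"MD5 mismatch for {archive_name}")
    return target


def data_path(root, token=None):
    """Hypothetical downstream wrapper: only dataset-specific values live here."""
    return generic_data_path(
        url="https://example.com/private/testing-data.zip",  # placeholder URL
        archive_name="testing-data.zip",
        md5_hash="0" * 32,  # placeholder hash
        root=root,
        token=token,
    )
```

With this split, the downstream package only maintains the few dataset-specific values, while download, caching, and hash checking stay in one shared function.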

Additional Information

I think this also might be helpful in further cementing MNE-Python as a platform for developing neuroscience/clinical-neuroscience applications that sometimes might need data fetchers in their CI / testing pipeline for “private data”.

Ref: https://chanzuckerberg.com/eoss/proposals/improving-usability-of-core-neuroscience-analysis-tools-with-mne-python/

Issue Analytics

Created 2 years ago
Comments: 9 (9 by maintainers)

Top GitHub Comments

adam2392 commented, Sep 15, 2021

Maybe this is more of a curiosity, but do you actually store your data on GitHub? Looking at the LFS storage on GH, it’s $5 per 50 GB per month ($0.1/GB), as opposed to something like Backblaze at $0.005/GB. Or are you just talking about small testing datasets that fall under the 1GB limit?

Well, this can actually be any private URL that requires an API token to access. But yeah, I use GitHub for now :p for small testing datasets that fall under the 1GB limit, but that we can’t make public “yet”. And yeah, I basically have like 4 different versions of the current MNE fetcher code, but they all modify maybe like… 10 LOC, so that suggested to me that this would be a valuable refactoring in MNE.

It also seems that nowadays MNE is more and more of a “platform”, since it “enables” analysis and testing related to MEG/EEG/iEEG and then offloads analysis and more niche stuff to other mne packages, like mne-connectivity, mne-bids, etc. Part of this enablement, in my opinion, is making data fetching easier for CI/unit testing.

This is an interesting proposal @adam2392, I also have a private code repository for handling fetching my various datasets, so offloading some of that to MNE would be advantageous to me too. However, …

😃 I’m not alone

adam2392 commented, Sep 17, 2021

Copying over here for next 2 PRs:

restructure dataset info as nested dict of dicts; incorporate MD5s into Python and change pooch to temporarily write the registry file (separate PR) and whatever else is necessary to make it easier to add new built-in datasets
add generic dataset downloader function, that allows e.g., API keys (separate PR). This should probably look like a data_path() function that takes in URL, archive name, hash, etc, and does everything that one of our built-in data_path() functions does (except for checking/setting config keys)
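As an illustration of the first item, a nested dict of dicts could look roughly like the sketch below. Everything in it is hypothetical: the dataset names, URLs, hashes, the `token_env` field, and the `dataset_params` helper are placeholders for discussion, not what the follow-up PRs implement.

```python
import os

# Hypothetical registry: every name, URL, and hash below is a placeholder.
DATASETS = {
    "public_example": {
        "url": "https://example.com/public-data.tar.gz",
        "archive_name": "public-data.tar.gz",
        "folder_name": "public-data",
        "md5_hash": "0" * 32,
        "token_env": None,  # no authentication needed
    },
    "private_example": {
        "url": "https://example.com/private-data.tar.gz",
        "archive_name": "private-data.tar.gz",
        "folder_name": "private-data",
        "md5_hash": "f" * 32,
        "token_env": "PRIVATE_DATA_TOKEN",  # env var that holds the API key
    },
}


def dataset_params(name, environ=None):
    """Return download parameters for ``name``, resolving its token if any."""
    environ = os.environ if environ is None else environ
    params = dict(DATASETS[name])  # copy so the registry stays untouched
    token_env = params.pop("token_env")
    params["token"] = environ.get(token_env) if token_env else None
    return params
```

Reading the key from an environment variable keeps the secret out of the code, which fits a CI setup where the token is injected as a secret. The returned dict could then be passed to whatever generic downloader ends up being added.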

Ref link: https://github.com/mne-tools/mne-python/pull/9742#issuecomment-921932044