Adding an optional token to the dataset fetcher code to allow optional fetching from private repositories

Describe the new feature or enhancement

The dataset fetching code in mne/datasets/utils.py and mne/utils/fetching.py is actually very general. I was hoping to leverage it without copy/pasting the code, so that I can benefit from any upstream bug fixes or performance improvements (if they ever occur).

However, in some cases I would like to unit test against private data stored on GitHub, which requires an API token in the HTTP request. Some of that data would eventually be made public, say after a publication, but in the meantime it would be nice to build it into the CI of a private research project.

Is it possible to add an optional “token” into the dataset fetcher? This would also enable MNE to leverage private repos. In addition, it would lessen the code dependency for anyone trying to implement a data fetcher without copying every single function from MNE.

Describe your proposed implementation

Add optional token=None kwarg to the following functions:

  1. _download
  2. _fetch_file
  3. _get_http

Then one can easily add optional tokens in _data_path, depending on which dataset is being fetched. This would also enable any “mne” package, like mne-bids/connectivity/etc. to leverage private Github repo data that might get passed in via GH actions.
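As a rough illustration of the proposal above, here is a minimal sketch of threading an optional token through a fetch helper. It uses only the standard library and does not reproduce MNE's private functions: `build_request` and `fetch_file` are hypothetical names, and the `Bearer` header scheme is an assumption that may need adjusting for a given host.

```python
import shutil
import urllib.request
from typing import Optional


def build_request(url: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a request, adding an Authorization header only when a token is given.

    Hypothetical helper; the "Bearer" scheme is an assumption about the host.
    """
    request = urllib.request.Request(url)
    if token is not None:
        request.add_header("Authorization", f"Bearer {token}")
    return request


def fetch_file(url: str, destination: str, token: Optional[str] = None,
               timeout: float = 30.0) -> None:
    """Download `url` to `destination`, optionally authenticating with `token`."""
    request = build_request(url, token=token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        with open(destination, "wb") as out_file:
            # Stream the response body to disk instead of loading it in memory.
            shutil.copyfileobj(response, out_file)
```

With `token=None` the request is identical to an unauthenticated one, so existing public datasets would be unaffected. In CI, the token would typically be read from an environment variable rather than hard-coded.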

Describe possible alternatives

If we refactor further, so that key, urls, archive_names, folder_origs, folder_names, and md5_hashes are passed into _data_path rather than set inside it, then creating an MNE-style fetcher only requires defining a data_path that passes these values to _data_path. A downstream package would then have a fully functional mne_downstream_package.testing.data_path() that fetches its own datasets for testing without having to rely on MNE-Python for data fetching.
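To make this alternative concrete, here is a simplified sketch of a fetcher whose dataset details are passed in rather than hard-coded. Everything in it is hypothetical (`DatasetSpec`, `generic_data_path`, the example URL and token); it is not MNE's `_data_path`, and it skips archive extraction and config keys, returning only the path to a hash-verified archive.

```python
import hashlib
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional


@dataclass(frozen=True)
class DatasetSpec:
    """Everything the generic fetcher needs to know about one dataset (hypothetical)."""
    key: str
    url: str
    archive_name: str
    md5_hash: str
    token: Optional[str] = None


# A fetcher receives (url, destination, token) and writes the file to destination.
Fetcher = Callable[[str, Path, Optional[str]], None]


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def generic_data_path(spec: DatasetSpec, root: Path, fetch: Fetcher) -> Path:
    """Fetch the archive if it is missing, verify its hash, and return its path."""
    target_dir = Path(root) / spec.key
    target_dir.mkdir(parents=True, exist_ok=True)
    archive = target_dir / spec.archive_name
    if not archive.exists():
        fetch(spec.url, archive, spec.token)
    if md5_of(archive) != spec.md5_hash:
        archive.unlink()  # do not keep a corrupted or unexpected file around
        raise ValueError(f"MD5 mismatch for dataset {spec.key!r}")
    return archive


# Usage with a stand-in fetcher that writes fixed bytes instead of downloading.
def fake_fetch(url: str, destination: Path, token: Optional[str]) -> None:
    destination.write_bytes(b"example payload")


spec = DatasetSpec(
    key="my_private_testing",
    url="https://example.com/private/testing.zip",  # placeholder URL
    archive_name="testing.zip",
    md5_hash=hashlib.md5(b"example payload").hexdigest(),
    token="not-a-real-token",
)
with tempfile.TemporaryDirectory() as tmp:
    archive = generic_data_path(spec, Path(tmp), fake_fetch)
    archive_existed = archive.exists()
    archive_name = archive.name
```

Passing the fetcher as a callable keeps the download mechanism swappable, so a downstream package could plug in an authenticated downloader without touching the caching and verification logic.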

Additional Information

I think this also might be helpful in further cementing MNE-Python as a platform for developing neuroscience/clinical-neuroscience applications that sometimes might need data fetchers in their CI / testing pipeline for “private data”.

Ref: https://chanzuckerberg.com/eoss/proposals/improving-usability-of-core-neuroscience-analysis-tools-with-mne-python/

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

1 reaction
adam2392 commented, Sep 15, 2021

Maybe this is more of a curiosity, but do you actually store your data on GitHub? Looking at the LFS storage on GH, it's $5 per 50 GB per month ($0.1/GB), as opposed to something like Backblaze at $0.005/GB. Or are you just talking about small testing datasets that fall under the 1GB limit?

Well, this could then be any private URL that requires an API token to access. But yeah, I use GitHub for now :p for small testing datasets that fall under the 1GB limit but that we can’t make public “yet”. And yeah, I basically have like 4 different versions of the current MNE fetcher code, but they each modify maybe like… 10 LOC, which suggested to me that this would be a valuable refactoring in MNE.

It seems nowadays as well that MNE is more and more of a “platform”, since it “enables” analysis and testing related to MEG/EEG/iEEG and then offloads analysis and more niche stuff to other mne packages, like mne-connectivity, mne-bids, etc. Part of this enablement in my opinion is making data fetching easier for CI/unittesting.

This is an interesting proposal @adam2392, I also have a private code repository for handling fetching my various datasets, so offloading some of that to MNE would be advantageous to me too. However, …

😃 I’m not alone

0 reactions
adam2392 commented, Sep 17, 2021

Copying over here for next 2 PRs:

  1. restructure dataset info as nested dict of dicts; incorporate MD5s into python and change pooch to temporarily write the registry file (separate PR) and whatever else is necessary to make it easier to add new built-in datasets
  2. add generic dataset downloader function, that allows e.g., API keys (separate PR). This should probably look like a data_path() function that takes in URL, archive name, hash, etc, and does everything that one of our built-in data_path() functions does (except for checking/setting config keys)
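One possible shape for the "nested dict of dicts" in item 1 is sketched below. The dataset names, URLs, and hashes are placeholders, and `dataset_params` is a hypothetical helper, not an existing MNE function; the idea is only that adding a dataset becomes a matter of adding one entry.

```python
from typing import Any, Dict

# Hypothetical registry: one inner dict per dataset, keyed by dataset name.
# All URLs and hashes below are placeholders, not real values.
DATASETS: Dict[str, Dict[str, Any]] = {
    "sample_public": {
        "url": "https://example.com/sample_public.tar.gz",
        "archive_name": "sample_public.tar.gz",
        "hash": "md5:00000000000000000000000000000000",
        "folder_name": "sample-public",
    },
    "sample_private": {
        "url": "https://example.com/private/sample_private.tar.gz",
        "archive_name": "sample_private.tar.gz",
        "hash": "md5:11111111111111111111111111111111",
        "folder_name": "sample-private",
    },
}


def dataset_params(name: str,
                   datasets: Dict[str, Dict[str, Any]] = DATASETS) -> Dict[str, Any]:
    """Return a copy of one dataset's parameters, or raise for unknown names."""
    try:
        # Copy so callers cannot mutate the shared registry by accident.
        return dict(datasets[name])
    except KeyError:
        known = ", ".join(sorted(datasets))
        raise ValueError(f"Unknown dataset {name!r}; known datasets: {known}") from None
```

The returned dict could then be unpacked into a generic downloader of the kind described in item 2, for example `downloader(**dataset_params("sample_private"))`, where `downloader` is whatever function ends up accepting the URL, archive name, and hash.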

Ref link: https://github.com/mne-tools/mne-python/pull/9742#issuecomment-921932044
