
Cluster-based stats improvements - potential GSoC (and future)


Mirroring the FieldTrip API

Eric (@larsoner) suggested mirroring FieldTrip's cluster-based API in MNE-Python:

Mikolaj, perhaps instead of making a new API proposal, you could try Python-izing the FieldTrip API. No need to reinvent the wheel here if theirs is complete / extensible enough already, and it would have the added benefit of making user code porting simpler.

That makes sense, although I would prefer to change a few things, notably design, ivar and uvar, which are quite confusing for newcomers. I can start by showing how an almost exact copy of FieldTrip's API would look (I omit method='montecarlo' because for now we are considering mirroring only the cluster-based API - or maybe not, as @jona-sassenhagen suggests):

# within-subjects t-test
import numpy as np
from mne.stats import ttest_rel_no_p, cluster_based_statistic

conditions = ['condA', 'condB']

erp_list = list()
for epochs in epochs_list:  # one Epochs object per subject
    erp_list += [epochs[cond].average() for cond in conditions]

# in fieldtrip, because all the files/trials are passed in a cell array (or
# just varargin), you need additional info specifying the design:
n_subjects = len(epochs_list)
cond_codes = np.array([0, 1] * n_subjects)
subjects = np.array([[i, i] for i in range(n_subjects)]).ravel()
design = np.stack([subjects, cond_codes], axis=0)

# later in the function, `uvar` denotes the row containing 'units of observation'
# (here these are single-subject identifiers), while `ivar` denotes levels of the
# independent variable. Additionally we have the adjacency structure in the
# `adjacency` variable.

cluster_based_statistic(erp_list, statistic=ttest_rel_no_p, alpha=0.025,
                        cluster_alpha=0.05, cluster_statistic='maxsum',
                        min_nb_chan=2, neighbours=adjacency, tail=0,
                        cluster_tail=0, n_permutations=1000, design=design,
                        uvar=0, ivar=1)
# alpha is set to 0.025 above to correct for the two-tailed test; fieldtrip also
# allows setting another parameter to have two-tailed cluster p-values returned
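For concreteness, the design-matrix construction in the sketch above can be run standalone; here is a minimal, runnable version (n_subjects=4 is a made-up value, and np.repeat is equivalent to the list-comprehension-plus-ravel form):

```python
import numpy as np

n_subjects = 4  # placeholder value for illustration
cond_codes = np.array([0, 1] * n_subjects)      # ivar row: condition labels
subjects = np.repeat(np.arange(n_subjects), 2)  # uvar row: subject identifiers
design = np.stack([subjects, cond_codes], axis=0)

# design has shape (2, 2 * n_subjects):
# row 0 (uvar=0) tells which subject each observation came from,
# row 1 (ivar=1) tells which condition each observation belongs to
```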

I like FieldTrip's cluster-based API in general, but not the design, ivar, uvar approach - that would require some further thought. Passing a real design matrix might be OK to add later for more advanced users, but the basic API should be much simpler. A rough idea I have:

erp_list = {cond: [epochs[cond].average() for epochs in epochs_list]
            for cond in cond_list}

cluster_based_statistic(erp_list, statistic=ttest_rel_no_p,
                        design='condA - condB', neighbours=adjacency)
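A sketch of how such a design string could be resolved against the dictionary of evokeds; parse_contrast and the two-condition restriction are assumptions for illustration, not an existing MNE function:

```python
def parse_contrast(design, erp_dict):
    """Split a spec like 'condA - condB' into the two lists of evokeds to contrast.

    Hypothetical helper: only simple two-condition subtractions are handled here.
    """
    left, right = (name.strip() for name in design.split('-'))
    if left not in erp_dict or right not in erp_dict:
        raise ValueError(f"unknown condition in design spec: {design!r}")
    return erp_dict[left], erp_dict[right]

# usage with plain placeholder values standing in for Evoked objects:
erp_dict = {'condA': [1.0, 2.0], 'condB': [0.5, 1.5]}
a, b = parse_contrast('condA - condB', erp_dict)
```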

and then in ANOVA cases, the rm_anova API could be mirrored with a factor_levels argument (or levels, to also accommodate linear regression), and effects could be passed to design (which could be called effects instead of design, by the way).

Additionally, times or freqs (when dealing with TFRs or spectra) could be passed to crop the search space (this would be similar to FieldTrip's API). The next step would be to return some kind of object that holds the results together and allows selecting clusters and plotting/inspecting results. But that's another story (I sketched an API for such an object some time ago but couldn't dig it up now).

So, overall, the FieldTrip API is not so different from what we have in MNE, but the following things are missing:

  • support for mne objects and simple interface for at least dependent and independent t-tests
  • (then adding anova and regression but that could all be followups)
  • support for 3d data (channel x frequency x time)
  • min_nb_chan filtering (other forms would be nice too, but as far as I remember they are not supported in fieldtrip)
  • separating tail and cluster_tail (minor detail, not that useful in most cases)
  • step_down is nice in MNE and I think it should be included in the top-level cluster API
  • some rudimentary visualization for cluster-based results
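To illustrate the min_nb_chan point, here is a hedged sketch of FieldTrip-style pruning of a suprathreshold mask: a channel/time point survives only if at least min_nb_chan neighbouring channels are also suprathreshold at that time point. FieldTrip applies this check within its clustering loop; this single pass only shows the idea, and the toy adjacency matrix and mask are made up:

```python
import numpy as np

def prune_min_nb_chan(mask, adjacency, min_nb_chan=2):
    """mask: bool (n_channels, n_times); adjacency: bool (n_channels, n_channels)."""
    # count suprathreshold neighbours of each channel at each time point
    neighbour_counts = adjacency.astype(int) @ mask.astype(int)
    return mask & (neighbour_counts >= min_nb_chan)

# toy example: 3 channels in a chain (0-1, 1-2), 2 time points
adjacency = np.array([[0, 1, 0],
                      [1, 0, 1],
                      [0, 1, 0]], dtype=bool)
mask = np.array([[True, True],
                 [True, False],
                 [True, True]])
pruned = prune_min_nb_chan(mask, adjacency, min_nb_chan=1)
# at time point 1, every suprathreshold channel lacks a suprathreshold
# neighbour, so the whole column is pruned away
```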

Other

I also attach the more general points I sketched in the first mail (those not covered above):

  • expose a public clustering function (one that works for 1d, 2d and 3d data) - so that users can build their own, more exotic cluster-based approaches more easily (currently all clustering functions are private, IIRC)
  • return an object that holds standard cluster-based results with a simple, easy API (currently 4 variables are returned), along with save and visualization functionality (ideally, long-term)
  • ability to read and write FieldTrip-format cluster-based results (conventionally called stat in FieldTrip)
  • new cluster-based functions: cluster-based regression; using permutation tests within the cluster-based framework (this requires pre-computing the permutation distribution for the 'smaller' permutation tests that are used within it)
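As a sketch of what a public, dimension-agnostic clustering function could look like, scipy.ndimage.label already handles 1d, 2d and 3d masks (lattice adjacency only - no channel-neighbour structure - so this is an illustration of the idea, not the proposed implementation):

```python
import numpy as np
from scipy import ndimage

def find_clusters(stat_map, threshold):
    """Label contiguous suprathreshold regions; works for 1d, 2d and 3d arrays."""
    mask = stat_map > threshold
    labels, n_clusters = ndimage.label(mask)
    # sum of the statistic within each cluster ('maxsum'-style cluster statistic)
    sums = ndimage.sum_labels(stat_map, labels,
                              index=np.arange(1, n_clusters + 1))
    return labels, sums

stat_1d = np.array([0.1, 2.5, 3.0, 0.2, 2.2, 0.0])
labels, sums = find_clusters(stat_1d, threshold=2.0)
# two clusters: {2.5, 3.0} and {2.2}; the same call works on 2d/3d maps
```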

Other points gathered from the discussion below:

  • allow for data to be passed as a dictionary:
    erp_list = {cond: [epochs[cond].average() for epochs in epochs_list] for cond in cond_list}
    
  • define a relatively high level intermediate API that works with only arrays and some sort of high level dimension specification
  • use Patsy for the design matrices.
  • consider using HED tags
  • prefer an API that is agnostic to what method you use for correction for multiple tests:
    mass_univariate(erp_list, method=('fdr' | 'cluster' | ...), ...)
    
  • support regression from the start
  • show how to do baseline contrasts (#608)
  • allow threshold='tfce'? (#5534)
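The correction-agnostic mass_univariate idea could be sketched roughly as follows; the function name comes from the proposal above, but the signature (taking precomputed p-values for brevity) and the plain-numpy Benjamini-Hochberg step are assumptions:

```python
import numpy as np

def fdr_bh(p_values, alpha=0.05):
    """Benjamini-Hochberg FDR: return a boolean rejection mask."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    # BH: find the largest rank k with p(k) <= alpha * k / m
    below = p[order] * m / (np.arange(m) + 1) <= alpha
    reject = np.zeros(m, dtype=bool)
    if below.any():
        last = np.max(np.nonzero(below))
        reject[order[:last + 1]] = True
    return reject

def mass_univariate(p_values, method='fdr', **kwargs):
    """Hypothetical dispatcher, agnostic to the multiple-comparisons method."""
    if method == 'fdr':
        return fdr_bh(p_values, **kwargs)
    raise NotImplementedError(f"method={method!r}")  # 'cluster', ...

rejected = mass_univariate([0.001, 0.008, 0.039, 0.041, 0.6], method='fdr')
```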

Issue Analytics

  • State: open
  • Created: 6 years ago
  • Reactions: 4
  • Comments: 18 (18 by maintainers)

Top GitHub Comments

larsoner commented on Apr 5, 2018 (2 reactions)

No, I don’t want to start with multidimensional neuroimaging data. Let’s make fake data about subject reaction time or something else that is truly univariate. If we want it to be like MEG data, let’s pretend that the data are peak amplitude or latency or something. But we should use synthetic data for this.
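In that spirit, a minimal sketch with synthetic, truly univariate data (made-up per-subject reaction times) and a simple sign-flip permutation test on the paired differences:

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects = 20
rt_a = rng.normal(400, 30, n_subjects)        # fake reaction times (ms), condition A
rt_b = rt_a + rng.normal(20, 10, n_subjects)  # condition B is ~20 ms slower

diff = rt_b - rt_a
observed = diff.mean()

# null distribution under exchangeability: randomly flip the sign of each
# subject's difference and recompute the mean
n_perm = 5000
signs = rng.choice([-1, 1], size=(n_perm, n_subjects))
null = (signs * diff).mean(axis=1)

# two-tailed permutation p-value (with the +1 correction)
p_value = (np.sum(np.abs(null) >= abs(observed)) + 1) / (n_perm + 1)
```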

jona-sassenhagen commented on Mar 10, 2018 (1 reaction)

If we have an object with cluster-based results, I will write methods for it 😃
