Target suffix loading: use of `contains` may pick multiple targets
Issue description
Current behavior
The current implementation of target suffix loading involves the use of the contains()
string method to capture target files as shown in the below lines:
This means that
- if your derivatives contain multiple filenames with overlap (e.g. `_lesion-manual.nii.gz` and `_lesion-manual2.nii.gz`), and
- if you use a `"target_suffix"` containing the overlap (e.g. `_lesion-manual`),
then the current loading process picks all of the filenames with the given overlap as potential targets.
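The effect can be reproduced without ivadomed. The sketch below assumes the filter amounts to a regular-expression search for the `|`-joined suffixes, which is how pandas `str.contains` treats its pattern by default. The filenames and the `select_targets` helper are hypothetical and only illustrate the matching behavior.

```python
import re

def select_targets(filenames, target_suffix):
    """Mimic a contains()-style filter: keep names where any suffix occurs anywhere."""
    pattern = "|".join(target_suffix)  # several suffixes become one regex alternation
    return [name for name in filenames if re.search(pattern, name)]

# Hypothetical derivative filenames for a single subject.
derivatives = [
    "sub-01_T1w_lesion-manual.nii.gz",
    "sub-01_T1w_lesion-manual2.nii.gz",
]

selected = select_targets(derivatives, ["_lesion-manual"])
print(selected)  # both files are selected, not only the first one
```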
This is problematic because the user might (reasonably) think that only `_lesion-manual.nii.gz` will be used as the ground-truth when the `"target_suffix"` is specified as `_lesion-manual`, whereas in reality both `_lesion-manual.nii.gz` and `_lesion-manual2.nii.gz` are used.
I didn’t go deeper in the codebase yet to see what happens when multiple targets are picked during the loading process, but I can confirm that the GT that ends up being used changes from run to run based on my experiments with `ivadomed --test`.
Expected behavior
I would expect the `"target_suffix"` to exactly match the target filename. Two options we have are:
- Changing line 128 as shown above to something like
`& df_next['filename'].str.split(os.extsep).apply(lambda x: x[0]).str.endswith('|'.join(self.target_suffix)))]`
which enforces an exact match of the target suffix.
- In case the use of `contains()` is justified in some scenarios and it's implemented that way for a specific reason, the user can instead give a full `"target_suffix"` including the file extension, such as `_lesion-manual.nii.gz`. I have tested this and can confirm it solves the problem.
The latter option is notably a lot easier to implement as it doesn’t require a change in the codebase. However, I think this is an issue many people might face without knowing it in the future.
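Both options can be sketched in plain Python. This is an illustration under assumptions, not the ivadomed implementation: the filenames and the `exact_suffix_match` helper are hypothetical, and the names are assumed to be bare filenames with no dots before the extension. Note that `endswith` compares literally, so with more than one suffix a `|`-joined string would not act as an alternation; the sketch passes a tuple of suffixes instead.

```python
import os

def exact_suffix_match(filename, target_suffix):
    """True if the name, cut at the first extension separator, ends with a suffix."""
    stem = filename.split(os.extsep)[0]  # drops ".nii.gz" as a whole
    return stem.endswith(tuple(target_suffix))  # str.endswith accepts a tuple

# Hypothetical derivative filenames for a single subject.
derivatives = [
    "sub-01_T1w_lesion-manual.nii.gz",
    "sub-01_T1w_lesion-manual2.nii.gz",
]

# Option 1: exact match on the suffix, ignoring the extension.
matches = [f for f in derivatives if exact_suffix_match(f, ["_lesion-manual"])]

# Option 2: keep the substring test but include the extension in the suffix.
full_suffix = "_lesion-manual.nii.gz"
workaround = [f for f in derivatives if full_suffix in f]

print(matches)
print(workaround)
```

With these inputs, each option keeps only the `_lesion-manual.nii.gz` file.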
Steps to reproduce
- Download the `basel-mp2rage` dataset as shown here.
- Run preprocessing on the data as shown here.
- Insert `print(bids_df.get_deriv_fnames())` after line 370 in `ivadomed/main.py` which initializes the BIDS data frame.
- Run the following config file with `ivadomed --train`.
- Take a look at the output of the print statement from step 3.
However, this is specific to one dataset, and it is understandable that not everybody will be willing to run preprocessing etc. on this particular dataset. In that case, I would like to point out that the issue is reproducible with any dataset that has multiple annotations per subject whose target suffixes overlap.
Feel free to take it on @mariehbourget! I only created a branch so it’s OK.
While debugging for another issue #1096 from an external user, I came across this exact problem. In this case, we have the following target_suffix (coming from the segmentation in ADS):
_seg-axon
_seg-axonmyelin
_seg-myelin
With `_seg-axon` in the config file, both `_seg-axon` and `_seg-axonmyelin` are picked up in the indexation and one or the other is used for training, which is not good at all. I’ll inform the user for now and tag this issue as high priority.
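One way to surface this situation early is to check, per subject, whether a configured suffix selects more than one derivative. The following is a minimal sketch under assumptions: the filenames are hypothetical (modelled on the suffixes above), subject IDs are assumed to be the first `_`-separated field, and `find_ambiguous` is an illustrative helper, not part of ivadomed.

```python
from collections import defaultdict

def find_ambiguous(filenames, suffix):
    """Group substring matches by subject and keep subjects with more than one hit."""
    hits = defaultdict(list)
    for name in filenames:
        if suffix in name:  # substring test, like a contains()-style filter
            subject = name.split("_")[0]  # assumes names start with "sub-XX_"
            hits[subject].append(name)
    return {sub: names for sub, names in hits.items() if len(names) > 1}

# Hypothetical derivative filenames for a single subject.
derivatives = [
    "sub-01_sample-1_SEM_seg-axon.png",
    "sub-01_sample-1_SEM_seg-axonmyelin.png",
    "sub-01_sample-1_SEM_seg-myelin.png",
]

print(find_ambiguous(derivatives, "_seg-axon"))    # sub-01 has two candidate files
print(find_ambiguous(derivatives, "_seg-myelin"))  # no ambiguity for this suffix
```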