Inconsistency in F1 metric between manual eval and Trainer.test() run
🐛 Bug
When training a multi-label image classifier as described in the docs (original link: https://lightning-flash.readthedocs.io/en/latest/reference/multi_label_classification.html),
import os.path as osp
from typing import List, Tuple
import pandas as pd
from torchmetrics import F1
import flash
from flash.core.classification import Labels
from flash.core.data.utils import download_data
from flash.image import ImageClassificationData, ImageClassifier
from flash.image.classification.data import ImageClassificationPreprocess
# 1. Download the data
# This is a subset of the movie poster genre prediction data set from the paper
# “Movie Genre Classification based on Poster Images with Deep Neural Networks” by Wei-Ta Chu and Hung-Jui Guo.
# Please consider citing their paper if you use it. More here: https://www.cs.ccu.edu.tw/~wtchu/projects/MoviePoster/
download_data("https://pl-flash-data.s3.amazonaws.com/movie_posters.zip", "data/")
# 2. Load the data
genres = ["Action", "Romance", "Crime", "Thriller", "Adventure"]
def load_data(data: str, root: str = 'data/movie_posters') -> Tuple[List[str], List[List[int]]]:
    metadata = pd.read_csv(osp.join(root, data, "metadata.csv"))
    return ([osp.join(root, data, row['Id'] + ".jpg") for _, row in metadata.iterrows()],
            [[int(row[genre]) for genre in genres] for _, row in metadata.iterrows()])
train_files, train_targets = load_data('train')
test_files, test_targets = load_data('test')
datamodule = ImageClassificationData.from_files(
    train_files=train_files,
    train_targets=train_targets,
    test_files=test_files,
    test_targets=test_targets,
    val_split=0.1,  # Use 10% of the training set as the validation set.
    image_size=(128, 128),
)
# 3. Build the model
model = ImageClassifier(
    backbone="resnet18",
    num_classes=len(genres),
    multi_label=True,
    metrics=F1(num_classes=len(genres)),
)
# 4. Create the trainer. Train for 10 epochs.
trainer = flash.Trainer(max_epochs=10)
# 5. Train the model
trainer.finetune(model, datamodule=datamodule, strategy="freeze")
# 6. Predict what's on a few images!
# Serialize predictions as labels, low threshold to see more predictions.
model.serializer = Labels(genres, multi_label=True, threshold=0.25)
predictions = model.predict([
    "data/movie_posters/predict/tt0085318.jpg",
    "data/movie_posters/predict/tt0089461.jpg",
    "data/movie_posters/predict/tt0097179.jpg",
])
print(predictions)
# 7. Save it!
trainer.save_checkpoint("image_classification_multi_label_model.pt")
I get different F1 metrics for the test set depending on how I run the evaluation:
# Run test with trainer:
trainer.test(model, datamodule=datamodule)
# stdout:
# {'test_binary_cross_entropy_with_logits': 0.5449734330177307,
# 'test_f1': 0.46086955070495605}
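# Run test manually:
import torch
from flash.core.data.data_source import DefaultDataKeys

metric = F1(num_classes=len(genres))
for batch in datamodule.test_dataloader():
    image_tensor = batch[DefaultDataKeys.INPUT]
    target = batch[DefaultDataKeys.TARGET]
    # Forward pass without gradient tracking.
    with torch.no_grad():
        y_hat = model(image_tensor)
    # Convert raw outputs to the format the metric expects, then accumulate.
    prediction = model.to_metrics_format(y_hat)
    metric(prediction, target)
# F1 over all accumulated test batches.
print(metric.compute())
# stdout:
# tensor(0.3891)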
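When two evaluation paths disagree, one thing worth ruling out is how per-batch results are aggregated. The following is a minimal pure-Python sketch, not torchmetrics or Flash code; the helper names `counts` and `f1_from_counts` and the toy batches are hypothetical. It shows that micro-averaged F1 computed from counts accumulated over all batches can differ from the mean of per-batch F1 values. It is not claimed that this is the cause of the discrepancy reported here.

```python
from typing import List, Tuple

Rows = List[List[int]]  # one row per sample, one 0/1 entry per label


def counts(preds: Rows, targets: Rows) -> Tuple[int, int, int]:
    """Count true positives, false positives and false negatives over all labels."""
    tp = fp = fn = 0
    for p_row, t_row in zip(preds, targets):
        for p, t in zip(p_row, t_row):
            tp += int(p == 1 and t == 1)
            fp += int(p == 1 and t == 0)
            fn += int(p == 0 and t == 1)
    return tp, fp, fn


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Micro F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Two toy batches of (predictions, targets) with two labels each.
batches = [
    ([[1, 0], [1, 1]], [[1, 0], [1, 1]]),  # every label predicted correctly
    ([[1, 1], [0, 0]], [[0, 0], [1, 1]]),  # every positive label missed
]

# Option 1: accumulate counts across batches, then compute F1 once.
total_tp = total_fp = total_fn = 0
for preds, targets in batches:
    tp, fp, fn = counts(preds, targets)
    total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
accumulated_f1 = f1_from_counts(total_tp, total_fp, total_fn)

# Option 2: compute F1 per batch, then average the per-batch values.
per_batch = [f1_from_counts(*counts(p, t)) for p, t in batches]
mean_batch_f1 = sum(per_batch) / len(per_batch)

print(accumulated_f1, mean_batch_f1)
```

If both evaluation paths accumulate state in the same way, this source of mismatch can be excluded and attention can move to the inputs each path feeds to the model.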
To Reproduce
Steps to reproduce the behavior:
- Copy paste the example training code from the link above
- Add the test evaluation code above
- Save and run the script
- See error
Expected behavior
The two F1 metrics should be identical.
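This expectation only holds if both paths feed the model identically prepared inputs. As a toy illustration in pure Python (the `normalize` transform, `toy_model` and the data are hypothetical and are not Flash's actual pipeline), the sketch below scores the same model on the same samples twice: once with a preprocessing step applied before the forward pass and once without it.

```python
from typing import List


def normalize(x: List[float], mean: float = 0.5, std: float = 0.25) -> List[float]:
    """Hypothetical per-sample transform that one evaluation path applies."""
    return [(v - mean) / std for v in x]


def toy_model(x: List[float]) -> List[int]:
    """Hypothetical 'model': predicts label i whenever feature i is positive."""
    return [1 if v > 0 else 0 for v in x]


def micro_f1(preds: List[List[int]], targets: List[List[int]]) -> float:
    """Micro F1 over all samples and labels: 2*TP / (2*TP + FP + FN)."""
    tp = fp = fn = 0
    for p_row, t_row in zip(preds, targets):
        for p, t in zip(p_row, t_row):
            tp += int(p == 1 and t == 1)
            fp += int(p == 1 and t == 0)
            fn += int(p == 0 and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


inputs = [[0.9, 0.2], [0.4, 0.8], [0.1, 0.3]]
targets = [[1, 0], [0, 1], [0, 0]]

# Path A: the transform runs before the model sees the batch.
f1_with_transform = micro_f1([toy_model(normalize(x)) for x in inputs], targets)

# Path B: raw inputs go straight to the model.
f1_without_transform = micro_f1([toy_model(x) for x in inputs], targets)

print(f1_with_transform, f1_without_transform)
```

Comparing the tensors that actually reach the model in each path is therefore a reasonable first diagnostic before suspecting the metric itself.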
Environment
- PyTorch Version: 1.8.0
- PyTorch-Lightning: 1.3.5
- Lightning-Flash: 0.3.2
- Torchmetrics: 0.3.2
- OS (e.g., Linux): macOS
- How you installed PyTorch (conda, pip, source): pip
- Python version: 3.8.8
- CUDA/cuDNN version: N/A
- GPU models and configuration: None
- Any other relevant information: None
Additional context
None
Issue Analytics
- State:
- Created 2 years ago
- Comments: 11 (10 by maintainers)

IMO, if this were bare-bones Lightning, I would expect a batch to look the same whether it is produced inside the Lightning trainer or outside it, because Lightning is “just reorganised PyTorch code”. Whether the same should hold in Flash, I am not completely sure; it depends entirely on Flash's design philosophy. Since Flash sits at a higher level of abstraction than Lightning, I am fine with this not being supported. @ethanwharris, is it possible to extract the data pipeline such that it would be possible to do something like
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.