Inconsistency in F1 metric between manual eval and Trainer.test() run

🐛 Bug

When training a multi-label image classifier as described in the docs (https://lightning-flash.readthedocs.io/en/latest/reference/multi_label_classification.html) with the following script:

import os.path as osp
from typing import List, Tuple

import pandas as pd
from torchmetrics import F1

import flash
from flash.core.classification import Labels
from flash.core.data.utils import download_data
from flash.image import ImageClassificationData, ImageClassifier
from flash.image.classification.data import ImageClassificationPreprocess

# 1. Download the data
# This is a subset of the movie poster genre prediction data set from the paper
# “Movie Genre Classification based on Poster Images with Deep Neural Networks” by Wei-Ta Chu and Hung-Jui Guo.
# Please consider citing their paper if you use it. More here: https://www.cs.ccu.edu.tw/~wtchu/projects/MoviePoster/
download_data("https://pl-flash-data.s3.amazonaws.com/movie_posters.zip", "data/")

# 2. Load the data
genres = ["Action", "Romance", "Crime", "Thriller", "Adventure"]


def load_data(data: str, root: str = 'data/movie_posters') -> Tuple[List[str], List[List[int]]]:
    metadata = pd.read_csv(osp.join(root, data, "metadata.csv"))
    return ([osp.join(root, data, row['Id'] + ".jpg") for _, row in metadata.iterrows()],
            [[int(row[genre]) for genre in genres] for _, row in metadata.iterrows()])


train_files, train_targets = load_data('train')
test_files, test_targets = load_data('test')

datamodule = ImageClassificationData.from_files(
    train_files=train_files,
    train_targets=train_targets,
    test_files=test_files,
    test_targets=test_targets,
    val_split=0.1,  # Use 10% of the training data to create the validation set.
    image_size=(128, 128),
)

# 3. Build the model
model = ImageClassifier(
    backbone="resnet18",
    num_classes=len(genres),
    multi_label=True,
    metrics=F1(num_classes=len(genres)),
)

# 4. Create the trainer. Train for 10 epochs.
trainer = flash.Trainer(max_epochs=10)

# 5. Train the model
trainer.finetune(model, datamodule=datamodule, strategy="freeze")

# 6. Predict what's on a few images!
# Serialize predictions as labels, low threshold to see more predictions.
model.serializer = Labels(genres, multi_label=True, threshold=0.25)

predictions = model.predict([
    "data/movie_posters/predict/tt0085318.jpg",
    "data/movie_posters/predict/tt0089461.jpg",
    "data/movie_posters/predict/tt0097179.jpg",
])

print(predictions)

# 7. Save it!
trainer.save_checkpoint("image_classification_multi_label_model.pt")

I get different F1 metrics for the test set depending on how I run the evaluation:

# Run test with trainer:

trainer.test(model, datamodule=datamodule)

# stdout:
# {'test_binary_cross_entropy_with_logits': 0.5449734330177307,
# 'test_f1': 0.46086955070495605}

# Run test manually:

import torch

from flash.core.data.data_source import DefaultDataKeys

metric = F1(num_classes=len(genres))

for batch in datamodule.test_dataloader():
    image_tensor = batch[DefaultDataKeys.INPUT]
    target = batch[DefaultDataKeys.TARGET]
    with torch.no_grad():
        y_hat = model(image_tensor)
    prediction = model.to_metrics_format(y_hat)
    metric(prediction, target)

print(metric.compute())

# stdout:
# tensor(0.3891)
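
Two possible sources of the gap in the manual loop (assumptions on my part, not confirmed in this thread): the model is never switched out of training mode, so the resnet18 batch-norm layers use per-batch statistics rather than the running averages that trainer.test() uses, and the raw dataloader batch may not pass through the same preprocessing the trainer applies. A minimal diagnostic variant that at least rules out the first:

# Diagnostic variant (a suggestion, not a confirmed fix from this thread):
# switch to eval mode so batch-norm uses running statistics, as it does
# inside trainer.test().
model.eval()
metric = F1(num_classes=len(genres))

with torch.no_grad():
    for batch in datamodule.test_dataloader():
        y_hat = model(batch[DefaultDataKeys.INPUT])
        metric(model.to_metrics_format(y_hat), batch[DefaultDataKeys.TARGET])

print(metric.compute())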

To Reproduce

Steps to reproduce the behavior:

  1. Copy paste the example training code from the link above
  2. Add the test evaluation code above
  3. Save and run the script
  4. Compare the two F1 scores

Expected behavior

The two F1 metrics should be identical.
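
For what it's worth, the metric bookkeeping itself should not explain the gap: with torchmetrics' default micro averaging, batch-wise accumulation and a single update over all samples produce identical scores. A standalone sanity check on random data (not data from this issue):

import torch
from torchmetrics import F1

torch.manual_seed(0)
preds = torch.rand(100, 5)              # multi-label probabilities for 5 genres
target = torch.randint(0, 2, (100, 5))  # binary targets

batched = F1(num_classes=5)
for p, t in zip(preds.split(25), target.split(25)):
    batched(p, t)                       # accumulate tp/fp/fn statistics per batch

whole = F1(num_classes=5)
whole(preds, target)                    # single update on all samples at once

# Both paths accumulate the same counts, so the scores match.
assert torch.allclose(batched.compute(), whole.compute())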

Environment

  • PyTorch Version: 1.8.0
  • PyTorch-Lightning: 1.3.5
  • Lightning-Flash: 0.3.2
  • Torchmetrics: 0.3.2
  • OS (e.g., Linux): macOS
  • How you installed PyTorch (conda, pip, source): pip
  • Python version: 3.8.8
  • CUDA/cuDNN version: N/A
  • GPU models and configuration: None
  • Any other relevant information: None

Additional context

None

Top GitHub Comments

SkafteNicki commented, Jun 29, 2021

IMO, if this were barebones Lightning, I would expect a batch to look the same regardless of whether it is inside the Lightning trainer or outside, because Lightning is “just reorganised PyTorch code”. Whether that should also hold in Flash, I am not completely sure; it depends entirely on the design philosophy of Flash. Since Flash sits at a higher level of abstraction than Lightning, I am fine with this not being supported. @ethanwharris, is it possible to extract the data pipeline such that it would be possible to do something like

model(pipeline(batch))
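
A toy sketch of what that shape could look like; everything below is hypothetical stand-in code, not the Flash 0.3.x API:

import torch
from torch import nn

# Hypothetical stand-in for whatever per-batch preprocessing
# trainer.test() applies, e.g. a normalization step.
def pipeline(batch: torch.Tensor) -> torch.Tensor:
    return (batch - batch.mean()) / (batch.std() + 1e-8)

model = nn.Linear(8, 5)          # stand-in for the Flash task
batch = torch.randn(4, 8)        # stand-in for a raw dataloader batch
logits = model(pipeline(batch))  # same preprocessing in and out of the trainer
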
stale[bot] commented, Aug 29, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
