
Handling empty datasets in distributed metric computation


🐛 Bug description

Metric computation does not work properly in distributed settings when some processes do not handle any batch of the dataset. This becomes a problem when a small validation or test dataset is distributed across processes in an imbalanced manner.

How to Reproduce

Create a Python script named main.py with the following content.

import torch
import ignite.distributed as idist
from torch.utils.data import IterableDataset, DataLoader
from ignite.metrics import Loss
from ignite.engine.engine import Engine
from ignite.engine.events import Events

class SampleDataset(IterableDataset):
    def __iter__(self):
        # Only rank 0 yields a batch; every other rank sees an empty dataset.
        if idist.get_rank() == 0:
            yield torch.zeros((2, 3)), torch.ones((2, 3))

def report_metrics(engine):
    print(engine.state.metrics)

def test(local_rank):
    data_loader = DataLoader(SampleDataset(), batch_size=None)
    # Identity process function: each batch is passed through as (y_pred, y).
    engine = Engine(lambda _engine, batch: batch)
    Loss(torch.nn.BCELoss(reduction="mean")).attach(engine, "loss")
    engine.add_event_handler(Events.COMPLETED, report_metrics)
    engine.run(data_loader)

with idist.Parallel(backend="gloo") as parallel:
    parallel.run(test)

Run the following command inside a CPU Docker container with PyTorch and Ignite installed.

python -m torch.distributed.launch --nproc_per_node=2 --use_env main.py

Problem 1

The command terminated with an error. Part of the output is shown below.

terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at /opt/conda/conda-bld/pytorch_1595629403081/work/third_party/gloo/gloo/transport/tcp/pair.cc:490] op.preamble.length <= op.nbytes. 8 vs 4

It seems there is a type inconsistency (int vs. float) inside idist.all_reduce() when compute() is called, because not all processes have called update() at least once. A simple fix could be changing this line to self._sum = 0.0.

However, this issue could affect other metrics as well. We probably need unit tests covering this scenario for all metrics.
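The mismatch can be illustrated without any distributed machinery. Below is a minimal single-process sketch (all names are hypothetical, not Ignite's actual implementation): a mock all_reduce refuses mixed types across ranks, the way gloo refuses mismatched buffer sizes, and resetting the accumulator to 0.0 instead of 0 keeps every rank's contribution a float.

```python
def mock_all_reduce(per_rank_values):
    """Stand-in for idist.all_reduce(): sums one value per rank, but
    refuses mixed types the way gloo refuses mismatched buffer sizes."""
    types = {type(v) for v in per_rank_values}
    if len(types) > 1:
        raise RuntimeError(f"buffer mismatch across ranks: {types}")
    return sum(per_rank_values)

class LossAccumulator:
    def __init__(self, initial_sum):
        self._sum = initial_sum  # int 0 reproduces the bug; float 0.0 fixes it
        self._num_examples = 0

    def update(self, batch_loss, n):
        self._sum += batch_loss * n
        self._num_examples += n

# Rank 0 saw one batch; rank 1 saw none, so its _sum keeps the reset value.
buggy = [LossAccumulator(0), LossAccumulator(0)]
buggy[0].update(0.5, 2)
try:
    mock_all_reduce([m._sum for m in buggy])  # float on rank 0, int on rank 1
except RuntimeError as e:
    print("buggy reset:", e)

fixed = [LossAccumulator(0.0), LossAccumulator(0.0)]
fixed[0].update(0.5, 2)
print("fixed reset:", mock_all_reduce([m._sum for m in fixed]))  # 1.0
```

With tensors the analogous mismatch is int64 (8 bytes) vs. float32 (4 bytes), which matches the "8 vs 4" in the gloo error above.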

Problem 2

In the above script, if we change Loss(...) to the precision or recall metric (e.g. Precision()), we get the following error message.

Engine run is terminating due to exception: Precision must have at least one example before it can be computed..

The issue is that the verification should actually happen after idist.all_reduce(). Although some processes may have seen an empty dataset, the metric is still valid collectively.
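The proposed ordering can be sketched in plain Python (hypothetical names; summing a list stands in for idist.all_reduce across ranks): reduce the counts first, then validate the collective total rather than the local one.

```python
def compute_precision(per_rank_tp, per_rank_positives):
    """Sketch of compute(): all-reduce BEFORE the 'at least one example'
    check, so a rank with an empty dataset does not abort a metric that
    is valid collectively."""
    tp = sum(per_rank_tp)                # stands in for idist.all_reduce
    positives = sum(per_rank_positives)  # stands in for idist.all_reduce
    if positives == 0:  # check the collective count, not the local one
        raise RuntimeError(
            "Precision must have at least one example before it can be computed."
        )
    return tp / positives

# Rank 0 saw 4 predicted positives (3 correct); rank 1 saw nothing.
print(compute_precision([3, 0], [4, 0]))  # 0.75
```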

Problem 3

After fixing Problem 2, there is still an issue with multi-label precision or recall. For example, changing Loss(...) to Precision(is_multilabel=True, average=True) and running the script will give the following error:

Engine run is terminating due to exception: 'float' object has no attribute 'mean'.

The issue is with this line. Because not all processes have called update() at least once, there is again a type inconsistency: in some processes self._true_positives is a plain float, while in others it is a tensor.
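One possible fix can be sketched in plain Python (hypothetical names; zip/sum stands in for the all-reduce, lists stand in for per-label tensors): coerce the scalar reset value into a per-label vector before reduction, so every rank contributes the same shape and the final averaging step never sees a bare float. Only the shape/type coercion is shown; the division by predicted positives is elided.

```python
def collective_true_positives(per_rank_tp, num_labels):
    """Coerce each rank's state to a per-label vector before reducing.
    Ranks that never called update() still hold the scalar reset value 0.0."""
    coerced = [tp if isinstance(tp, list) else [0.0] * num_labels
               for tp in per_rank_tp]
    # Element-wise sum across ranks, standing in for idist.all_reduce.
    return [sum(col) for col in zip(*coerced)]

# Rank 0 accumulated per-label true positives; rank 1 kept the scalar 0.0.
totals = collective_true_positives([[1.0, 2.0, 3.0], 0.0], num_labels=3)
print(sum(totals) / len(totals))  # the averaging step that crashed before: 2.0
```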

Environment

  • PyTorch Version: 1.6.0
  • Ignite Version: 0.4.1
  • OS: Linux
  • How you installed Ignite (conda, pip, source): pip
  • Python version: 3.7.7
  • Any other relevant information: N/A

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 8 (1 by maintainers)

Top GitHub Comments

1 reaction

vfdev-5 commented, Aug 17, 2020

@linhr thanks for the report! Let me reproduce and investigate the issue.

cc @n2cholas: as we are working on metrics right now, we may have to take this into account.

0 reactions

vfdev-5 commented, Oct 28, 2020
