ProgressBar ETA with IterableDataset where __len__ undefined
❓ Questions/Help/Support
I’ve been successfully using ignite with regular `Dataset`/`TensorDataset` classes in the past. These have a fixed length and are tied to a `DataLoader` with a `DistributedSampler`. Keeping all other training hyper-parameters equal, if I increase the number of nodes/GPUs, I’ve always noticed that the ETA displayed by the `ProgressBar` decreases.
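For reference, that earlier map-style setup looked roughly like the sketch below. The dataset, sizes, and the dummy train step are illustrative placeholders, and the default process group is assumed to have been initialized by the launcher:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader
from torch.utils.data.distributed import DistributedSampler
from ignite.engine import Engine
from ignite.contrib.handlers import ProgressBar

# Map-style dataset with a fixed, known length (illustrative data).
dataset = TensorDataset(torch.randn(10000, 16), torch.randint(0, 2, (10000,)))

# DistributedSampler splits the indices across ranks; it assumes
# torch.distributed.init_process_group(...) has already been called.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

def train_step(engine, batch):
    # Placeholder step; the real step does forward/backward/optimizer.step().
    return {"loss": 0.0}

trainer = Engine(train_step)

# len(loader) is known, so the bar has a total; with more ranks each rank
# gets fewer batches, and the displayed ETA shrinks accordingly.
ProgressBar(persist=True).attach(trainer)
trainer.run(loader, max_epochs=1)
```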
Then, I switched to an `IterableDataset` where the length was computable in advance, so `__len__` was defined. There is no `DistributedSampler` in this case because the dataset is iterable: the data files are grouped into distinct subsets in advance and assigned to different ranks. In this scenario too, I noticed that, keeping all else equal, the ETA displayed by the `ProgressBar` decreases when the number of nodes/GPUs increases. There is some earlier discussion on this here: https://github.com/pytorch/ignite/issues/1263.
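The iterable version with a computable length looked roughly like this; the file list, the per-file record count, and `read_records` are hypothetical placeholders for my actual data pipeline:

```python
import torch.distributed as dist
from torch.utils.data import IterableDataset

class ShardedFileDataset(IterableDataset):
    def __init__(self, files, records_per_file):
        rank = dist.get_rank() if dist.is_initialized() else 0
        world_size = dist.get_world_size() if dist.is_initialized() else 1
        # Each rank keeps only its own subset of files, so no
        # DistributedSampler is involved.
        self.files = files[rank::world_size]
        self.records_per_file = records_per_file

    def __len__(self):
        # The length is computable in advance, so the DataLoader (and hence
        # the ProgressBar) can report a total and an ETA.
        return len(self.files) * self.records_per_file

    def __iter__(self):
        for path in self.files:
            # read_records is a hypothetical per-file reader.
            yield from read_records(path)
```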
Finally, I came across a setting with a massive dataset whose length (i.e., number of data points) was not computable in advance. So I removed the `__len__` definition, making the `IterableDataset` more generic.
Unfortunately, in this final setting, I find that the ETA displayed by the `ProgressBar` does not decrease when the number of nodes/GPUs increases. I tried training for a fixed 50000 iterations, i.e., an `epoch_length` of 50000. I notice that if I train on 1 GPU, the ETA is much lower than if I train on more than 1 GPU. I also notice that the overall time taken per iteration is much lower when 1 GPU is used.
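A self-contained sketch of this last setting, with a made-up streaming dataset standing in for my real per-rank file reader. Note that, as I understand it, `epoch_length` counts iterations per process, which is relevant to how the bar's total behaves:

```python
import torch
from torch.utils.data import IterableDataset, DataLoader
from ignite.engine import Engine
from ignite.contrib.handlers import ProgressBar

class StreamDataset(IterableDataset):
    # No __len__: the number of records is not computable in advance.
    def __iter__(self):
        while True:
            yield torch.randn(16), torch.randint(0, 2, (1,)).item()

loader = DataLoader(StreamDataset(), batch_size=32)

def train_step(engine, batch):
    return {"loss": 0.0}  # placeholder step

trainer = Engine(train_step)
ProgressBar(persist=True).attach(trainer)

# epoch_length is counted per process: every rank runs 50000 iterations,
# so the bar's total is the same regardless of how many GPUs are used.
trainer.run(loader, max_epochs=1, epoch_length=50000)
```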
I’m confused by this behavior; I don’t think I’m doing anything incorrect. Could you please explain what may be happening?
Issue Analytics
- Created 3 years ago
- Comments: 9
@vfdev-5 oh, I see, I am using `MetricsLambda` with a callable method to perform a `torch.distributed.all_reduce` of some of the metrics like NLL and Accuracy, like this one with `average_distributed_scalar()` (sketched roughly below). Does this mean I necessarily need to stop doing that and switch to using `idist.set_local_rank()` with my local_rank, so that the `sync_all_reduce` decorator for the metrics gets triggered? Am I missing something else that needs to be upgraded?

I think it’d be cool if you could do a PR for the above repo to allow for these changes in ignite. It is a useful example repo to highlight new ignite functionality.
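For context, the pattern I am referring to looks roughly like this; `average_distributed_scalar` here is an illustrative re-implementation of the helper in that repo, not the exact code:

```python
import torch
import torch.distributed as dist
from ignite.metrics import Accuracy, Loss, MetricsLambda

def average_distributed_scalar(scalar):
    # Average a computed metric value across all processes
    # (a no-op when no process group has been initialized).
    if not (dist.is_available() and dist.is_initialized()):
        return scalar
    # "cuda" assumes one GPU per process, set via torch.cuda.set_device(local_rank).
    t = torch.tensor(scalar, dtype=torch.float, device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    return (t / dist.get_world_size()).item()

metrics = {
    "nll": Loss(torch.nn.CrossEntropyLoss()),
    "accuracy": Accuracy(),
}
# MetricsLambda calls the function on the *computed* metric values, so the
# all_reduce runs once per evaluation on every rank.
metrics["average_nll"] = MetricsLambda(average_distributed_scalar, metrics["nll"])
metrics["average_accuracy"] = MetricsLambda(average_distributed_scalar, metrics["accuracy"])
```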
@vfdev-5 gotcha, but why would `idist` even be invoked in my case, leading to that warning? I am not even importing it explicitly in my code; I have been using ignite for its other features. In fact, I did not even know about the `idist` feature until I saw this warning.

Once I get a better understanding of this, I’ll check the phrasing of the warning on the PR.