ProgressBar ETA with IterableDataset where __len__ undefined
❓ Questions/Help/Support
I’ve been successfully using ignite with regular `Dataset`/`TensorDataset` classes in the past. These have a fixed length and are tied to a `DataLoader` with a `DistributedSampler`. Keeping all other training hyper-parameters equal, if I increase the number of nodes/GPUs, I’ve always noticed that the ETA displayed by the `ProgressBar` decreases.
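For reference, that earlier map-style setup looked roughly like the sketch below. The dataset, sizes, and the dummy train step are illustrative placeholders, and the default process group is assumed to have been initialized by the launcher:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader
from torch.utils.data.distributed import DistributedSampler
from ignite.engine import Engine
from ignite.contrib.handlers import ProgressBar

# Map-style dataset with a fixed, known length (illustrative data).
dataset = TensorDataset(torch.randn(10000, 16), torch.randint(0, 2, (10000,)))

# DistributedSampler splits the indices across ranks; it assumes
# torch.distributed.init_process_group(...) has already been called.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

def train_step(engine, batch):
    # Placeholder step; the real step does forward/backward/optimizer.step().
    return {"loss": 0.0}

trainer = Engine(train_step)

# len(loader) is known, so the bar has a total; with more ranks each rank
# gets fewer batches, and the displayed ETA shrinks accordingly.
ProgressBar(persist=True).attach(trainer)
trainer.run(loader, max_epochs=1)
```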
Then, I switched to an `IterableDataset` where the length was computable in advance, so `__len__` was defined. There is no `DistributedSampler` in this case because the dataset is iterable: the data files are grouped into distinct subsets in advance and assigned to different ranks. In this scenario too, I noticed that, keeping all else equal, the ETA displayed by the `ProgressBar` decreases when the number of nodes/GPUs increases. There is some earlier discussion on this here: https://github.com/pytorch/ignite/issues/1263.
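The iterable version with a computable length looked roughly like this; the file list, the per-file record count, and `read_records` are hypothetical placeholders for my actual data pipeline:

```python
import torch.distributed as dist
from torch.utils.data import IterableDataset

class ShardedFileDataset(IterableDataset):
    def __init__(self, files, records_per_file):
        rank = dist.get_rank() if dist.is_initialized() else 0
        world_size = dist.get_world_size() if dist.is_initialized() else 1
        # Each rank keeps only its own subset of files, so no
        # DistributedSampler is involved.
        self.files = files[rank::world_size]
        self.records_per_file = records_per_file

    def __len__(self):
        # The length is computable in advance, so the DataLoader (and hence
        # the ProgressBar) can report a total and an ETA.
        return len(self.files) * self.records_per_file

    def __iter__(self):
        for path in self.files:
            # read_records is a hypothetical per-file reader.
            yield from read_records(path)
```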
Finally, I came across a setting with a massive dataset whose length (i.e., number of data points) was not computable in advance. So I removed the `__len__` definition, making the `IterableDataset` more generic.
Unfortunately, in this final setting, I find that the ETA displayed by the `ProgressBar` does not decrease when the number of nodes/GPUs increases. I tried training for a fixed 50000 iterations, i.e., an `epoch_length` of 50000. I notice that if I train on 1 GPU, the ETA is much lower than if I train on more than 1 GPU. I also notice that the overall time taken per iteration is much lower when 1 GPU is used.
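A self-contained sketch of this last setting, with a made-up streaming dataset standing in for my real per-rank file reader. Note that, as I understand it, `epoch_length` counts iterations per process, which is relevant to how the bar's total behaves:

```python
import torch
from torch.utils.data import IterableDataset, DataLoader
from ignite.engine import Engine
from ignite.contrib.handlers import ProgressBar

class StreamDataset(IterableDataset):
    # No __len__: the number of records is not computable in advance.
    def __iter__(self):
        while True:
            yield torch.randn(16), torch.randint(0, 2, (1,)).item()

loader = DataLoader(StreamDataset(), batch_size=32)

def train_step(engine, batch):
    return {"loss": 0.0}  # placeholder step

trainer = Engine(train_step)
ProgressBar(persist=True).attach(trainer)

# epoch_length is counted per process: every rank runs 50000 iterations,
# so the bar's total is the same regardless of how many GPUs are used.
trainer.run(loader, max_epochs=1, epoch_length=50000)
```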
I’m confused by this behavior; I don’t think I’m doing anything incorrect. Could you please explain what may be happening?
Issue Analytics
- Created 3 years ago
- Comments: 9
@vfdev-5 oh, I see, I am using `MetricsLambda` with a callable method to perform a `torch.distributed.all_reduce` of some of the metrics like NLL and Accuracy, like this one with `average_distributed_scalar()` (sketched roughly below). Does this mean I necessarily need to stop doing that and switch to using `idist.set_local_rank()` with my local_rank, so that the `sync_all_reduce` decorator for the metrics gets triggered? Am I missing something else that needs to be upgraded?

I think it’d be cool if you could do a PR for the above repo to allow for these changes in ignite. It is a useful example repo to highlight new ignite functionality.
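For context, the pattern I am referring to looks roughly like this; `average_distributed_scalar` here is an illustrative re-implementation of the helper in that repo, not the exact code:

```python
import torch
import torch.distributed as dist
from ignite.metrics import Accuracy, Loss, MetricsLambda

def average_distributed_scalar(scalar):
    # Average a computed metric value across all processes
    # (a no-op when no process group has been initialized).
    if not (dist.is_available() and dist.is_initialized()):
        return scalar
    # "cuda" assumes one GPU per process, set via torch.cuda.set_device(local_rank).
    t = torch.tensor(scalar, dtype=torch.float, device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    return (t / dist.get_world_size()).item()

metrics = {
    "nll": Loss(torch.nn.CrossEntropyLoss()),
    "accuracy": Accuracy(),
}
# MetricsLambda calls the function on the *computed* metric values, so the
# all_reduce runs once per evaluation on every rank.
metrics["average_nll"] = MetricsLambda(average_distributed_scalar, metrics["nll"])
metrics["average_accuracy"] = MetricsLambda(average_distributed_scalar, metrics["accuracy"])
```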
@vfdev-5 gotcha, but why would `idist` even be invoked in my case, leading to that warning? I am not even importing it explicitly in my code; I have been using ignite for its other features. In fact, I did not even know about the `idist` feature until I saw this warning.

Once I get a better understanding of this, I’ll check the phrasing of the warning on the PR.