Timer inaccurate due to asynchronous CUDA
See original GitHub issue.

Update:
- Remove the separation and log train_iter and load_data together; this removes the issue.
The reported “Data Load” and “Train Iter” times cannot be trusted because CUDA is not synchronized. It turns out the dataloader is actually pretty fast.
Unsure what to do with this info. Either we remove this timing breakdown or sprinkle in some synchronizations.
For correct timing, the following synchronize calls are necessary. Unfortunately, adding them does come at a performance cost.
# Synchronize so the load timer does not absorb GPU work queued by the previous iteration,
# and again inside the block so the timer only stops after any GPU work in the loader finishes.
torch.cuda.synchronize()
with TimeWriter(writer, EventName.ITER_LOAD_TIME, step=step):
    ray_indices, batch = next(iter_dataloader_train)
    torch.cuda.synchronize()

# Same pattern for the train step: wait for outstanding kernels before starting the timer,
# and wait for this iteration's kernels to finish before stopping it.
torch.cuda.synchronize()
with TimeWriter(writer, EventName.ITER_TRAIN_TIME, step=step) as t:
    loss_dict = self.train_iteration(ray_indices, batch, step)
    torch.cuda.synchronize()
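To see how misleading the unsynchronized numbers can be, here is a minimal, self-contained sketch (the operation and matrix size are arbitrary and not nerfstudio code): because kernel launches return to the host immediately, the unsynchronized timer mostly measures launch overhead rather than the actual GPU work.

import time
import torch

x = torch.randn(4096, 4096, device="cuda")
torch.cuda.synchronize()  # make sure setup work is done before timing

# Without sync: the timer stops before the matmul has actually finished.
start = time.perf_counter()
y = x @ x
unsync_ms = (time.perf_counter() - start) * 1e3

# With sync: wait for the kernel to complete before stopping the timer.
start = time.perf_counter()
y = x @ x
torch.cuda.synchronize()
sync_ms = (time.perf_counter() - start) * 1e3

print(f"unsynchronized: {unsync_ms:.3f} ms, synchronized: {sync_ms:.3f} ms")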
Correct Timings / Without Sync: the timing screenshots attached to the original issue are not reproduced here.
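If we decide to keep the breakdown, one option that might reduce the cost of the extra synchronizations is CUDA events. The sketch below is only an idea, not something TimeWriter currently supports; it reuses the variable names from the snippet above, and torch.cuda.Event.elapsed_time() returns milliseconds between two recorded events, so the host only has to block once per iteration before reading the results.

# Sketch only: event-based timing as a lower-overhead alternative.
load_start = torch.cuda.Event(enable_timing=True)
load_end = torch.cuda.Event(enable_timing=True)
train_end = torch.cuda.Event(enable_timing=True)

load_start.record()
ray_indices, batch = next(iter_dataloader_train)
load_end.record()
loss_dict = self.train_iteration(ray_indices, batch, step)
train_end.record()

torch.cuda.synchronize()  # single host sync before reading the elapsed times
load_ms = load_start.elapsed_time(load_end)    # time between events as seen by the CUDA stream
train_ms = load_end.elapsed_time(train_end)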
Issue Analytics
- Created a year ago
- Comments: 5 (4 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
If there are some modules in the data loading pipeline that need to live in the PyTorch computation graph, for example if the Camera is a torch.nn.Module that needs to receive gradients and update itself, then it makes sense to keep the CUDA operations in the data loading pipeline. But in that case I would rather push the camera into the “network” part instead of the “dataloader” part. In any case I would prefer to keep the dataloader part on the CPU because it is multi-thread-able.

It’s kinda weird to me that there are CUDA operations in the data loader. I think PyTorch is designed in a way that encourages you to process data using the CPU and multiple threads. For example, torch.utils.data.DataLoader does not support a dataset that uses CUDA operations to load data, because that is not multi-thread-able. The benefit of using multi-threaded CPU data loading is that, in theory, you can fully overlap loading with the network so you get zero timing for data loading. Using CUDA operations in the dataloader will always put some burden on the pipeline, though in our case it is light enough.
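For reference, here is a minimal sketch of the pattern described above (the dataset class, tensor shapes, and batch size are made up for illustration and are not nerfstudio's actual code): keep the dataset on the CPU so DataLoader can parallelize loading across worker processes, and move each batch to the GPU inside the training loop with pinned memory and non-blocking copies so the transfer can overlap with compute.

import torch
from torch.utils.data import DataLoader, Dataset

class RayDataset(Dataset):
    # Illustrative CPU-only dataset: no CUDA operations in __getitem__.
    def __init__(self, rays: torch.Tensor, pixels: torch.Tensor):
        self.rays, self.pixels = rays, pixels  # tensors stay on the CPU

    def __len__(self):
        return self.rays.shape[0]

    def __getitem__(self, idx):
        return self.rays[idx], self.pixels[idx]

loader = DataLoader(
    RayDataset(torch.randn(10_000, 6), torch.rand(10_000, 3)),
    batch_size=4096,
    num_workers=4,    # loading runs in worker processes, in parallel with training
    pin_memory=True,  # page-locked buffers enable asynchronous host-to-device copies
)

for rays, pixels in loader:
    rays = rays.cuda(non_blocking=True)      # overlaps the copy with GPU compute
    pixels = pixels.cuda(non_blocking=True)
    # ... run the training iteration on the GPU here ...

With num_workers > 0 the script entry point should be guarded by if __name__ == "__main__": on platforms that spawn worker processes.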