Distributed validation
See original GitHub issue

❓ Questions and Help
I’m working on a branch where I implemented validation inference at every checkpoint. Everything was working fine until the migration from the deprecated `torch.distributed.deprecated` to the new `torch.distributed`. Now either the `DataLoader` breaks on one of the processes or, if I run the inference only in the main process, it hangs there forever.
Sample code:
    if iteration % checkpoint_period == 0 or iteration == max_iter:
        checkpointer.save("model_{:07d}".format(iteration), **arguments)
        if is_main_process():
            if val_data_loader is not None:
                logger.info("Evaluating on validation data set")
                iou_types = ("bbox",)
                if cfg.MODEL.MASK_ON:
                    iou_types = iou_types + ("segm",)
                inference(
                    model,
                    val_data_loader,
                    iou_types=iou_types,
                    box_only=cfg.MODEL.RPN_ONLY,
                    device=cfg.MODEL.DEVICE,
                    expected_results=cfg.TEST.EXPECTED_RESULTS,
                    expected_results_sigma_tol=cfg.TEST.EXPECTED_RESULTS_SIGMA_TOL,
                    verbose=False,
                )
                model.train()  # reset training flag
        synchronize()
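A common cause of this kind of hang is that the validation routine performs collective communication internally (e.g. gathering predictions across ranks), so calling it only on the main process leaves every other rank blocked in the collective forever. A minimal sketch of the safer pattern, where every rank runs validation and only rank 0 reports, is shown below. This is not the maskrcnn-benchmark `inference` function; the model, loader, and MSE loss here are hypothetical stand-ins for illustration:

```python
import torch
import torch.distributed as dist


def is_distributed():
    return dist.is_available() and dist.is_initialized()


def get_rank():
    return dist.get_rank() if is_distributed() else 0


@torch.no_grad()
def validate(model, data_loader, device="cpu"):
    """Run validation on EVERY rank; reduce the loss to all ranks.

    All ranks must call this together: all_reduce below is a
    collective, and invoking it on only one rank would hang the job.
    """
    model.eval()
    total_loss = torch.zeros(1, device=device)
    count = torch.zeros(1, device=device)
    for inputs, targets in data_loader:
        outputs = model(inputs.to(device))
        loss = torch.nn.functional.mse_loss(outputs, targets.to(device))
        total_loss += loss
        count += 1
    if is_distributed():
        dist.all_reduce(total_loss)  # every rank participates
        dist.all_reduce(count)
    model.train()  # restore training mode
    # Only the main process reports; the collectives above already ran
    # on all ranks, so nobody is left waiting.
    if get_rank() == 0:
        return (total_loss / count).item()
    return None
```

Applied to the snippet above, this would mean moving the `inference(...)` call outside the `is_main_process()` guard so all ranks enter it, and keeping only the logging/saving of results inside the guard.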
Is there any workaround for this?
Issue Analytics
- State:
- Created: 5 years ago
- Comments: 14 (7 by maintainers)
Top GitHub Comments
@fmassa Hi. I additionally tracked the training at every iteration and found that it can actually train for the first 7 to 9 iterations, then breaks with the following error. It seems that some index is out of bounds in CUDA, but I cannot figure out which one. Here is part of the error output; since much of it follows a repeated pattern, I provide only a small fraction.

Update: I also found that it may occur occasionally even with single-GPU training.
@fmassa
Thanks for your reply. I modified my code and it has not hung since. I actually ran several rounds successfully after that, but then suddenly got stuck on a new problem with distributed training. Here is the log; I updated my PyTorch to the latest version, but that did not help.