Multi-node / multi-GPU training and repeated logging on each process
How do we deal with repetitive warnings that can’t be shut off in a multi-node/multi-GPU environment?
e.g. at BigScience we started using the HF Tokenizer and now this message gets repeated hundreds of times:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The only way for me to fix this is to push the logging level to ERROR on the replicas:
import logging
import transformers

# transformers' verbosity levels are the standard logging levels
if args.rank == 0:
    transformers.utils.logging.set_verbosity(logging.INFO)
else:
    transformers.utils.logging.set_verbosity(logging.ERROR)
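For reference, the same thing can be written with transformers’ own verbosity helpers, with the rank read from the launcher environment (just a sketch: the RANK env var is an assumption about torchrun / torch.distributed.launch style launchers, not something taken from this setup):

import os
import transformers

# Sketch: quiet the replicas; RANK is set by torchrun-style launchers (assumption).
if int(os.environ.get("RANK", "0")) == 0:
    transformers.utils.logging.set_verbosity_info()   # main process keeps INFO
else:
    transformers.utils.logging.set_verbosity_error()  # replicas only report errors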
But either way, if a real warning does show up in one of those processes, I won’t see it.
Any good suggestions here?
Thank you!
p.s. As background: each component we use in Megatron-DeepSpeed emits just a few dozen of these, which then get multiplied by, say, 512 or 1024 processes. The log file becomes unusable, and troubleshooting when things crash becomes very difficult. Hence I really need a way not to log anything that isn’t actually pertinent to a specific replica process. Moreover, many processes aren’t replicas of the rank 0 process and do unique things, e.g. in the pipeline setup, but in the case of the tokenizer the message is identical on all processes.
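One middle ground I can imagine (just a sketch, nothing agreed on in this thread, and not a transformers API): install a handler-level filter that drops only the known advisory messages on non-zero ranks, so genuine warnings still surface on every process. The RANK env var and all names below are my own placeholders.

import logging
import os

import transformers.utils.logging as hf_logging

# Advisory messages that are safe to hide on replicas (matched as substrings).
ADVISORY_SNIPPETS = (
    "Special tokens have been added in the vocabulary",
)

class ReplicaAdvisoryFilter(logging.Filter):
    """Hypothetical filter: hide known advisory messages on replica processes only."""

    def __init__(self, rank: int):
        super().__init__()
        self.rank = rank

    def filter(self, record: logging.LogRecord) -> bool:
        if self.rank == 0:
            return True  # rank 0 keeps everything
        msg = record.getMessage()
        # keep the record unless it matches a known advisory message
        return not any(snippet in msg for snippet in ADVISORY_SNIPPETS)

def install_replica_filter() -> None:
    # RANK is an assumption about torchrun-style launchers.
    rank = int(os.environ.get("RANK", "0"))
    root = hf_logging.get_logger()  # transformers' library root logger
    # Handler-level filters also apply to records propagated from child loggers,
    # which is where the tokenizer message comes from.
    for handler in root.handlers:
        handler.addFilter(ReplicaAdvisoryFilter(rank))

install_replica_filter()  # call once, early in each process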
Sounds great to me!
This looks like a great solution to me. Wdyt @LysandreJik?