
Multi-node / multi-GPU training and repeat logging on each process

See original GitHub issue

How do we deal with repetitive warnings that can't be shut off in a multi-node / multi-GPU environment?

e.g. at BigScience we started using the HF Tokenizer, and now this message gets repeated hundreds of times:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

It comes from: https://github.com/huggingface/transformers/blob/efea0f868bd381244e3cef51b388293e41a36d1e/src/transformers/tokenization_utils_base.py#L1934-L1936
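
The repetition itself is just a consequence of every rank running the same tokenizer-loading code, roughly this pattern (the checkpoint name below is only an example, not what we actually load):

    from transformers import AutoTokenizer

    # Every rank launched by the distributed launcher executes this same line,
    # so any warning emitted while the tokenizer is loaded prints once per
    # process, i.e. hundreds of times across the whole job.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example checkpoint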

The only way I have found to fix this is to push the logging level to ERROR on the replicas:

    import logging
    import transformers

    # args.rank comes from the training script's launcher arguments.
    if args.rank == 0:
        transformers.utils.logging.set_verbosity(logging.INFO)
    else:
        transformers.utils.logging.set_verbosity(logging.ERROR)

But then, if there is actually a real warning in some process, I won't see it.
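
One middle ground I have been sketching (only a sketch: the helper name and per-rank file naming are made up, and it leans on the disable_default_handler / add_handler utilities that transformers.utils.logging exposes) is to keep warnings enabled on every rank but move them off the shared console on replicas, so nothing is silently dropped:

    import logging
    import transformers

    def route_replica_warnings(rank: int) -> None:
        # Hypothetical helper: rank 0 keeps the default console handler,
        # every other rank writes transformers warnings to its own file.
        if rank == 0:
            return
        # Drop the default stderr handler on this rank and capture WARNING
        # and above in a per-rank file instead of the shared console.
        transformers.utils.logging.disable_default_handler()
        handler = logging.FileHandler(f"warnings-rank{rank}.log")
        handler.setLevel(logging.WARNING)
        transformers.utils.logging.add_handler(handler)

Rank 0 then prints as before, and after a crash the per-rank files can be grepped for real warnings instead of scrolling past hundreds of duplicates. Still, this feels like a workaround rather than a proper fix.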

Any good suggestions here?

Thank you!

p.s. As background: each component we use in Megatron-DeepSpeed emits just a few dozen of these lines, which then get multiplied by, say, 512 or 1024 processes. The log file becomes unusable, and troubleshooting a crash turns into a very difficult experience. Hence I really need a way to avoid logging anything that is not pertinent to a specific replica process. Note that many processes are not replicas of the rank 0 process and do unique things, e.g. in the pipeline setup; but in the case of the tokenizer, the output is identical on all processes.
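
Another direction, again only a sketch using nothing beyond the standard library (the LogOnceFilter class below is hypothetical, not something transformers ships), would be to de-duplicate within each process, so even a replica that emits dozens of identical lines contributes each one only once:

    import logging
    import transformers

    class LogOnceFilter(logging.Filter):
        """Hypothetical filter: let each distinct message through once per process."""

        def __init__(self):
            super().__init__()
            self._seen = set()

        def filter(self, record):
            key = record.getMessage()
            if key in self._seen:
                return False  # drop exact repeats within this process
            self._seen.add(key)
            return True

    # Filters must sit on a handler to catch records propagated from child
    # loggers such as transformers.tokenization_utils_base, so attach the
    # filter to the handlers of the transformers root logger.
    for handler in transformers.utils.logging.get_logger().handlers:
        handler.addFilter(LogOnceFilter())

That keeps a single copy of each message per process, but across 512 or 1024 processes every warning still shows up once per rank unless it is combined with the rank-based routing above.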

@sgugger, @LysandreJik

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction
LysandreJik commented on Nov 24, 2021

Sounds great to me!

1 reaction
sgugger commented on Nov 19, 2021

This looks like a great solution to me. Wdyt @LysandreJik ?


Top Results From Across the Web

  • Efficient Training on Multiple GPUs - Hugging Face
  • Multi-GPU and distributed training - Keras
  • Single Node, Multi GPU Training - Flyte
  • Trivial Multi-Node Training With Pytorch-Lightning
  • GPU training (Intermediate) - PyTorch Lightning - Read the Docs
