
Multi-node / multi-GPU training and repeat logging on each process

See original GitHub issue

How do we deal with repetitive warnings that can't be shut off in a multi-node / multi-GPU environment?

e.g. at BigScience we started using the HF Tokenizer, and now this message gets repeated hundreds of times:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

It comes from: https://github.com/huggingface/transformers/blob/efea0f868bd381244e3cef51b388293e41a36d1e/src/transformers/tokenization_utils_base.py#L1934-L1936
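
The repetition itself is just a consequence of every rank running the same tokenizer-loading code, roughly this pattern (the checkpoint name below is only an example, not what we actually load):

    from transformers import AutoTokenizer

    # Every rank launched by the distributed launcher executes this same line,
    # so any warning emitted while the tokenizer is loaded prints once per
    # process, i.e. hundreds of times across the whole job.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example checkpoint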

The only way I have found to fix this is to push the logging level to ERROR on the replicas:

    import logging
    import transformers

    # args.rank comes from the training script's launcher arguments.
    if args.rank == 0:
        transformers.utils.logging.set_verbosity(logging.INFO)
    else:
        transformers.utils.logging.set_verbosity(logging.ERROR)

But then, if there is actually a real warning in some process, I won't see it.
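
One middle ground I have been sketching (only a sketch: the helper name and per-rank file naming are made up, and it leans on the disable_default_handler / add_handler utilities that transformers.utils.logging exposes) is to keep warnings enabled on every rank but move them off the shared console on replicas, so nothing is silently dropped:

    import logging
    import transformers

    def route_replica_warnings(rank: int) -> None:
        # Hypothetical helper: rank 0 keeps the default console handler,
        # every other rank writes transformers warnings to its own file.
        if rank == 0:
            return
        # Drop the default stderr handler on this rank and capture WARNING
        # and above in a per-rank file instead of the shared console.
        transformers.utils.logging.disable_default_handler()
        handler = logging.FileHandler(f"warnings-rank{rank}.log")
        handler.setLevel(logging.WARNING)
        transformers.utils.logging.add_handler(handler)

Rank 0 then prints as before, and after a crash the per-rank files can be grepped for real warnings instead of scrolling past hundreds of duplicates. Still, this feels like a workaround rather than a proper fix.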

Any good suggestions here?

Thank you!

p.s. As background: each component we use in Megatron-DeepSpeed emits just a few dozen of these lines, which then get multiplied by, say, 512 or 1024 processes. The log file becomes unusable, and troubleshooting a crash turns into a very difficult experience. Hence I really need a way to avoid logging anything that is not pertinent to a specific replica process. Note that many processes are not replicas of the rank 0 process and do unique things, e.g. in the pipeline setup; but in the case of the tokenizer, the output is identical on all processes.
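
Another direction, again only a sketch using nothing beyond the standard library (the LogOnceFilter class below is hypothetical, not something transformers ships), would be to de-duplicate within each process, so even a replica that emits dozens of identical lines contributes each one only once:

    import logging
    import transformers

    class LogOnceFilter(logging.Filter):
        """Hypothetical filter: let each distinct message through once per process."""

        def __init__(self):
            super().__init__()
            self._seen = set()

        def filter(self, record):
            key = record.getMessage()
            if key in self._seen:
                return False  # drop exact repeats within this process
            self._seen.add(key)
            return True

    # Filters must sit on a handler to catch records propagated from child
    # loggers such as transformers.tokenization_utils_base, so attach the
    # filter to the handlers of the transformers root logger.
    for handler in transformers.utils.logging.get_logger().handlers:
        handler.addFilter(LogOnceFilter())

That keeps a single copy of each message per process, but across 512 or 1024 processes every warning still shows up once per rank unless it is combined with the rank-based routing above.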

@sgugger, @LysandreJik

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction
LysandreJik commented on Nov 24, 2021

Sounds great to me!

1 reaction
sgugger commented on Nov 19, 2021

This looks like a great solution to me. Wdyt @LysandreJik ?


Top Results From Across the Web

  • Efficient Training on Multiple GPUs - Hugging Face
  • Multi-GPU and distributed training - Keras
  • Single Node, Multi GPU Training - Flyte
  • Trivial Multi-Node Training With Pytorch-Lightning
  • GPU training (Intermediate) - PyTorch Lightning - Read the Docs
