`convert_to_singleton` seems to hang for OPT-66B
What is your question?
With the checkpoint directory prepared as follows:
$ ls 66b/
dict.txt         reshard-model_part-0-shard0.pt  reshard-model_part-3-shard0.pt  reshard-model_part-6-shard0.pt
gpt2-merges.txt  reshard-model_part-1-shard0.pt  reshard-model_part-4-shard0.pt  reshard-model_part-7-shard0.pt
gpt2-vocab.json  reshard-model_part-2-shard0.pt  reshard-model_part-5-shard0.pt
I had to hack checkpoint_utils.py a bit, since the assumption made in these lines does not hold for OPT-66B:
https://github.com/facebookresearch/metaseq/blob/ac8659de23b680005a14490d72a874613ab59381/metaseq/checkpoint_utils.py#L390-L391
I replaced them with the following:
# path to checkpoint...-shared.pt
local_path = local_path.split('.')[0] + '-shard0.pt'
paths_to_load = get_paths_to_load(local_path, suffix="shard")
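(As an aside, a slightly more defensive variant of the same hack could use os.path.splitext instead of splitting on the first '.', in case the checkpoint path contains other dots. The helper below is hypothetical, a sketch of the same idea rather than metaseq code:)

import os

def to_shard0_path(local_path: str) -> str:
    # Hypothetical helper (not metaseq code) mirroring the hack above:
    # ".../reshard-model_part-0.pt" -> ".../reshard-model_part-0-shard0.pt",
    # so that get_paths_to_load() can find the OPT-66B shard files.
    root, ext = os.path.splitext(local_path)
    return root + "-shard0" + ext

# Example: to_shard0_path("66b/reshard-model_part-0.pt")
# returns "66b/reshard-model_part-0-shard0.pt"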
Running the following:
NCCL_SHM_DISABLE=1 NCCL_DEBUG=INFO python -m metaseq.scripts.convert_to_singleton 66b/
has been taking a long time (22 hours and counting). Initially nvidia-smi showed a busy process on each of the eight GPUs; then the process on GPU 5 terminated first, and the machine has been in the following state for hours:
$ nvidia-smi
Thu Oct 13 19:24:37 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:16.0 Off |                    0 |
| N/A   54C    P0    74W / 300W |  20049MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:00:17.0 Off |                    0 |
| N/A   53C    P0    72W / 300W |  20133MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:00:18.0 Off |                    0 |
| N/A   52C    P0    73W / 300W |  19845MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:00:19.0 Off |                    0 |
| N/A   50C    P0    70W / 300W |  19857MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  Off  | 00000000:00:1A.0 Off |                    0 |
| N/A   54C    P0    76W / 300W |  20073MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  Off  | 00000000:00:1B.0 Off |                    0 |
| N/A   47C    P0    44W / 300W |   1413MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  Off  | 00000000:00:1C.0 Off |                    0 |
| N/A   50C    P0    72W / 300W |  19977MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   54C    P0    69W / 300W |  19905MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1335      C   python                          19788MiB |
|    1   N/A  N/A      1419      C   ...onda/envs/user/bin/python    19872MiB |
|    2   N/A  N/A      1420      C   ...onda/envs/user/bin/python    19584MiB |
|    3   N/A  N/A      1421      C   ...onda/envs/user/bin/python    19596MiB |
|    4   N/A  N/A      1422      C   ...onda/envs/user/bin/python    19812MiB |
|    6   N/A  N/A      1424      C   ...onda/envs/user/bin/python    19716MiB |
|    7   N/A  N/A      1425      C   ...onda/envs/user/bin/python    19644MiB |
+-----------------------------------------------------------------------------+
Is there something obviously wrong here, or something I should try instead? Just in case it really is supposed to take this long, I have left it running. The last few INFO-level log lines look like this:
(...)
i-0b2d24dbd20c27dd0:1422:3388 [4] NCCL INFO Channel 14 : 4[1a0] -> 2[180] via P2P/indirect/6[1c0]
i-0b2d24dbd20c27dd0:1423:3387 [5] NCCL INFO Channel 14 : 5[1b0] -> 3[190] via P2P/indirect/1[170]
i-0b2d24dbd20c27dd0:1419:3383 [1] NCCL INFO Channel 14 : 1[170] -> 7[1d0] via P2P/indirect/3[190]
i-0b2d24dbd20c27dd0:1422:3388 [4] NCCL INFO Channel 07 : 4[1a0] -> 3[190] via P2P/indirect/0[160]
i-0b2d24dbd20c27dd0:1335:3382 [0] NCCL INFO Channel 07 : 0[160] -> 7[1d0] via P2P/indirect/4[1a0]
i-0b2d24dbd20c27dd0:1422:3388 [4] NCCL INFO Channel 15 : 4[1a0] -> 3[190] via P2P/indirect/0[160]
i-0b2d24dbd20c27dd0:1335:3382 [0] NCCL INFO Channel 15 : 0[160] -> 7[1d0] via P2P/indirect/4[1a0]
i-0b2d24dbd20c27dd0:1419:3383 [1] NCCL INFO comm 0x7f5f78003090 rank 1 nranks 8 cudaDev 1 busId 170 - Init COMPLETE
i-0b2d24dbd20c27dd0:1420:3386 [2] NCCL INFO comm 0x7f7408003090 rank 2 nranks 8 cudaDev 2 busId 180 - Init COMPLETE
i-0b2d24dbd20c27dd0:1422:3388 [4] NCCL INFO comm 0x7fdfc8003090 rank 4 nranks 8 cudaDev 4 busId 1a0 - Init COMPLETE
i-0b2d24dbd20c27dd0:1335:3382 [0] NCCL INFO comm 0x7f5b60003090 rank 0 nranks 8 cudaDev 0 busId 160 - Init COMPLETE
i-0b2d24dbd20c27dd0:1424:3384 [6] NCCL INFO comm 0x7fd82c003090 rank 6 nranks 8 cudaDev 6 busId 1c0 - Init COMPLETE
i-0b2d24dbd20c27dd0:1423:3387 [5] NCCL INFO comm 0x7fd544003090 rank 5 nranks 8 cudaDev 5 busId 1b0 - Init COMPLETE
i-0b2d24dbd20c27dd0:1421:3389 [3] NCCL INFO comm 0x7f9c64003090 rank 3 nranks 8 cudaDev 3 busId 190 - Init COMPLETE
i-0b2d24dbd20c27dd0:1425:3385 [7] NCCL INFO comm 0x7f3fe0003090 rank 7 nranks 8 cudaDev 7 busId 1d0 - Init COMPLETE
i-0b2d24dbd20c27dd0:1335:1335 [0] NCCL INFO Launch mode Parallel
What’s your environment?
- metaseq Version: 7828d72815a9a581ab47b95876d38cb262741883 (Oct 5 main)
- PyTorch Version: 1.12.1+cu113
- OS: Ubuntu 18.04.6 LTS
- How you installed metaseq: pip
- Build command you used (if compiling from source): N.A.
- Python version: 3.10
- CUDA/cuDNN version: CUDA 11.8
- GPU models and configuration: 8 x V100 SXM2 32 GB
@punitkoura 517d7ad indeed works 🎉. I have checked the generated tokens and they look reasonable.
@tangbinh The branching happens here: https://github.com/facebookresearch/metaseq/blob/main/metaseq/distributed/utils.py#L42 … If a distributed port is specified, we assume a Slurm configuration. If it is unspecified, we go down the if/else tree to correctly infer single-node init: https://github.com/facebookresearch/metaseq/blob/main/metaseq/distributed/utils.py#L53
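(For illustration, the branching described above amounts to roughly the following. This is a simplified, hypothetical sketch rather than the exact metaseq code; the function and config field names are assumptions:)

import os
import random

def infer_init_method(cfg):
    # Sketch only: a configured distributed port is taken to mean a
    # Slurm-style launch; otherwise, with multiple GPUs on one node,
    # fall back to a local TCP rendezvous for single-node init.
    if cfg.distributed_port > 0:
        # Slurm path: the master node comes from the Slurm environment
        # (bracketed node ranges are not handled in this sketch).
        node_list = os.environ.get("SLURM_JOB_NODELIST", "localhost")
        master = node_list.split(",")[0]
        cfg.distributed_init_method = "tcp://{}:{}".format(
            master, cfg.distributed_port
        )
    elif cfg.distributed_world_size > 1:
        # Single-node path: rendezvous on a random local port.
        cfg.distributed_init_method = "tcp://localhost:{}".format(
            random.randint(10000, 20000)
        )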