`convert_to_singleton` seems to hang for OPT-66B

What is your question?

With the directory prepared as follows:

$ ls 66b/
dict.txt         reshard-model_part-0-shard0.pt  reshard-model_part-3-shard0.pt  reshard-model_part-6-shard0.pt
gpt2-merges.txt  reshard-model_part-1-shard0.pt  reshard-model_part-4-shard0.pt  reshard-model_part-7-shard0.pt
gpt2-vocab.json  reshard-model_part-2-shard0.pt  reshard-model_part-5-shard0.pt

I had to hack checkpoint_utils.py a bit, since this assumption isn’t true for OPT-66B: https://github.com/facebookresearch/metaseq/blob/ac8659de23b680005a14490d72a874613ab59381/metaseq/checkpoint_utils.py#L390-L391

I replaced those lines with the following:

    # rewrite the path to point at the corresponding ...-shard0.pt file
    local_path = local_path.split('.')[0] + '-shard0.pt'
    paths_to_load = get_paths_to_load(local_path, suffix="shard")
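
For reference, here is a self-contained sketch of what that rewrite does, using the file names from the ls listing above (illustrative code only, not part of metaseq; it assumes the directory portion of the path contains no dots):

    # Map the name load_checkpoint_to_cpu builds (reshard-model_part-N.pt)
    # onto the *-shard0.pt files that the OPT-66B release actually ships.
    def to_shard0_path(local_path: str) -> str:
        return local_path.split('.')[0] + '-shard0.pt'

    assert to_shard0_path('66b/reshard-model_part-0.pt') == '66b/reshard-model_part-0-shard0.pt'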

Running the following

NCCL_SHM_DISABLE=1 NCCL_DEBUG=INFO python -m metaseq.scripts.convert_to_singleton 66b/

is taking a long time (22 hours and counting). Initially nvidia-smi looked like this (screenshot: Screen Shot 2022-10-12 at 2 07 52 PM); then the process on GPU 5 terminated first, and it has been in the following state for hours:

$ nvidia-smi
Thu Oct 13 19:24:37 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:16.0 Off |                    0 |
| N/A   54C    P0    74W / 300W |  20049MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:00:17.0 Off |                    0 |
| N/A   53C    P0    72W / 300W |  20133MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:00:18.0 Off |                    0 |
| N/A   52C    P0    73W / 300W |  19845MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:00:19.0 Off |                    0 |
| N/A   50C    P0    70W / 300W |  19857MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  Off  | 00000000:00:1A.0 Off |                    0 |
| N/A   54C    P0    76W / 300W |  20073MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  Off  | 00000000:00:1B.0 Off |                    0 |
| N/A   47C    P0    44W / 300W |   1413MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  Off  | 00000000:00:1C.0 Off |                    0 |
| N/A   50C    P0    72W / 300W |  19977MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   54C    P0    69W / 300W |  19905MiB / 32768MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1335      C   python                          19788MiB |
|    1   N/A  N/A      1419      C   ...onda/envs/user/bin/python    19872MiB |
|    2   N/A  N/A      1420      C   ...onda/envs/user/bin/python    19584MiB |
|    3   N/A  N/A      1421      C   ...onda/envs/user/bin/python    19596MiB |
|    4   N/A  N/A      1422      C   ...onda/envs/user/bin/python    19812MiB |
|    6   N/A  N/A      1424      C   ...onda/envs/user/bin/python    19716MiB |
|    7   N/A  N/A      1425      C   ...onda/envs/user/bin/python    19644MiB |
+-----------------------------------------------------------------------------+

Is there something obviously wrong here, or is there something else I should try? Just in case it really does take this long, I've left it running. The last few INFO-level log lines look like this:

(...)
i-0b2d24dbd20c27dd0:1422:3388 [4] NCCL INFO Channel 14 : 4[1a0] -> 2[180] via P2P/indirect/6[1c0]
i-0b2d24dbd20c27dd0:1423:3387 [5] NCCL INFO Channel 14 : 5[1b0] -> 3[190] via P2P/indirect/1[170]
i-0b2d24dbd20c27dd0:1419:3383 [1] NCCL INFO Channel 14 : 1[170] -> 7[1d0] via P2P/indirect/3[190]
i-0b2d24dbd20c27dd0:1422:3388 [4] NCCL INFO Channel 07 : 4[1a0] -> 3[190] via P2P/indirect/0[160]
i-0b2d24dbd20c27dd0:1335:3382 [0] NCCL INFO Channel 07 : 0[160] -> 7[1d0] via P2P/indirect/4[1a0]
i-0b2d24dbd20c27dd0:1422:3388 [4] NCCL INFO Channel 15 : 4[1a0] -> 3[190] via P2P/indirect/0[160]
i-0b2d24dbd20c27dd0:1335:3382 [0] NCCL INFO Channel 15 : 0[160] -> 7[1d0] via P2P/indirect/4[1a0]
i-0b2d24dbd20c27dd0:1419:3383 [1] NCCL INFO comm 0x7f5f78003090 rank 1 nranks 8 cudaDev 1 busId 170 - Init COMPLETE
i-0b2d24dbd20c27dd0:1420:3386 [2] NCCL INFO comm 0x7f7408003090 rank 2 nranks 8 cudaDev 2 busId 180 - Init COMPLETE
i-0b2d24dbd20c27dd0:1422:3388 [4] NCCL INFO comm 0x7fdfc8003090 rank 4 nranks 8 cudaDev 4 busId 1a0 - Init COMPLETE
i-0b2d24dbd20c27dd0:1335:3382 [0] NCCL INFO comm 0x7f5b60003090 rank 0 nranks 8 cudaDev 0 busId 160 - Init COMPLETE
i-0b2d24dbd20c27dd0:1424:3384 [6] NCCL INFO comm 0x7fd82c003090 rank 6 nranks 8 cudaDev 6 busId 1c0 - Init COMPLETE
i-0b2d24dbd20c27dd0:1423:3387 [5] NCCL INFO comm 0x7fd544003090 rank 5 nranks 8 cudaDev 5 busId 1b0 - Init COMPLETE
i-0b2d24dbd20c27dd0:1421:3389 [3] NCCL INFO comm 0x7f9c64003090 rank 3 nranks 8 cudaDev 3 busId 190 - Init COMPLETE
i-0b2d24dbd20c27dd0:1425:3385 [7] NCCL INFO comm 0x7f3fe0003090 rank 7 nranks 8 cudaDev 7 busId 1d0 - Init COMPLETE
i-0b2d24dbd20c27dd0:1335:1335 [0] NCCL INFO Launch mode Parallel

What’s your environment?

  • metaseq Version: 7828d72815a9a581ab47b95876d38cb262741883 (Oct 5 main)
  • PyTorch Version: 1.12.1+cu113
  • OS: Ubuntu 18.04.6 LTS
  • How you installed metaseq: pip
  • Build command you used (if compiling from source): N.A.
  • Python version: 3.10
  • CUDA/cuDNN version: CUDA 11.8
  • GPU models and configuration: 8 x V100 SXM2 32 GB

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 38 (25 by maintainers)

Top GitHub Comments

2 reactions
EIFY commented, Oct 27, 2022

@punitkoura 517d7ad indeed works 🎉:

$ git checkout remotes/origin/punitkoura/debug-407
M       metaseq/service/constants.py
Previous HEAD position was 8500e88 Add logging
HEAD is now at 517d7ad Add localhost
$ 
$ metaseq-api-local
cfg.distributed_training = {'_name': None, 'distributed_world_size': 8, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': 'tcp://localhost:13000', 'distributed_port': 13000, 'device_id': 0, 'distributed_no_spawn': False, 'ddp_backend': 'fully_sharded', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'fast_stat_sync': False, 'broadcast_buffers': False, 'zero_sharding': 'none', 'fp16': False, 'memory_efficient_fp16': False, 'bf16': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 'use_sharded_state': True, 'gradient_predivide_factor': None, 'distributed_num_procs': 8}
2022-10-27 05:06:21 | INFO | metaseq.distributed.utils | initialized host i-050656823f6e88c4b as rank 0
2022-10-27 05:06:21 | INFO | metaseq.distributed.utils | initialized host i-050656823f6e88c4b as rank 1
2022-10-27 05:06:21 | INFO | metaseq.distributed.utils | initialized host i-050656823f6e88c4b as rank 6
2022-10-27 05:06:21 | INFO | metaseq.distributed.utils | initialized host i-050656823f6e88c4b as rank 5
2022-10-27 05:06:21 | INFO | metaseq.distributed.utils | initialized host i-050656823f6e88c4b as rank 3
2022-10-27 05:06:21 | INFO | metaseq.distributed.utils | initialized host i-050656823f6e88c4b as rank 2
2022-10-27 05:06:21 | INFO | metaseq.distributed.utils | initialized host i-050656823f6e88c4b as rank 4
2022-10-27 05:06:21 | INFO | metaseq.distributed.utils | initialized host i-050656823f6e88c4b as rank 7
In distributed utils - cfg.common.model_parallel_size = 8
> initializing tensor model parallel with size 8
> initializing pipeline model parallel with size 1
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 2719 and data parallel seed: 1
2022-10-27 05:06:25 | INFO | metaseq.hub_utils | loading model(s) from /home/jason_chou/redspot_home/66b/reshard.pt
In load_model_ensemble_and_task filenames = ['/home/jason_chou/redspot_home/66b/reshard.pt'] arg_overrides = {} suffix = -model_part-0
Inside load_checkpoint_to_cpu path = /home/jason_chou/redspot_home/66b/reshard-model_part-0.pt arg_overrides = {}
Inside get_paths_to_load local_path = /home/jason_chou/redspot_home/66b/reshard-model_part-0.pt suffix = shard checkpoint_files = ['/home/jason_chou/redspot_home/66b/reshard-model_part-0.pt']
2022-10-27 05:10:48 | INFO | metaseq.checkpoint_utils | Done reading from disk
2022-10-27 05:10:52 | INFO | metaseq.checkpoint_utils | Done loading state dict
2022-10-27 05:10:52 | INFO | metaseq.cli.interactive | loaded model 0
2022-10-27 05:10:55 | INFO | metaseq.cli.interactive | Worker engaged! 172.21.41.241:6010
 * Serving Flask app 'metaseq.cli.interactive_hosted' (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
2022-10-27 05:10:55 | INFO | werkzeug | WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:6010
 * Running on http://172.21.41.241:6010
2022-10-27 05:10:55 | INFO | werkzeug | Press CTRL+C to quit
2022-10-27 05:11:16 | INFO | metaseq.hub_utils | Preparing generator with settings {'_name': None, 'beam': 1, 'nbest': 1, 'max_len_a': 0, 'max_len_b': 70, 'min_len': 42, 'sampling': True, 'sampling_topp': 0.9, 'temperature': 1.0, 'no_seed_provided': False, 'buffer_size': 4194304, 'input': '-'}
2022-10-27 05:11:16 | INFO | metaseq.hub_utils | Executing generation on input tensor size torch.Size([1, 38])
2022-10-27 05:11:18 | INFO | metaseq.hub_utils | Total time: 1.235 seconds; generation time: 1.228
2022-10-27 05:11:18 | INFO | werkzeug | 127.0.0.1 - - [27/Oct/2022 05:11:18] "POST /completions HTTP/1.1" 200 -

I have checked the generated tokens and they look reasonable.
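
For completeness, a minimal request against the local server looks something like this (the endpoint and port are taken from the log above; the payload field names are my assumption, so check metaseq/cli/interactive_hosted.py for the exact schema your checkout expects):

    # Minimal smoke test against the locally hosted completions endpoint.
    # The JSON field names below are assumed, not verified against metaseq.
    import requests

    resp = requests.post(
        "http://127.0.0.1:6010/completions",
        json={"prompt": "The sky is", "max_tokens": 32, "temperature": 1.0, "top_p": 0.9},
        timeout=120,
    )
    print(resp.status_code, resp.json())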

1 reaction
punitkoura commented, Oct 27, 2022

@tangbinh The branching happens here: https://github.com/facebookresearch/metaseq/blob/main/metaseq/distributed/utils.py#L42. If a distributed port is specified, we assume a Slurm configuration. If it is unspecified, we go down the if/else tree and correctly infer single-node init here: https://github.com/facebookresearch/metaseq/blob/main/metaseq/distributed/utils.py#L53
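
In other words, the selection logic is roughly the following (a simplified sketch; the function name and return values are illustrative, see metaseq/distributed/utils.py for the real code):

    # Simplified illustration of how the distributed init mode is chosen.
    def choose_init(distributed_port: int, world_size: int) -> str:
        if distributed_port > 0:
            # A port was specified: assume a Slurm-style multi-node job, so
            # the master address is derived from the Slurm environment.
            return "slurm"
        if world_size > 1:
            # No port given: treat it as a single-node job and initialize
            # against tcp://localhost on a free port.
            return "single-node"
        return "no distributed init"

    print(choose_init(distributed_port=0, world_size=8))  # -> "single-node"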
