
CUDA_VISIBLE_DEVICES isn't being respected / hostfile doesn't quite work for one node


I’m trying to experiment with DeepSpeed on a single GPU, and it’s not respecting CUDA_VISIBLE_DEVICES.

I run the script as:

CUDA_VISIBLE_DEVICES=1 deepspeed --num_gpus=1 ./finetune_trainer.py ...

but it runs on GPU 0, ignoring CUDA_VISIBLE_DEVICES=1.

Then I tried to use the deepspeed launcher flags as explained at https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node and ran into multiple issues:

  1. I think the --hostfile command-line arg in the example is in the wrong place. Shouldn’t it come right after deepspeed rather than among the client’s args? That is, instead of:

deepspeed <client_entry.py> <client args> \
  --deepspeed --deepspeed_config ds_config.json --hostfile=myhostfile

it should be:

deepspeed --hostfile=myhostfile <client_entry.py> <client args> \
  --deepspeed --deepspeed_config ds_config.json

This is a launcher arg, not a client arg.

  2. It can’t handle a hostfile with a single entry:
$ cat hostfile
worker-1 slots=2
$ deepspeed --hostfile hostfile  ./finetune_trainer.py ...
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/bin/deepspeed", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/bin/deepspeed", line 6, in <module>
    main()
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 259, in main
    resource_pool = fetch_hostfile(args.hostfile)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 133, in fetch_hostfile
    raise err
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 127, in fetch_hostfile
    hostname, slots = line.split()
ValueError: not enough values to unpack (expected 2, got 0)
  3. It can’t handle exclusions or inclusions without a hostfile (the docs are misleading here). Copy-and-pasting the very last code example from the docs:
$ deepspeed --exclude="worker-1:0" ./finetune_trainer.py
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/bin/deepspeed", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/bin/deepspeed", line 6, in <module>
    main()
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 272, in main
    active_resources = parse_inclusion_exclusion(resource_pool,
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 240, in parse_inclusion_exclusion
    return parse_resource_filter(active_resources,
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 187, in parse_resource_filter
    raise ValueError("Hostname '{}' not found in hostfile".format(hostname))
ValueError: Hostname 'worker-1' not found in hostfile

I think the docs are wrong/misleading. They say:

You can instead include or exclude specific resources using the --include and --exclude flags. For example, to use all available resources except GPU 0 on node worker-2 and GPUs 0 and 1 on worker-3:

  • But they don’t mention that a hostfile is actually needed.
  • The error message is also misleading: which hostfile is it talking about? I haven’t passed any hostfile in this experiment, and if it picked one up from the current dir, that hostfile does contain worker-1 (see cat hostfile above). So it shouldn’t just say “in hostfile” but “in /path/to/hostfile”.
  • In this particular situation I think it should say something like: “a hostfile hasn’t been provided and it’s required”.
  4. This is not the right solution anyway, since it tries to ssh to worker-1:
subprocess.CalledProcessError: Command '['ssh worker-1 hostname -I']' returned non-zero exit status 255.

So how does one configure deepspeed to use a specific GPU on a single node?

Thank you!

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

9 reactions
jeffra commented, Jan 14, 2021

Thanks for reporting this @stas00. There are a few different things going on here; let me try to address all of them (if I miss one, please let me know, haha).

If you do not provide the deepspeed launcher with a hostfile (via -H/--hostfile), it will only launch processes on the local node. We try to discover all the available GPUs on the box via torch.cuda.device_count(); see here for more details on this logic. Going forward, the local node is referred to as localhost.
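
Roughly, that discovery step boils down to something like the following simplified sketch (illustrative only, not our actual launcher code; the helper name is made up):

import torch

def build_local_resource_pool():
    # Hypothetical helper: with no hostfile, treat the local machine as a
    # single node named "localhost" and count whatever GPUs torch can see.
    num_gpus = torch.cuda.device_count()
    if num_gpus == 0:
        raise RuntimeError("No CUDA devices found on the local node")
    return {"localhost": num_gpus}

print(build_local_resource_pool())  # e.g. {'localhost': 2} on a 2-GPU box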

The deepspeed launcher was primarily written to simplify multi-node training in interactive training environments. It supports arbitrary inclusion/exclusion of gpus/nodes. For example, I may want to exclude just the 3rd gpu on node 5 (maybe it has ECC errors) out of a total of 128 gpus and 32 nodes. This would be achieved via deepspeed --exclude worker-5:3 train.py.

In order to support this arbitrary inclusion/exclusion our launcher sets the appropriate CUDA_VISIBLE_DEVICES at process launch time on each node. This means that if the user sets their own CUDA_VISIBLE_DEVICES on the launching node it’s not clear if they want to set this value on the local node or all nodes. We should update our docs to make this more clear though.
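
Conceptually, the per-node masking works something like this simplified sketch (the function below is illustrative, not our real internals):

def visible_devices_for_node(node, slots, exclude=None):
    # Build the CUDA_VISIBLE_DEVICES string for one node, given its GPU count
    # and an exclusion map like {"worker-5": {3}} ("skip GPU 3 on worker-5").
    excluded = (exclude or {}).get(node, set())
    return ",".join(str(i) for i in range(slots) if i not in excluded)

# Excluding GPU 3 on worker-5 (as in `deepspeed --exclude worker-5:3 train.py`):
print(visible_devices_for_node("worker-5", 4, exclude={"worker-5": {3}}))  # -> 0,1,2
# Every other node keeps all of its GPUs visible:
print(visible_devices_for_node("worker-2", 4, exclude={"worker-5": {3}}))  # -> 0,1,2,3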

If you wanted to do something like CUDA_VISIBLE_DEVICES=1 deepspeed --num_gpus=1 ./finetune_trainer.py ... I would recommend running this as deepspeed --include localhost:1 ./finetune_trainer.py ....
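
To make that mapping concrete: a filter like localhost:1 ends up doing roughly the following on the local node (a simplified, hypothetical sketch, not the launcher itself):

import os
import subprocess

def launch_with_include(filter_spec, cmd):
    # "localhost:1" keeps only GPU 1 visible on the local node, which is
    # equivalent to launching the training script with CUDA_VISIBLE_DEVICES=1.
    node, _, gpus = filter_spec.partition(":")
    assert node == "localhost", "this sketch only handles the local node"
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpus)
    subprocess.run(cmd, env=env, check=True)

# launch_with_include("localhost:1", ["python", "./finetune_trainer.py"])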

Now, if you want to use a hostfile to define your node list (even for just one node), you can do that as you have. However, after seeing your ValueError stack trace, I think you may have a trailing newline at the end of the file. It seems our code was not tested with this case, and it causes a crash. I’ve submitted a PR to handle this case gracefully (see #669).
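
The gist of the fix is simply to skip blank lines while parsing the hostfile, along these lines (a simplified sketch, not the actual PR code):

def parse_hostfile(path):
    # Parse "hostname slots=N" lines, tolerating blank lines and comments so a
    # trailing newline no longer triggers "not enough values to unpack".
    resource_pool = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            hostname, slots = line.split()
            resource_pool[hostname] = int(slots.split("=")[1])
    return resource_pool

# With "worker-1 slots=2\n" (note the trailing newline) this returns {'worker-1': 2}.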

I think one of the key missing pieces in our docs is our decision to refer to the local node as localhost if no hostfile is given. So in the case of deepspeed --exclude="worker-1:0" ./finetune_trainer.py, if you added your above hostfile, I think it should work the way you want.

0 reactions
skpig commented, Feb 11, 2022

@stas00 Thank you for your advice. I have already opened a new issue #1761.
