CUDA_VISIBLE_DEVICES isn't being respected / hostfile doesn't quite work for one node
I’m trying to experiment with DeepSpeed on a single GPU and it’s not respecting CUDA_VISIBLE_DEVICES.
I run the script as:
CUDA_VISIBLE_DEVICES=1 deepspeed --num_gpus=1 ./finetune_trainer.py ...
but it runs on GPU 0 ignoring CUDA_VISIBLE_DEVICES=1
Then I tried to use the deepspeed launcher flags as explained here: https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node and encountered multiple issues:
- I think the `--hostfile` cl arg in the example is in the wrong place. Shouldn’t it come right after `deepspeed` and not in the client’s args? That is, instead of:
deepspeed <client_entry.py> <client args> \
--deepspeed --deepspeed_config ds_config.json --hostfile=myhostfile
it should be:
deepspeed --hostfile=myhostfile <client_entry.py> <client args> \
--deepspeed --deepspeed_config ds_config.json
This is a launcher arg and not a client arg.
- it can’t handle a hostfile with one entry:
$ cat hostfile
worker-1 slots=2
$ deepspeed --hostfile hostfile ./finetune_trainer.py ...
Traceback (most recent call last):
File "/home/stas/anaconda3/envs/main-38/bin/deepspeed", line 7, in <module>
exec(compile(f.read(), __file__, 'exec'))
File "/mnt/nvme1/code/github/00optimize/DeepSpeed/bin/deepspeed", line 6, in <module>
main()
File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 259, in main
resource_pool = fetch_hostfile(args.hostfile)
File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 133, in fetch_hostfile
raise err
File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 127, in fetch_hostfile
hostname, slots = line.split()
ValueError: not enough values to unpack (expected 2, got 0)
- it can’t handle exclusions or inclusions without a hostfile (the docs are misleading). Copy-n-pasting the very last code example from the docs:
$ deepspeed --exclude="worker-1:0" ./finetune_trainer.py
Traceback (most recent call last):
File "/home/stas/anaconda3/envs/main-38/bin/deepspeed", line 7, in <module>
exec(compile(f.read(), __file__, 'exec'))
File "/mnt/nvme1/code/github/00optimize/DeepSpeed/bin/deepspeed", line 6, in <module>
main()
File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 272, in main
active_resources = parse_inclusion_exclusion(resource_pool,
File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 240, in parse_inclusion_exclusion
return parse_resource_filter(active_resources,
File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 187, in parse_resource_filter
raise ValueError("Hostname '{}' not found in hostfile".format(hostname))
ValueError: Hostname 'worker-1' not found in hostfile
I think the docs are wrong/misleading - they suggest:
You can instead include or exclude specific resources using the --include and --exclude flags. For example, to use all available resources except GPU 0 on node worker-2 and GPUs 0 and 1 on worker-3:
- but they don’t specify that the hostfile is actually needed.
- and the error message is misleading, since which `hostfile` is it talking about? I haven’t passed it any hostfile in this experiment, and if it found one in the current dir, that hostfile does have `worker-1` in it (see `cat hostfile` earlier). So it should not just say “in hostfile” but “in /path/to/hostfile”.
- I think in this particular situation it should say: “hostfile hasn’t been provided and it’s required”
- and this is not the right solution anyway, since it tries to ssh to worker-1:
subprocess.CalledProcessError: Command '['ssh worker-1 hostname -I']' returned non-zero exit status 255.
So how does one configure deepspeed to use a specific GPU on a single node?
Thank you!
Top GitHub Comments
Thanks for reporting this @stas00. There are a few different things going on here, let me try and address all of them here (if I miss one please let me know haha).
If you do not provide the deepspeed launcher with a hostfile (via `-H/--hostfile`) it will only launch processes within that local node. We will try to discover all the available GPUs on the box via `torch.cuda.device_count()`, see here for more details on this logic. Going forward the local node would be referred to as `localhost`.

The deepspeed launcher was primarily written to simplify multi-node training in interactive training environments. It supports arbitrary inclusion/exclusion of gpus/nodes. For example, I may want to exclude just the 3rd gpu on node 5 (maybe it has ECC errors) out of a total of 128 gpus and 32 nodes. This would be achieved via `deepspeed --exclude worker-5:3 train.py`.
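For intuition, here is a rough sketch of how a `host:gpu-list` filter string could be turned into a map of hosts to GPU indices. This is illustrative only, not DeepSpeed’s actual parser (which lives in `deepspeed/launcher/runner.py`); the `@` host separator here is an assumption based on the multi-host examples in the docs.

```python
# Illustrative sketch only -- not DeepSpeed's real filter parser.
# Turns a string like "worker-2:0,1@worker-3:0" into {host: [gpu, ...]}.
def parse_filter(filter_str):
    filtered = {}
    for entry in filter_str.split("@"):  # assumed host separator
        if ":" in entry:
            host, gpus = entry.split(":", 1)
            filtered[host] = [int(g) for g in gpus.split(",")]
        else:
            # a bare hostname means "every GPU on that host"
            filtered[entry] = []
    return filtered

print(parse_filter("worker-5:3"))               # {'worker-5': [3]}
print(parse_filter("worker-2:0,1@worker-3:0"))  # {'worker-2': [0, 1], 'worker-3': [0]}
```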
In order to support this arbitrary inclusion/exclusion our launcher sets the appropriate `CUDA_VISIBLE_DEVICES` at process launch time on each node. This means that if the user sets their own `CUDA_VISIBLE_DEVICES` on the launching node it’s not clear if they want to set this value on the local node or on all nodes. We should update our docs to make this more clear though.

If you wanted to do something like `CUDA_VISIBLE_DEVICES=1 deepspeed --num_gpus=1 ./finetune_trainer.py ...` I would recommend running this as `deepspeed --include localhost:1 ./finetune_trainer.py ...`.
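To make that concrete, here is a minimal sketch of what such a launch roughly amounts to on a single node: the launcher, not the user, exports `CUDA_VISIBLE_DEVICES` for the worker processes it spawns. This is an illustration of the mechanism described above, not DeepSpeed’s actual launcher code, and the `LOCAL_RANK` environment variable is only an assumed stand-in for however ranks are actually handed to workers.

```python
# Illustrative sketch only -- not DeepSpeed's launcher. It shows the idea:
# the launcher exports CUDA_VISIBLE_DEVICES for the processes it spawns,
# so `--include localhost:1` ends up pinning the job to physical GPU 1.
import os
import subprocess

def launch_on_localhost(script, script_args, gpu_ids):
    env = os.environ.copy()
    # e.g. gpu_ids == [1] for "--include localhost:1"
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    procs = []
    for local_rank in range(len(gpu_ids)):
        proc_env = dict(env, LOCAL_RANK=str(local_rank))  # assumed rank hand-off
        procs.append(subprocess.Popen(["python", script, *script_args], env=proc_env))
    for p in procs:
        p.wait()

# roughly what `deepspeed --include localhost:1 ./finetune_trainer.py ...` does locally
launch_on_localhost("./finetune_trainer.py",
                    ["--deepspeed", "--deepspeed_config", "ds_config.json"],
                    [1])
```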
Now if you wanted to use a `hostfile` to define your node list (even just 1 node) you can do that like you have. However, after seeing your ValueError stack trace I think you may have a trailing new-line at the end of the file. It seems our code was not tested with this case and it causes it to crash. I’ve submitted a PR to handle this case gracefully (see #669).
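For reference, a trailing newline produces an empty line whose `split()` yields nothing, which matches the ValueError above. A minimal sketch of more tolerant hostfile parsing (illustrative only; not the code from DeepSpeed or from the referenced PR) might simply skip blank and comment lines:

```python
# Minimal sketch of tolerant hostfile parsing -- illustrative only, not the
# code from deepspeed/launcher/runner.py or the referenced PR.
def fetch_hostfile(path):
    resource_pool = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines (e.g. a trailing newline) and comments
            hostname, slots = line.split()              # "worker-1 slots=2"
            resource_pool[hostname] = int(slots.split("=")[1])
    return resource_pool

print(fetch_hostfile("hostfile"))  # e.g. {'worker-1': 2}
```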
I think maybe one of the key missing pieces in our doc is our decision to reference the local node as `localhost` if no hostfile is given. So in the case of `deepspeed --exclude="worker-1:0" ./finetune_trainer.py`, if you added your above `hostfile` I think it should work the way you want.

@stas00 Thank you for your advice. I have already opened a new issue #1761.