[BUG] ThetaGPU installation and run instructions
Describe the bug

I will start building up a list of problems I encounter when attempting to run through the new Ray-based installation guide and tutorial for ThetaGPU:
- https://deephyper.readthedocs.io/en/latest/install/thetagpu.html
- https://deephyper.readthedocs.io/en/latest/user_guides/thetagpu.html
Issues

Installation

- `source /lus/theta-fs0/software/thetagpu/conda/tf_master/2020-11-11/mconda3/setup.sh` should probably be replaced with `module load conda/tensorflow`, unless you specifically need this older version of TensorFlow, in which case you could do `conda/tensorflow/2020-11-11`. This is more portable (especially for non-Bash shells).
- New users might not know that `$PROJECTNAME` is by default unset in `cd /lus/theta-fs0/projects/$PROJECTNAME`.
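The two installation fixes above could be combined into a short setup fragment like the following. This is a hedged sketch, not the official guide's script: `module load` only works on ThetaGPU nodes, and the project name `fusiondl_aesp` is just an example taken from the log later in this issue.

```shell
# Sketch of the suggested setup steps (assumption: run on a ThetaGPU node).
module load conda/tensorflow      # or: module load conda/tensorflow/2020-11-11
                                  # if the older TF build is specifically needed

# $PROJECTNAME is NOT set for you -- export it explicitly first:
export PROJECTNAME=fusiondl_aesp  # example; use your ALCF project's short name
cd /lus/theta-fs0/projects/$PROJECTNAME
```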
Running

- Same comment about replacing `source ...` with `module load conda/tensorflow` in `SetUp.sh`.
- For `SingleNodeRayCluster.sh`, maybe specify that `GPUS_PER_TASK=8` must be set to 1 if the `single-gpu` queue is used.
- Both scripts call `SetUpEnv.sh`, but the doc says to name the script `SetUp.sh` earlier.
- Add `/` to `ACTIVATE_PYTHON_ENV="${CURRENT_DIR}SetUpEnv.sh"`.
- Are the following warnings and diagnostics harmless and expected? Might want to note that, if so:
```
➜ fusiondl_aesp ./SingleNodeRayCluster.sh
Script to activate Python env: /lus/theta-fs0/projects/fusiondl_aesp/SetUp.sh
IP Head: 10.230.2.193:6379
Starting HEAD at thetagpu05
➜ fusiondl_aesp /lus/theta-fs0/projects/fusiondl_aesp/dhgpu/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning:
Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`.
Please update your install command.
  warnings.warn(
Local node IP: 10.230.2.193
2021-06-08 15:33:32,560 INFO services.py:1272 -- View the Ray dashboard at http://127.0.0.1:8265
2021-06-08 15:33:32,561 WARNING services.py:1730 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 8589934592 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.

--------------------
Ray runtime started.
--------------------

Next steps
  To connect to this Ray runtime from another node, run
    ray start --address='10.230.2.193:6379' --redis-password='5241590000000000'

  Alternatively, use the following Python code:
    import ray
    ray.init(address='auto', _redis_password='5241590000000000')

  If connection fails, check your firewall settings and network configuration.

  To terminate the Ray runtime, run
    ray stop

--block
  This command will now block until terminated by a signal.
  Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly.
```
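Two of the fixes suggested above, plus the arithmetic behind the `/dev/shm` warning in the log, can be sketched in shell. This is illustrative only: `QUEUE` is a hypothetical variable (the real `SingleNodeRayCluster.sh` may detect the queue differently), and the 1 TiB RAM figure is an assumed example, not the actual ThetaGPU node size.

```shell
#!/usr/bin/env bash
# Sketch of the fixes discussed in this issue; names marked below are assumptions.

CURRENT_DIR="/lus/theta-fs0/projects/fusiondl_aesp"   # example project dir

# Missing-slash fix: without "/", the path collapses to ...fusiondl_aespSetUpEnv.sh
ACTIVATE_PYTHON_ENV="${CURRENT_DIR}/SetUpEnv.sh"
echo "$ACTIVATE_PYTHON_ENV"

# The single-gpu queue exposes one GPU, so the Ray head must not claim 8 there.
QUEUE="single-gpu"            # hypothetical; set from the queue you qsub into
if [[ "$QUEUE" == "single-gpu" ]]; then
  GPUS_PER_TASK=1
else
  GPUS_PER_TASK=8
fi
echo "GPUS_PER_TASK=$GPUS_PER_TASK"

# The /dev/shm warning: Ray wants the object store to get more than ~30% of RAM.
shm_bytes=8589934592                # 8 GiB, the value in the log above
ram_bytes=$(( 1024 * 1024**3 ))     # assumed 1 TiB node RAM, for illustration
threshold=$(( ram_bytes * 30 / 100 ))
if (( shm_bytes > threshold )); then
  echo "/dev/shm is large enough"
else
  echo "/dev/shm too small: Ray falls back to /tmp"
fi
```

With these example values, the script prints the corrected path, `GPUS_PER_TASK=1`, and the "too small" branch — matching the warning seen on the node.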
Issue Analytics

- Created 2 years ago
- Comments: 5 (5 by maintainers)
Top GitHub Comments

Yes, it is by default empty; that is what I thought might be worth warning users about. E.g. add "where you set `$PROJECTNAME` to the short name of your ALCF project".

Did you hear back from ALCF support about `/dev/shm`? Would be good to get everything working on `single-gpu`.

The module was updated with DeepHyper 0.3.0 on ThetaGPU, and new install documentation was published here.