
[BUG] ThetaGPU installation and run instructions


Describe the bug

I will start building up a list of problems I encounter when attempting to run through the new Ray-based installation guide and tutorial for ThetaGPU:

Issues

Installation

  • source /lus/theta-fs0/software/thetagpu/conda/tf_master/2020-11-11/mconda3/setup.sh should probably be replaced with module load conda/tensorflow, unless you specifically need this older version of TensorFlow, in which case you could do module load conda/tensorflow/2020-11-11. This is more portable (especially for non-Bash shells).
  • New users might not know that $PROJECTNAME is unset by default in cd /lus/theta-fs0/projects/$PROJECTNAME (a combined sketch of both suggestions follows this list)
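
As a minimal sketch of the two suggestions above (the project short name is a placeholder you must replace with your own; the dated module name is assumed to correspond to the 2020-11-11 build referenced in the old setup.sh path):

  # Load the conda/TensorFlow environment via the module system instead of sourcing setup.sh
  module load conda/tensorflow              # or: module load conda/tensorflow/2020-11-11 for the older build
  # $PROJECTNAME is unset by default, so set it before using it in paths
  export PROJECTNAME=<your_ALCF_project_short_name>
  cd /lus/theta-fs0/projects/$PROJECTNAME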

Running

  • Same comment as above about replacing source ... with module load conda/tensorflow in SetUp.sh
  • For SingleNodeRayCluster.sh, maybe specify that GPUS_PER_TASK=8 must be set to 1 if the single-gpu queue is used
  • Both scripts call SetUpEnv.sh, but the doc says to name the script SetUp.sh earlier
  • Add a / to ACTIVATE_PYTHON_ENV="${CURRENT_DIR}SetUpEnv.sh" so the path resolves to ${CURRENT_DIR}/SetUpEnv.sh (see the sketch after the log output below)
  • Are the following warnings and diagnostics harmless and expected? If so, it might be worth noting that in the docs:
➜  fusiondl_aesp ./SingleNodeRayCluster.sh
Script to activate Python env: /lus/theta-fs0/projects/fusiondl_aesp/SetUp.sh
IP Head: 10.230.2.193:6379
Starting HEAD at thetagpu05
➜  fusiondl_aesp /lus/theta-fs0/projects/fusiondl_aesp/dhgpu/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning:
Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`.
 Please update your install command.
  warnings.warn(
Local node IP: 10.230.2.193
2021-06-08 15:33:32,560 INFO services.py:1272 -- View the Ray dashboard at http://127.0.0.1:8265
2021-06-08 15:33:32,561 WARNING services.py:1730 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 8589934592 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.

--------------------
Ray runtime started.
--------------------

Next steps
  To connect to this Ray runtime from another node, run
    ray start --address='10.230.2.193:6379' --redis-password='5241590000000000'

  Alternatively, use the following Python code:
    import ray
    ray.init(address='auto', _redis_password='5241590000000000')

  If connection fails, check your firewall settings and network configuration.

  To terminate the Ray runtime, run
    ray stop

--block
  This command will now block until terminated by a signal.
  Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly.
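
To make the path and queue suggestions above concrete, here is a hypothetical sketch of the affected lines (the actual SingleNodeRayCluster.sh may differ; only the variable names and values come from the report above):

  # Hypothetical excerpt of SingleNodeRayCluster.sh, not the real script contents
  ACTIVATE_PYTHON_ENV="${CURRENT_DIR}/SetUpEnv.sh"   # add the missing / and use the SetUpEnv.sh name consistently
  GPUS_PER_TASK=8                                    # set to 1 when submitting to the single-gpu queue
  source "$ACTIVATE_PYTHON_ENV"                      # assumed usage of the variable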

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
felker commented, Jun 14, 2021

What do you mean when saying (see following quote)? Because when I am doing echo $PROJECTNAME, the variable is empty for me.

Yes, it is empty by default; that is what I thought might be worth warning users about. E.g., add "where you set $PROJECTNAME to the short name of your ALCF project".
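
For example, using the project short name that appears in the log above (substitute your own):

  export PROJECTNAME=fusiondl_aesp
  cd /lus/theta-fs0/projects/$PROJECTNAME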

Did you hear back from ALCF support about /dev/shm? Would be good to get everything working on single-gpu.

0 reactions
Deathn0t commented, Oct 21, 2021

The module was updated to DeepHyper 0.3.0 on ThetaGPU, and new installation documentation was published here.
