
[BUG] ThetaGPU installation and run instructions


Describe the bug

I will start building up a list of problems I encounter when attempting to run through the new Ray-based installation guide and tutorial for ThetaGPU:

Issues

Installation

  • source /lus/theta-fs0/software/thetagpu/conda/tf_master/2020-11-11/mconda3/setup.sh should probably be replaced with module load conda/tensorflow, unless you specifically need this older version of TensorFlow, in which case you could do module load conda/tensorflow/2020-11-11. This is more portable (especially for non-Bash shells).
  • New users might not know that $PROJECTNAME is unset by default in cd /lus/theta-fs0/projects/$PROJECTNAME (a combined sketch of both suggestions follows this list)
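
As a minimal sketch of the two suggestions above (the project short name is a placeholder you must replace with your own; the dated module name is assumed to correspond to the 2020-11-11 build referenced in the old setup.sh path):

  # Load the conda/TensorFlow environment via the module system instead of sourcing setup.sh
  module load conda/tensorflow              # or: module load conda/tensorflow/2020-11-11 for the older build
  # $PROJECTNAME is unset by default, so set it before using it in paths
  export PROJECTNAME=<your_ALCF_project_short_name>
  cd /lus/theta-fs0/projects/$PROJECTNAME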

Running

  • Same comment as above about replacing source ... with module load conda/tensorflow in SetUp.sh
  • For SingleNodeRayCluster.sh, maybe specify that GPUS_PER_TASK=8 must be set to 1 if the single-gpu queue is used
  • Both scripts call SetUpEnv.sh, but the doc says to name the script SetUp.sh earlier
  • Add a / to ACTIVATE_PYTHON_ENV="${CURRENT_DIR}SetUpEnv.sh" so the path resolves to ${CURRENT_DIR}/SetUpEnv.sh (see the sketch after the log output below)
  • Are the following warnings and diagnostics harmless and expected? If so, it might be worth noting that in the docs:
➜  fusiondl_aesp ./SingleNodeRayCluster.sh
Script to activate Python env: /lus/theta-fs0/projects/fusiondl_aesp/SetUp.sh
IP Head: 10.230.2.193:6379
Starting HEAD at thetagpu05
➜  fusiondl_aesp /lus/theta-fs0/projects/fusiondl_aesp/dhgpu/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning:
Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`.
 Please update your install command.
  warnings.warn(
Local node IP: 10.230.2.193
2021-06-08 15:33:32,560 INFO services.py:1272 -- View the Ray dashboard at http://127.0.0.1:8265
2021-06-08 15:33:32,561 WARNING services.py:1730 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 8589934592 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.

--------------------
Ray runtime started.
--------------------

Next steps
  To connect to this Ray runtime from another node, run
    ray start --address='10.230.2.193:6379' --redis-password='5241590000000000'

  Alternatively, use the following Python code:
    import ray
    ray.init(address='auto', _redis_password='5241590000000000')

  If connection fails, check your firewall settings and network configuration.

  To terminate the Ray runtime, run
    ray stop

--block
  This command will now block until terminated by a signal.
  Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly.
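
To make the path and queue suggestions above concrete, here is a hypothetical sketch of the affected lines (the actual SingleNodeRayCluster.sh may differ; only the variable names and values come from the report above):

  # Hypothetical excerpt of SingleNodeRayCluster.sh, not the real script contents
  ACTIVATE_PYTHON_ENV="${CURRENT_DIR}/SetUpEnv.sh"   # add the missing / and use the SetUpEnv.sh name consistently
  GPUS_PER_TASK=8                                    # set to 1 when submitting to the single-gpu queue
  source "$ACTIVATE_PYTHON_ENV"                      # assumed usage of the variable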

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
felker commented, Jun 14, 2021

What do you mean when saying (see following quote)? Because when I am doing echo $PROJECTNAME, the variable is empty for me.

Yes, it is empty by default; that is what I thought might be worth warning users about. E.g., add "where you set $PROJECTNAME to the short name of your ALCF project".
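
For example, using the project short name that appears in the log above (substitute your own):

  export PROJECTNAME=fusiondl_aesp
  cd /lus/theta-fs0/projects/$PROJECTNAME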

Did you hear back from ALCF support about /dev/shm? Would be good to get everything working on single-gpu.

0 reactions
Deathn0t commented, Oct 21, 2021

The module was updated to DeepHyper 0.3.0 on ThetaGPU, and new installation documentation was published here.
