Sweeps: how to run agents using qsub job scheduling

wandb --version && python --version && uname

  • Weights and Biases version: 0.9.1
  • Python version: 3.7.7
  • Operating System: Centos 6.5

Description

Trying to run a sweep with wandb agent through qsub on a server. The wandb command line interface works from the login node but is not found by the executed job. Using the workaround from https://github.com/lukas/ml-class/issues/24 (using /path/to/my/python -m wandb.cli in place of wandb agent) fixes the first issue, but the agent then appears to execute the final training script with the system python rather than the python at /path/to/my/python.

What I Did/Longer description

Background detail: I am working on a cluster which uses qsub for job scheduling. In essence, this means you write a little submission_script.sh which generally runs a few lines that set up an environment (e.g. cd project_dir etc) and then runs your main script. To run a Python script using a particular Python environment you run it explicitly like /users/me/miniconda3/envs/my_env/bin/python -u train.py .
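As a rough illustration of the kind of submission script described above, the sketch below renders one as text. The function name `render_submission_script` and its arguments are hypothetical, and scheduler directives (queue, resources) are left out because they depend on the site.

```python
import shlex


def render_submission_script(project_dir: str, python_bin: str, script: str) -> str:
    """Return the text of a small submission script.

    All arguments are placeholders for illustration. Scheduler
    directives are site-specific and intentionally omitted.
    """
    lines = [
        "#!/bin/bash",
        # Set up the environment: move into the project directory.
        f"cd {shlex.quote(project_dir)}",
        # Run the main script with an explicit interpreter (-u: unbuffered output).
        f"{shlex.quote(python_bin)} -u {shlex.quote(script)}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        render_submission_script(
            "project_dir",
            "/users/me/miniconda3/envs/my_env/bin/python",
            "train.py",
        )
    )
```

The resulting text would be saved as submission_script.sh and handed to qsub.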

This works nicely with wandb when wandb is only used within Python (i.e. not through the command line interface, such as wandb agent).

However, I’m now trying to run sweeps. My hope was that I could simply put the line wandb agent walter/myproject/sweepid02938 into the .sh submission script. However, submitting this fails with a wandb: command not found error. This is the first mystery/bug, though I decided it was plausible that the wandb command line tool was only installed on my login node, so I followed a workaround I found that suggested using python -m wandb.cli agent in place of wandb agent if the latter didn’t work. Thus I instead used the line /users/me/miniconda3/envs/my_env/bin/python -m wandb.cli agent walter/myproject/sweepid02938 in my .sh submission script. However, I now get Python ModuleNotFound errors (e.g. on import git) which are not consistent with my environment, where this module exists. Indeed, running the above line from the login node rather than as a submission script works.
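One way to narrow down a mismatch like this is to have the job report which interpreter is running and which `python` and `wandb` its PATH would pick up. The sketch below is only a diagnostic aid. The function name `interpreter_report` is hypothetical and it uses the standard library only.

```python
import os
import shutil
import sys


def interpreter_report() -> dict:
    """Collect facts about the running Python and what PATH would select."""
    return {
        # Interpreter actually executing this script.
        "sys_executable": sys.executable,
        # First `python` found on PATH (None if there is none).
        "python_on_path": shutil.which("python"),
        # Whether a `wandb` command is visible on PATH in this shell.
        "wandb_on_path": shutil.which("wandb"),
        # PATH entries in lookup order.
        "path_entries": os.environ.get("PATH", "").split(os.pathsep),
    }


if __name__ == "__main__":
    for key, value in interpreter_report().items():
        print(f"{key}: {value}")
```

Running it once from the login node and once inside a submitted job, and comparing the two outputs, should show whether the job sees a different PATH.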

Is this a known problem? Has anyone had success running sweeps with wandb on a qsub server?

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

vanpelt commented, Jul 21, 2020 (2 reactions)

Hey @bibbygoodwin, you can override the entrypoint in a couple of ways.

  1. Instead of running wandb agent you can specify the full path to your env: /users/me/miniconda3/envs/my_env/bin/wandb agent
  2. The agent itself will use /usr/bin/env python to launch the sub process by default. You can override this in your sweep config if that doesn’t work: https://docs.wandb.com/sweeps/configuration#command
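A minimal sketch of both options, under stated assumptions: the helper names `agent_command` and `sweep_config` are hypothetical, the `method` and `parameters` values are placeholders, and the `${env}`, `${program}` and `${args}` macros are taken from the sweep configuration docs linked above, so they are worth checking against the installed wandb version.

```python
import posixpath


def agent_command(env_bin: str, sweep_path: str) -> list:
    """Option 1: call the environment's own wandb executable by full path."""
    return [posixpath.join(env_bin, "wandb"), "agent", sweep_path]


def sweep_config(python_bin: str, program: str = "train.py") -> dict:
    """Option 2: a sweep config whose `command` names the interpreter explicitly."""
    return {
        "program": program,
        "method": "random",  # placeholder search method
        "parameters": {"learning_rate": {"values": [0.001, 0.01]}},  # placeholder
        # Replace the default interpreter lookup with a fixed interpreter path.
        "command": ["${env}", python_bin, "${program}", "${args}"],
    }


if __name__ == "__main__":
    env_bin = "/users/me/miniconda3/envs/my_env/bin"
    print(" ".join(agent_command(env_bin, "walter/myproject/sweepid02938")))
    print(sweep_config(posixpath.join(env_bin, "python")))
```

The first function only builds the command line that would go into the submission script. The second returns a plain dictionary that could be written out as the sweep's configuration.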

cvphelps commented, Jul 21, 2020 (1 reaction)

Hi Walter, thanks for writing in. This is a good question, and it sounds like a complex environment issue. I’m not experienced with qsub myself, but I’m cc’ing my colleague @vanpelt who might be able to share some guidance.
