Provided condor example does not work
The example provided in the documentation, with a few lines tweaked to make it a Work Queue manager for condor, does not work:
...
wq_env_tarball='coffea-env.tar.gz'
...
# The named conda environment tarball will be transferred to each worker,
# and activated. This is useful when coffea is not installed in the remote
# machines.
'environment_file': wq_env_tarball,
...
workers = wq.Factory(
    # local runs:
    # batch_type="local",
    # manager_host_port="localhost:{}".format(wq_port)
    # with a batch system, e.g., condor.
    # (If coffea not at the installation site, then a conda
    # environment_file should be defined in the work_queue_executor_args.)
    batch_type="condor",
    manager_name=wq_manager_name,
)
(coffea-env) [bryantp@cmslpc139]/uscms_data/d3/bryantp/CMSSW_11_1_0_pre5/src% python condor_test.py
------------------------------------------------
Example Coffea Analysis with Work Queue Executor
------------------------------------------------
Master Name: -M coffea-wq-bryantp
Environment: coffea-env.tar.gz
------------------------------------------------
/uscms_data/d3/bryantp/mambaforge/envs/coffea-env/lib/python3.8/site-packages/coffea/util.py:154: FutureWarning: In coffea version v0.8.0 (target date: 31 Dec 2022), this will be an error.
(Set coffea.deprecations_as_errors = True to get a stack trace now.)
ImportError: coffea.hist is deprecated
warnings.warn(message, FutureWarning)
Listening for work queue workers on port 9123...
warning: this work queue manager is visible to the public.
warning: you should set a password with the --password option.
warning: using plain-text when communicating with workers.
warning: use encryption with a key and cert when creating the manager.
submitted preprocessing task 1
submitted preprocessing task 2
Preprocessing 0%| | 0/2 [00:00<?, ?file/s]
The job holds forever:
(coffea-env) [bryantp@cmslpc139]/uscms_data/d3/bryantp/CMSSW_11_1_0_pre5/src% cq --analyze 76541656.0
-- Schedd: lpcschedd1.fnal.gov : <131.225.188.55:9618?...
-- Schedd: lpcschedd2.fnal.gov : <131.225.188.57:9618?...
-- Schedd: lpcschedd3.fnal.gov : <131.225.188.235:9618?...
The Requirements expression for job 76541656.000 is
(TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.Cpus >= RequestCpus) && (TARGET.HasFileTransfer)
76541656.000: Job is held.
Hold reason: Cannot access initial working directory /tmp/wq-py-staging-_afccr5m/wq-factory-scratch-8p51i_4r: No such file or directory
Last successful match: Thu Sep 1 09:56:49 2022
76541656.000: Run analysis summary ignoring user priority. Of 1515 machines,
509 are rejected by your job's requirements
518 reject your job because of their own requirements
0 match and are already running your jobs
0 match but are serving other users
488 are able to run your job
I am trying to get some help with an earlier issue: https://github.com/CoffeaTeam/coffea/issues/715#issue-1351201786
The issue might have to do with how the tmp directory is created by work_queue_main but I am not sure what is going on there.
Top GitHub Comments
It seems that condor_submit_workers does assume access to /tmp. Let me submit a fix for that.
Ok, adding the scratch_dir option at least got the manager job to submit without getting stuck in hold. Unfortunately the workers submitted with condor_submit_workers still get stuck in hold.
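For anyone trying the same workaround: the hold reason above points at a staging directory under /tmp, which the schedd apparently cannot access. A minimal sketch of the idea, assuming a directory that is visible to the schedd is available, is to create the scratch directory there and hand that path to the factory. The helper name `make_shared_scratch_dir`, the `~/wq-scratch` location, and the way the path is passed to the factory (shown only in comments) are assumptions for illustration, not something confirmed in this thread.

```python
import os
import tempfile


def make_shared_scratch_dir(base_dir, prefix="wq-factory-scratch-"):
    """Create a uniquely named scratch directory under base_dir.

    base_dir should live on a filesystem that the condor schedd can
    access, unlike a node-local /tmp. (Hypothetical helper, not part
    of coffea or Work Queue.)
    """
    base_dir = os.path.abspath(os.path.expanduser(base_dir))
    os.makedirs(base_dir, exist_ok=True)
    # mkdtemp picks a random suffix, so concurrent runs do not collide.
    return tempfile.mkdtemp(prefix=prefix, dir=base_dir)


# Possible usage (assumption: the factory accepts a scratch_dir setting,
# as the comment above suggests; check your Work Queue version):
#
#   scratch = make_shared_scratch_dir("~/wq-scratch")
#   workers = wq.Factory(batch_type="condor", manager_name=wq_manager_name)
#   workers.scratch_dir = scratch
```

Note that this only addresses the factory's scratch directory. As reported above, workers submitted with condor_submit_workers may still end up held until that script stops assuming access to /tmp.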