Provided condor example does not work


The example provided in the documentation, with a few lines tweaked to run it as a Work Queue manager with condor workers, does not work:

...
wq_env_tarball='coffea-env.tar.gz'
...
    # The named conda environment tarball will be transferred to each worker,
    # and activated. This is useful when coffea is not installed in the remote
    # machines.
    'environment_file': wq_env_tarball,
...
workers = wq.Factory(
    # local runs:
    # batch_type="local",
    # manager_host_port="localhost:{}".format(wq_port)
    # with a batch system, e.g., condor.
    # (If coffea not at the installation site, then a conda
    # environment_file should be defined in the work_queue_executor_args.)
    batch_type="condor", manager_name=wq_manager_name
)
(coffea-env) [bryantp@cmslpc139]/uscms_data/d3/bryantp/CMSSW_11_1_0_pre5/src% python condor_test.py
------------------------------------------------
Example Coffea Analysis with Work Queue Executor
------------------------------------------------
Master Name: -M coffea-wq-bryantp
Environment: coffea-env.tar.gz
------------------------------------------------
/uscms_data/d3/bryantp/mambaforge/envs/coffea-env/lib/python3.8/site-packages/coffea/util.py:154: FutureWarning: In coffea version v0.8.0 (target date: 31 Dec 2022), this will be an error.
(Set coffea.deprecations_as_errors = True to get a stack trace now.)
ImportError: coffea.hist is deprecated
  warnings.warn(message, FutureWarning)
Listening for work queue workers on port 9123...
warning: this work queue manager is visible to the public.
warning: you should set a password with the --password option.
warning: using plain-text when communicating with workers.
warning: use encryption with a key and cert when creating the manager.
submitted preprocessing task 1
submitted preprocessing task 2
Preprocessing   0%|          | 0/2 [00:00<?, ?file/s]

The job stays in the held state indefinitely:

(coffea-env) [bryantp@cmslpc139]/uscms_data/d3/bryantp/CMSSW_11_1_0_pre5/src% cq --analyze 76541656.0


-- Schedd: lpcschedd1.fnal.gov : <131.225.188.55:9618?...


-- Schedd: lpcschedd2.fnal.gov : <131.225.188.57:9618?...


-- Schedd: lpcschedd3.fnal.gov : <131.225.188.235:9618?...
The Requirements expression for job 76541656.000 is

    (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.Cpus >= RequestCpus) && (TARGET.HasFileTransfer)



76541656.000:  Job is held.

Hold reason: Cannot access initial working directory /tmp/wq-py-staging-_afccr5m/wq-factory-scratch-8p51i_4r: No such file or directory

Last successful match: Thu Sep  1 09:56:49 2022


76541656.000:  Run analysis summary ignoring user priority.  Of 1515 machines,
    509 are rejected by your job's requirements
    518 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
    488 are able to run your job

I am trying to get some help with an earlier issue: https://github.com/CoffeaTeam/coffea/issues/715#issue-1351201786

The problem might be related to how work_queue_main creates its temporary directory under /tmp, but I am not sure what is going on there.
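
For anyone triaging similar holds, here is a minimal sketch that pulls the inaccessible directory out of the `--analyze` output and checks whether it sits under /tmp. The helper names `held_directory` and `is_under_tmp` are made up for illustration, and the regular expression only covers the exact hold-reason wording shown above.

```python
import re
from pathlib import PurePosixPath

# Matches only the hold-reason wording seen in the output above.
HOLD_RE = re.compile(
    r"Cannot access initial working directory (\S+): No such file or directory"
)


def held_directory(analyze_output):
    """Return the directory named in the hold reason, or None if absent."""
    match = HOLD_RE.search(analyze_output)
    return match.group(1) if match else None


def is_under_tmp(path):
    """True if the absolute POSIX path lives below /tmp."""
    parts = PurePosixPath(path).parts
    return len(parts) >= 2 and parts[0] == "/" and parts[1] == "tmp"


sample = (
    "76541656.000:  Job is held.\n\n"
    "Hold reason: Cannot access initial working directory "
    "/tmp/wq-py-staging-_afccr5m/wq-factory-scratch-8p51i_4r: "
    "No such file or directory\n"
)

directory = held_directory(sample)
if directory is not None and is_under_tmp(directory):
    print("held because of a /tmp working directory:", directory)
```

If the directory is under /tmp, that at least points at the staging/scratch location rather than at the job requirements.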

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

btovar commented, Sep 7, 2022

It seems that condor_submit_workers assumes access to /tmp. Let me submit a fix for that.

patrickbryant commented, Sep 7, 2022
work_queue_factory version 7.4.8 FINAL (released 2022-07-05 18:31:17 +0000)
        Built by conda@f730a3444de2 on 2022-07-05 18:31:17 +0000
        System: Linux f730a3444de2 5.13.0-1031-azure #37~20.04.1-Ubuntu SMP Mon Jun 13 22:51:01 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
        Configuration: --debug --prefix /uscms_data/d3/bryantp/mambaforge/envs/coffea-env --with-base-dir /uscms_data/d3/bryantp/mambaforge/envs/coffea-env --with-python3-path /uscms_data/d3/bryantp/mambaforge/envs/coffea-env --with-perl-path no --with-readline-path no --with-fuse-path no --without-system-parrot --without-system-prune --without-system-umbrella --without-system-weaver

OK, adding the scratch_dir option at least got the manager job to submit without getting stuck in hold. Unfortunately, the workers submitted with condor_submit_workers still get stuck in hold.

-- Schedd: lpcschedd5.fnal.gov : <131.225.204.62:9618?... @ 09/07/22 09:34:19
 ID          OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
30108406.0   bryantp         9/7  09:31   0+00:00:00 H  0    7.3 work_queue_worker --cores $$([TARGET.Cpus]) --memory $$([TARGET.Memory]) --disk $$([TARGET.Disk/1024]) -M coffea-wq-bryantp
30108406.1   bryantp         9/7  09:31   0+00:00:00 H  0    7.3 work_queue_worker --cores $$([TARGET.Cpus]) --memory $$([TARGET.Memory]) --disk $$([TARGET.Disk/1024]) -M coffea-wq-bryantp
30108406.2   bryantp         9/7  09:31   0+00:00:00 H  0    7.3 work_queue_worker --cores $$([TARGET.Cpus]) --memory $$([TARGET.Memory]) --disk $$([TARGET.Disk/1024]) -M coffea-wq-bryantp
30108406.3   bryantp         9/7  09:31   0+00:00:00 H  0    7.3 work_queue_worker --cores $$([TARGET.Cpus]) --memory $$([TARGET.Memory]) --disk $$([TARGET.Disk/1024]) -M coffea-wq-bryantp
30108406.4   bryantp         9/7  09:31   0+00:00:00 H  0    7.3 work_queue_worker --cores $$([TARGET.Cpus]) --memory $$([TARGET.Memory]) --disk $$([TARGET.Disk/1024]) -M coffea-wq-bryantp
30108406.5   bryantp         9/7  09:31   0+00:00:00 H  0    7.3 work_queue_worker --cores $$([TARGET.Cpus]) --memory $$([TARGET.Memory]) --disk $$([TARGET.Disk/1024]) -M coffea-wq-bryantp
30108406.6   bryantp         9/7  09:31   0+00:00:00 H  0    7.3 work_queue_worker --cores $$([TARGET.Cpus]) --memory $$([TARGET.Memory]) --disk $$([TARGET.Disk/1024]) -M coffea-wq-bryantp
30108406.7   bryantp         9/7  09:31   0+00:00:00 H  0    7.3 work_queue_worker --cores $$([TARGET.Cpus]) --memory $$([TARGET.Memory]) --disk $$([TARGET.Disk/1024]) -M coffea-wq-bryantp
30108410.0   bryantp         9/7  09:34   0+00:00:02 <  0    0.0 condor.sh ./poncho_package_run -e coffea-env.tar.gz ./work_queue_worker -M coffea-wq-bryantp -t 300 -C ''''catalog.cse.nd.edu,backup-catalog.cse.nd.edu'''' -d all -o worker.log --cores=$$([TARGET.Cpus]) --memory=$$([TARGET.Mem

...
30108406.000:  Job is held.

Hold reason: Cannot access initial working directory /tmp/bryantp-workers: No such file or directory
...
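
For reference, the scratch_dir workaround mentioned in this comment could look roughly like the sketch below. The helper `make_scratch_dir` and the `wq-scratch` directory name are hypothetical, and the commented-out `workers.scratch_dir = ...` line assumes the factory object accepts the option as an attribute; the thread only says the option was added, not how. Note this only covers the factory side. As the output above shows, workers started through condor_submit_workers were still held on /tmp/bryantp-workers.

```python
import os
import tempfile


def make_scratch_dir(base_dir, prefix="wq-factory-scratch-"):
    """Create a uniquely named scratch directory under base_dir and return it.

    base_dir should be somewhere other than /tmp, for example under the
    directory the script is launched from.
    """
    os.makedirs(base_dir, exist_ok=True)
    return tempfile.mkdtemp(prefix=prefix, dir=base_dir)


# Hypothetical usage with the snippet from the issue (not executed here):
#
# scratch = make_scratch_dir(os.path.join(os.getcwd(), "wq-scratch"))
# workers = wq.Factory(batch_type="condor", manager_name=wq_manager_name)
# workers.scratch_dir = scratch  # assumption: option exposed as an attribute
```

The directory is created up front so that a missing path shows up at submit time rather than as a held job.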