Example PBS Script
People using Dask on traditional job schedulers often depend on PBS scripts. It would be useful to include a plain example in the documentation that users can download, modify, and run.
What we do now
Currently we point users to the setup network docs, and in particular the section about using job schedulers with a shared network file system. The instructions there suggest that users submit two jobs, one for the scheduler and one for the workers:
# Start a dask-scheduler somewhere and write connection information to file
qsub -b y /path/to/dask-scheduler --scheduler-file /path/to/scheduler.json
# Start 100 dask-worker processes in an array job pointing to the same file
qsub -b y -t 1-100 /path/to/dask-worker --scheduler-file /path/to/scheduler.json
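For reference, once both jobs are running, a client session attaches through the same scheduler file. The following is a minimal sketch, assuming the same /path/to/scheduler.json used above sits on a file system visible from wherever the client runs:

# Attach to the cluster started by the two qsub commands above.
# Assumes /path/to/scheduler.json lives on the shared file system.
from dask.distributed import Client

client = Client(scheduler_file='/path/to/scheduler.json')
print(client)  # shows the scheduler address and the workers that have connected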
However, this is flawed because the scheduler or workers may start and run independently of each other. It would be better to place them into a single job, where one special node is told to run the dask-scheduler process and all other nodes are told to run dask-worker processes. Additionally, we would like to offer some guidance on tuning the number of CPUs and pointing workers at local high-speed scratch disk if available.
PBS script options
Many docs on PBS scripts exist online, but each seems to be written by the IT group of a particular supercomputer. It is difficult to tease out what is general to all systems and what is specific to a single supercomputer or job scheduler. After reading a number of pages, I've cobbled together the following example.
#!/bin/bash -login
# Configure these values to change the size of your dask cluster
#PBS -t 1-9 # Nine nodes. One scheduler and eight workers
#PBS -l ncpus=4 # Four cores per node.
#PBS -l mem=20GB # 20 GB of memory per node
#PBS -l walltime=01:00:00 # will run for at most one hour
# Environment variables
export OMP_NUM_THREADS=1
# Write ~/scheduler.json file in home directory
# connect with
# >>> from dask.distributed import Client
# >>> client = Client(scheduler_file='~/scheduler.json')
# Start scheduler on first process, workers on all others
if [[ $PBS_ARRAYID == '1' ]]; then
    dask-scheduler --scheduler-file $HOME/scheduler.json;
else
    dask-worker \
        --scheduler-file $HOME/scheduler.json \
        --nthreads $PBS_NUM_PPN \
        --local-directory $TMPDIR \
        --name worker-$PBS_ARRAYID \
        > $PBS_O_WORKDIR/$PBS_JOBID-$PBS_ARRAYID.out \
        2> $PBS_O_WORKDIR/$PBS_JOBID-$PBS_ARRAYID.err;
fi
References:
https://wiki.hpcc.msu.edu/display/hpccdocs/Advanced+Scripting+Using+PBS+Environment+Variables
http://www.pbsworks.com/documentation/support/PBSProUserGuide10.4.pdf
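As a rough sketch of how you might check that the script above actually brought up a working cluster: after submitting it with qsub and waiting for the array job to start, connect through the scheduler file it writes and wait for the workers. wait_for_workers and scheduler_info are standard dask.distributed Client methods; the count of eight simply mirrors the -t 1-9 array above, where index 1 runs the scheduler:

# Run from a login node after the array job has started.
from dask.distributed import Client

client = Client(scheduler_file='~/scheduler.json')  # the file the script writes to $HOME
client.wait_for_workers(n_workers=8)                # eight workers from array indices 2 through 9
print(len(client.scheduler_info()['workers']), 'workers connected')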
Questions
- What is the difference between ncpus and ppn?
- How about -t 1-8 and nodes=8?
Does this actually work? I suspect not. I don't have a convenient system to test this on and would appreciate trial runs by a few different groups.
Issue Analytics
- Created: 6 years ago
- Reactions: 2
- Comments: 127 (93 by maintainers)
Top GitHub Comments
And how about people who are using LSF, Slurm, or other queuing systems? I still think having your own is more flexible and more robust.
This has been resolved by the dask-jobqueue project.
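For anyone landing here now, a rough dask-jobqueue equivalent of the script above might look like the following. The exact keyword arguments can differ between dask-jobqueue versions, and the resource values simply mirror the PBS directives used earlier:

# Sketch of the same cluster expressed with dask-jobqueue's PBSCluster.
from dask.distributed import Client
from dask_jobqueue import PBSCluster

cluster = PBSCluster(
    cores=4,                    # like ncpus=4 above
    memory='20GB',              # like mem=20GB above
    walltime='01:00:00',
    local_directory='$TMPDIR',  # local scratch, like --local-directory above
)
cluster.scale(8)                # eight workers, like the eight worker array tasks
client = Client(cluster)

dask-jobqueue writes and submits the job scripts itself, so the scheduler-file plumbing and the array-index branching above are no longer needed.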