Example PBS Script
People using Dask on traditional job schedulers often depend on PBS scripts. It would be useful to include a plain example in the documentation that users can download, modify, and run.
What we do now
Currently we point users to the setup network docs, and in particular the section about using job schedulers with a shared network file system. The instructions there suggest that users submit two jobs, one for the scheduler and one for the workers:
# Start a dask-scheduler somewhere and write connection information to file
qsub -b y /path/to/dask-scheduler --scheduler-file /path/to/scheduler.json
# Start 100 dask-worker processes in an array job pointing to the same file
qsub -b y -t 1-100 /path/to/dask-worker --scheduler-file /path/to/scheduler.json
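For reference, once both jobs are running, a client session attaches through the same scheduler file. The following is a minimal sketch, assuming the same /path/to/scheduler.json used above sits on a file system visible from wherever the client runs:

# Attach to the cluster started by the two qsub commands above.
# Assumes /path/to/scheduler.json lives on the shared file system.
from dask.distributed import Client

client = Client(scheduler_file='/path/to/scheduler.json')
print(client)  # shows the scheduler address and the workers that have connected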
However, this is flawed because the scheduler or workers may start and run independently of each other. It would be better to place them into a single job, where one special node is told to run the dask-scheduler process and all other nodes are told to run dask-worker processes. Additionally, we would like to offer some guidance on tuning the number of CPUs and pointing workers at local high-speed scratch disk if available.
PBS script options
Many docs on PBS scripts exist online, but each seems to be written by the IT group of a particular supercomputer. It is difficult to tease out what is general to all systems and what is specific to a single supercomputer or job scheduler. After reading a number of pages, I've cobbled together the following example.
#!/bin/bash -login
# Configure these values to change the size of your dask cluster
#PBS -t 1-9 # Nine nodes. One scheduler and eight workers
#PBS -l ncpus=4 # Four cores per node.
#PBS -l mem=20GB # 20 GB of memory per node
#PBS -l walltime=01:00:00 # will run for at most one hour
# Environment variables
export OMP_NUM_THREADS=1
# Write ~/scheduler.json file in home directory
# connect with
# >>> from dask.distributed import Client
# >>> client = Client(scheduler_file='~/scheduler.json')
# Start scheduler on first process, workers on all others
if [[ $PBS_ARRAYID == '1' ]]; then
    dask-scheduler --scheduler-file $HOME/scheduler.json;
else
    dask-worker \
        --scheduler-file $HOME/scheduler.json \
        --nthreads $PBS_NUM_PPN \
        --local-directory $TMPDIR \
        --name worker-$PBS_ARRAYID \
        > $PBS_O_WORKDIR/$PBS_JOBID-$PBS_ARRAYID.out \
        2> $PBS_O_WORKDIR/$PBS_JOBID-$PBS_ARRAYID.err;
fi
References:
https://wiki.hpcc.msu.edu/display/hpccdocs/Advanced+Scripting+Using+PBS+Environment+Variables
http://www.pbsworks.com/documentation/support/PBSProUserGuide10.4.pdf
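As a rough sketch of how you might check that the script above actually brought up a working cluster: after submitting it with qsub and waiting for the array job to start, connect through the scheduler file it writes and wait for the workers. wait_for_workers and scheduler_info are standard dask.distributed Client methods; the count of eight simply mirrors the -t 1-9 array above, where index 1 runs the scheduler:

# Run from a login node after the array job has started.
from dask.distributed import Client

client = Client(scheduler_file='~/scheduler.json')  # the file the script writes to $HOME
client.wait_for_workers(n_workers=8)                # eight workers from array indices 2 through 9
print(len(client.scheduler_info()['workers']), 'workers connected')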
Questions
- What is the difference between ncpus and ppn?
- How about -t 1-8 and nodes=8?
Does this actually work? I suspect not. I don't have a convenient system to test this on and would appreciate trial runs by a few different groups.
Issue Analytics
- Created: 6 years ago
- Reactions: 2
- Comments: 127 (93 by maintainers)
Top GitHub Comments
And how about people who are using LSF, Slurm, or other queuing systems? I still think having your own is more flexible and more robust.
This has been resolved by the dask-jobqueue project.
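For anyone landing here now, a rough dask-jobqueue equivalent of the script above might look like the following. The exact keyword arguments can differ between dask-jobqueue versions, and the resource values simply mirror the PBS directives used earlier:

# Sketch of the same cluster expressed with dask-jobqueue's PBSCluster.
from dask.distributed import Client
from dask_jobqueue import PBSCluster

cluster = PBSCluster(
    cores=4,                    # like ncpus=4 above
    memory='20GB',              # like mem=20GB above
    walltime='01:00:00',
    local_directory='$TMPDIR',  # local scratch, like --local-directory above
)
cluster.scale(8)                # eight workers, like the eight worker array tasks
client = Client(cluster)

dask-jobqueue writes and submits the job scripts itself, so the scheduler-file plumbing and the array-index branching above are no longer needed.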