Submitting CPU Jobs

From UFAL AIC
Latest revision as of 15:49, 23 April 2024

CPU jobs should be submitted to the cpu partition.

You can submit a non-interactive job using the sbatch command. To submit an interactive job, use the srun command:

srun --pty bash

Resource specification

Specify the memory and CPU requirements of your job if they are higher than the defaults, and do not exceed them. If your job needs more than one CPU thread on a single machine for most of its run time, reserve the required number of CPU threads with the --cpus-per-task option and the memory with the --mem option.

srun -p cpu --cpus-per-task=4 --mem=8G --pty bash

This gives you an interactive shell with 4 CPU threads and 8 GB of RAM on the cpu partition.
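To stay within the reservation, a program has to be told how many threads it may use. The following is a minimal sketch, assuming an OpenMP-based program that honours OMP_NUM_THREADS (an assumption, other tools need their own option). SLURM sets SLURM_CPUS_PER_TASK only when --cpus-per-task was requested, so the sketch falls back to 1 thread otherwise.

```shell
#!/bin/bash
# SLURM sets SLURM_CPUS_PER_TASK only when --cpus-per-task (-c) was given;
# fall back to a single thread otherwise.
threads="${SLURM_CPUS_PER_TASK:-1}"

# Assumption: the program is OpenMP-based and reads OMP_NUM_THREADS.
export OMP_NUM_THREADS="$threads"
echo "Using ${threads} thread(s)"
```

Run inside the interactive shell from the srun example above, this should report 4 threads.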

Monitoring and interaction

Job monitoring

It is useful to see what is going on while a job runs. The following examples show the usage of some typical commands:

  • squeue -a - show the jobs in all partitions
  • squeue -u user - print a list of the running/waiting jobs of the given user
  • squeue -j <JOB_ID> - show information about the job with the given JOB_ID (if it is still waiting or running)
  • sinfo - print available/total resources
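The monitoring commands can also be used from a script. The following is a minimal sketch of a helper that polls squeue until a job is no longer listed. The function name wait_for_job and the 30-second default interval are illustrative choices, not part of SLURM.

```shell
#!/bin/bash
# wait_for_job JOB_ID [INTERVAL_SECONDS]
# Hypothetical helper: blocks until squeue no longer lists the job.
wait_for_job() {
  job_id="$1"
  interval="${2:-30}"
  # squeue -h suppresses the header line, so empty output means
  # that the job is neither waiting nor running any more.
  while [ -n "$(squeue -h -j "$job_id" 2>/dev/null)" ]; do
    sleep "$interval"
  done
  echo "Job ${job_id} is no longer in the queue"
}
```

A call such as wait_for_job 12345 60 (with a made-up job ID) would check once per minute. Keep the interval reasonably long so that the polling itself does not load the scheduler.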

Job interaction

  • scontrol show job JOBID - show the details of the job with the given JOBID
  • scancel JOBID - cancel the job and remove it from the queue
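Both commands need the job ID. The following is a minimal sketch of capturing it at submission time with the --parsable option of sbatch, which prints only the job ID (optionally followed by a semicolon and the cluster name). The function name submit_and_get_id and the script name job.sh are hypothetical.

```shell
#!/bin/bash
# submit_and_get_id SCRIPT [ARGS...]
# Hypothetical helper: submits a job and prints only its numeric ID.
submit_and_get_id() {
  # --parsable makes sbatch print "JOBID" or "JOBID;CLUSTER".
  out="$(sbatch --parsable "$@")" || return 1
  # Strip the optional ";CLUSTER" suffix.
  echo "${out%%;*}"
}
```

The result can then be passed on, for example: jobid=$(submit_and_get_id job.sh), followed by scontrol show job "$jobid" or scancel "$jobid".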

Selected submit options

The complete list of available options for the commands srun and sbatch can be found in the SLURM documentation. Most of the options listed here can be given either as command-line parameters or as #SBATCH directives inside a script.

 -J helloWorld         # name of job
 --chdir /job/path/    # path where the job will be executed
 -p gpu                # name of partition or queue (if not specified default partition is used)
 -q normal             # QOS level (sets priority of the job)
 -c 4                  # reserve 4 CPU threads
 --gres=gpu:1          # reserve 1 GPU card
 -o script.out         # name of output file for the job 
 -e script.err         # name of error file for the job
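As an illustration of the directive form, here is a minimal sketch of a batch script that sets several of these options. The partition cpu and the file names are taken from the examples on this page. The echo lines only stand in for a real workload, and the fallback value is there solely so that the script also runs outside SLURM.

```shell
#!/bin/bash
#SBATCH -J helloWorld
#SBATCH -p cpu
#SBATCH -c 4
#SBATCH -o script.out
#SBATCH -e script.err

# SLURM_JOB_NAME and SLURM_JOB_ID are set by SLURM inside a job.
job_name="${SLURM_JOB_NAME:-helloWorld}"

# Standard output ends up in script.out, standard error in script.err.
echo "Job ${job_name} (id ${SLURM_JOB_ID:-none}) running on $(uname -n)"
echo "This line goes to the error file" >&2
```

Saved under a name such as helloWorld.sh (hypothetical), the script would be submitted with sbatch helloWorld.sh. Options given on the command line take precedence over the directives in the script.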

Array jobs

If you need to submit a large number of similar jobs (e.g. to process a large number of input files), you should consider launching an array job.

For example, one might need to process 1000 files named file_N.txt (where N is a number from 1 to 1000). A program that can process one file is called crunchFile and it takes only one argument, the name of the file to process. Instead of calling 1000 times:

  sbatch crunchFile file_N.txt

we can write a wrapper script crunchScript.sh referring to the SLURM variable SLURM_ARRAY_TASK_ID:

  #!/bin/bash
  #SBATCH -p cpu
  #SBATCH --mem 2G
  
  # each array task processes the file that matches its task ID
  crunchFile file_${SLURM_ARRAY_TASK_ID}.txt

and submit all the jobs at once as an array job:

 sbatch --array=1-1000%20 crunchScript.sh

Here the option --array=1-1000%20 means that we want SLURM to:

  • launch 1000 instances of crunchScript.sh
  • launch each instance with SLURM_ARRAY_TASK_ID set to a different number from the specified range
  • run at most 20 tasks at once. This is useful for a large number of tasks, because it ensures that we do not flood the cluster with requests.
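A slightly more defensive variant of the wrapper script can check that it was really started as an array task and that its input file exists. The following is a minimal sketch, assuming the file_N.txt naming from the example above. The helper input_for_task is a hypothetical name, and crunchFile is the example program from this page.

```shell
#!/bin/bash
#SBATCH -p cpu
#SBATCH --mem 2G

# Hypothetical helper: build the input file name for a given task ID.
input_for_task() {
  printf 'file_%s.txt\n' "$1"
}

if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
  input="$(input_for_task "$SLURM_ARRAY_TASK_ID")"
  if [ -f "$input" ]; then
    crunchFile "$input"
  else
    echo "Missing input file: $input" >&2
    exit 1
  fi
else
  # Not started as an array task, so there is nothing to process.
  echo "SLURM_ARRAY_TASK_ID is not set; submit with sbatch --array" >&2
fi
```

Failing early on a missing file makes the affected task easy to identify in its own error log instead of letting it fail somewhere inside crunchFile.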

You can read more about array jobs in the SLURM documentation.