Submitting CPU Jobs


CPU jobs should be submitted to the cpu partition.

You can submit a non-interactive job using the sbatch command. To submit an interactive job, use the srun command:

srun --pty bash
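
A non-interactive job is usually a small shell script handed to sbatch; a minimal sketch (the script name job_script.sh is only an example, and a fuller script is sketched under "Selected submit options" below):

 sbatch job_script.sh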

Resource specification

Specify your memory and CPU requirements (if they are higher than the defaults) and do not exceed what you reserve. If your job needs more than one CPU thread (on a single machine) for most of its runtime, reserve that number of threads with the --cpus-per-task option and the memory with the --mem option.

srun -p cpu --cpus-per-task=4 --mem=8G --pty bash

This gives you an interactive shell with 4 CPU threads and 8 GB of RAM on the cpu partition.
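
The same limits apply to batch jobs; the options can be given on the sbatch command line or as #SBATCH directives inside the script. A sketch under the same assumptions (job_script.sh is illustrative):

 sbatch --cpus-per-task=4 --mem=8G job_script.sh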

Monitoring and interaction

Job monitoring

Once a job is submitted, you will want to see what is going on. The following examples show some typical monitoring commands (a short usage example follows the list):

  • squeue -a - show jobs in all partitions.
  • squeue -u <user> - list the running and waiting jobs of the given user.
  • squeue -j <JOB_ID> - show the status of the job with the given JOB_ID (if it is still in the queue).
  • sinfo - show the partitions and their available/total resources.
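
As a small usage sketch (watch is a standard Linux utility, not part of SLURM; the 10-second interval is arbitrary), you can keep an eye on your own jobs with:

 watch -n 10 squeue -u $USER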

Job interaction

  • scontrol show job <JOB_ID> - show details of the running job with the given JOB_ID.
  • scancel <JOB_ID> - delete the job from the queue (see the sketch after this list).
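
A usage sketch (the job ID 123456 is made up):

 scontrol show job 123456      # inspect the details of this job
 scancel 123456                # cancel this single job
 scancel -u $USER              # cancel all of your own jobs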

Selected submit options

The complete list of available options for the srun and sbatch commands can be found in the SLURM documentation (https://slurm.schedmd.com/man_index.html). Most of the options listed here can be given either as command-line parameters or as #SBATCH directives inside a script.

 -J helloWorld         # name of job
 -p gpu                # name of partition or queue (if not specified default partition is used)
 -q normal             # QOS level (sets priority of the job)
 -c 4                  # reserve 4 CPU threads
 --gres=gpu:1          # reserve 1 GPU card
 -o script.out         # name of output file for the job 
 -e script.err         # name of error file for the job
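
Putting several of these options together, a complete batch script for a CPU job might look like the following sketch (the script body, file names, and resource values are illustrative; the GPU-related options from the list above are omitted since this page covers CPU jobs):

 #!/bin/bash
 #SBATCH -J helloWorld         # name of the job
 #SBATCH -p cpu                # run in the cpu partition
 #SBATCH -c 4                  # reserve 4 CPU threads
 #SBATCH --mem=8G              # reserve 8 GB of RAM
 #SBATCH -o script.out         # standard output of the job
 #SBATCH -e script.err         # standard error of the job
 ./my_program                  # placeholder for your own command

Submit it with:

 sbatch job_script.sh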