Submitting CPU Jobs

CPU jobs should be submitted to the cpu.q queue.

TL;DR: You can submit a non-interactive job requiring %M% GB RAM and %C% CPUs (at most 4) by running

qsub -q cpu.q -cwd -b y -pe smp %C% -l mem_free=%M%G,act_mem_free=%M%G,h_data=%M%G path_to_binary arguments

To submit an interactive job, replace qsub with qrsh; you can then leave out path_to_binary arguments.
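To see how the %M% and %C% placeholders expand, here is a minimal sketch of a helper that only prints the resulting command instead of submitting it. The function name cpu_job_cmd and the script ./train.sh are hypothetical; only the qsub options come from the command above.

```shell
#!/bin/sh
# Print (do not run) the TL;DR qsub command for a given memory size in GB
# and number of CPU cores. cpu_job_cmd is a hypothetical helper name,
# not part of SGE.
cpu_job_cmd() {
    mem_gb=$1
    cores=$2
    shift 2
    # The AIC cluster allows at most 4 cores per job.
    if [ "$cores" -gt 4 ]; then
        echo "at most 4 cores can be requested" >&2
        return 1
    fi
    echo "qsub -q cpu.q -cwd -b y -pe smp $cores" \
         "-l mem_free=${mem_gb}G,act_mem_free=${mem_gb}G,h_data=${mem_gb}G $*"
}

# ./train.sh and its arguments are placeholders for your own job.
cpu_job_cmd 8 2 ./train.sh --epochs 10
```

Printing the command first makes it easy to check the resource request before anything reaches the queue.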

Resource specification

You should specify the memory and CPU requirements (if higher than the defaults) and not exceed them. If your job needs more than one CPU core (on a single machine) most of the time, reserve that number of CPU cores (and SGE slots) with

qsub -pe smp <number-of-CPU-cores>

The maximum on the AIC cluster is 4 cores. If your job needs e.g. up to 110% CPU most of the time and only occasionally 200%, it is OK to reserve just one core (so you don't waste resources).

If you are sure your job needs less than 1 GiB of RAM, you can skip this. Otherwise, if you need e.g. 8 GiB, you must always use qsub (or qrsh) with -l mem_free=8G. You should also specify act_mem_free with the same value and h_vmem with a possibly slightly larger value. See the Memory section below for details.

qsub -l mem_free=8G,act_mem_free=8G,h_vmem=12G
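The request above follows a simple pattern: mem_free and act_mem_free are equal and h_vmem is somewhat larger. A minimal sketch that derives the whole -l string from a single number is shown below. The helper name mem_request is hypothetical, and the 1.5x margin for h_vmem is an assumption that merely reproduces the 8G/12G example; choose your own margin.

```shell
# Build the -l resource string from a single memory size in GiB.
# mem_request is a hypothetical helper; the 1.5x h_vmem margin is an
# assumption matching the 8G/12G example, not a cluster requirement.
mem_request() {
    mem=$1
    # Integer arithmetic: round 1.5 * mem up to a whole number of GiB.
    vmem=$(( (mem * 3 + 1) / 2 ))
    echo "mem_free=${mem}G,act_mem_free=${mem}G,h_vmem=${vmem}G"
}

mem_request 8    # prints mem_free=8G,act_mem_free=8G,h_vmem=12G
```

It can then be used as qsub -l "$(mem_request 8)" followed by the rest of the command.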

Monitoring and interaction

Job monitoring

We should be able to see what is going on when we run a job. The following examples show the usage of some typical commands:

  • qstat - this way we inspect all our jobs (both waiting in the queue and scheduled, i.e. running).
  • qstat [-u user] - print a list of running/waiting jobs of a given user
  • qstat -u '*' | less - this shows the jobs of all users.
  • qstat -j 121144 - this shows detailed info about the job with this number (if it is still running).
  • qhost - print available/total resources
  • qacct -j job_id - print info even for a finished job (for which qstat -j job_id does not work). See man qacct for more.
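When many jobs are submitted, a quick summary per state is often more useful than the full listing. The sketch below counts jobs by state from qstat output. It assumes the default qstat layout (two header lines, then one row per job with the state in the fifth column); the helper name count_job_states and the sample rows are made up for illustration.

```shell
# Count jobs per state (r = running, qw = waiting, ...) from qstat output
# read on stdin. Assumes the default qstat layout: two header lines, then
# one row per job with the state in the fifth column.
count_job_states() {
    awk 'NR > 2 && NF >= 5 { n[$5]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Usage on the cluster: qstat | count_job_states
# Made-up sample rows, for illustration only:
count_job_states <<'EOF'
job-ID  prior   name       user   state submit/start at     queue              slots
------------------------------------------------------------------------------------
 121144 0.55500 job_script alice  r     06/04/2019 12:41:07 cpu.q@cpu-node13   1
 121145 0.00000 job_script alice  qw    06/04/2019 12:41:09                    1
 121146 0.00000 job_script alice  qw    06/04/2019 12:41:10                    1
EOF
```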

Output monitoring

If we need to see output produced by our job (suppose the ID is 121144), we can inspect the job's output (in our case stored in job_script.sh.o121144) with:
less job_script.sh.o*
Hint: if the job is still running, press F in less to simulate tail -f.

How to read output epilog

The epilog section contains some interesting pieces of information. However, it can get confusing sometimes.

======= EPILOG: Tue Jun 4 12:41:07 CEST 2019
== Limits:   
== Usage:    cpu=00:00:00, mem=0.00000 GB s, io=0.00000 GB, vmem=N/A, maxvmem=N/A
== Duration: 00:00:00 (0 s)
== Server name: cpu-node13
  • Limits - on this line you can see job limits specified through qsub options
  • Usage - resource usage during computation
    • cpu=HH:MM:SS - the accumulated CPU time usage
    • mem=XY GB s - gigabytes of RAM used multiplied by the duration of the job in seconds, so don't be alarmed: XY is usually a very high number (unlike in this toy example)
    • io=XY GB - the amount of data transferred in input/output operations in GB
    • vmem=XY - actual virtual memory consumption when the job finished
    • maxvmem=XY - peak virtual memory consumption
  • Duration - total execution time
  • Server name - name of the executing server
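If you want to collect the peak memory of several finished jobs, the maxvmem field can be pulled out of the epilog automatically. This is a minimal sketch assuming the epilog layout shown above; the helper name epilog_maxvmem is hypothetical.

```shell
# Print the maxvmem value from the "== Usage:" line of an epilog.
# epilog_maxvmem is a hypothetical helper; it assumes the epilog layout
# shown above and reads the job output on stdin.
epilog_maxvmem() {
    sed -n 's/^== Usage:.*maxvmem=\([^, ]*\).*/\1/p'
}

# Usage with a real job output file: epilog_maxvmem < job_script.sh.o121144
# Here it is fed the Usage line of the toy example, which reports N/A.
echo '== Usage:    cpu=00:00:00, mem=0.00000 GB s, io=0.00000 GB, vmem=N/A, maxvmem=N/A' | epilog_maxvmem
```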

Job interaction

qdel 121144 This way you can delete (kill) the job with the given number, or a comma- or space-separated list of job numbers.

qdel \* This way you can delete all your jobs. Don't be afraid - you cannot delete other users' jobs.

qalter You can change some properties of already submitted jobs (both waiting in the queue and running). Changeable properties are listed in man qsub.

Advanced usage

qsub -q cpu.q This way your job is submitted to the CPU queue which is the default. If you need GPU use gpu.q instead.

qsub -l ... See man complex (run it on aic) for a list of possible resources you may require (in addition to mem_free etc. discussed above).

qsub -p -200 Define the priority of your job as a number between -1024 and 0. Only SGE admins may use a number higher than 0. Default is set to TODO. You should ask for a lower priority (-1024..-101) if you submit many jobs at once or if the jobs are not urgent. SGE uses the priority to decide which pending job in the queue to start next (it computes a real number called prior, which is reported in qstat and grows as the job waits in the queue). Note that once a job has started, it cannot be unscheduled, so from that moment on its priority is irrelevant.

qsub -o LOG.stdout -e LOG.stderr redirect std{out,err} to separate files with given names, instead of the defaults $JOB_NAME.o$JOB_ID and $JOB_NAME.e$JOB_ID.

qsub -@ optionfile Instead of specifying all the qsub options on the command line, you can store them in a file (you can use # comments in the file).
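As an illustration, the sketch below writes a small option file and prints the options that remain once comments are stripped. The file name cpu_job.opts and the chosen values are only examples, and placing one option per line is an assumption about the file layout.

```shell
# Write a sample option file; lines starting with # are comments.
# The file name and the chosen values are illustrative only.
cat > cpu_job.opts <<'EOF'
# queue and working directory
-q cpu.q
-cwd
# two cores, 8 GiB of RAM
-pe smp 2
-l mem_free=8G,act_mem_free=8G,h_vmem=12G
EOF

# Show the options that are in effect, with comments stripped.
grep -v '^#' cpu_job.opts | tr '\n' ' '
echo
# Submit with: qsub -@ cpu_job.opts script.sh
```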

qsub -a 12312359 Execute your job no sooner than at the given time (in [YY]MMDDhhmm format). An alternative to sleep 3600 && qsub ... &.
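Rather than typing the timestamp by hand, it can be computed. The following sketch builds a start time one hour from now in the YYMMDDhhmm form and prints the resulting command; it relies on GNU date (the -d option), and script.sh is a placeholder.

```shell
# Compute a start time one hour from now in the YYMMDDhhmm form accepted
# by qsub -a. Relies on GNU date (-d); BSD date uses different flags.
start_at=$(date -d '+1 hour' +%y%m%d%H%M)

# Print the command instead of submitting; script.sh is a placeholder.
echo "qsub -a $start_at script.sh"
```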

qsub -b y Treat script.sh (or whatever is the name of the command you execute) as a binary, i.e. don't search for in-script options within the file, don't transfer it to the qmaster and then to the execution node. This makes the execution a bit faster and it may prevent some rare but hard-to-detect errors caused by SGE interpreting the script. The script must be available on the execution node via Lustre (which is our case), etc. With -b y (shortcut for -b yes), script.sh can be a script or a binary. With -b n (which is the default for qsub), script.sh must be a script (text file).

qsub -M person1@email.somewhere.cz,person2@email.somewhere.cz -m beas Specify the email addresses that should be notified when the job is started (b), ended (e), aborted or rescheduled (a), or suspended (s). The default is now -m a and the default email address is forwarded to you (so there is no need to use -M). You can use -m n to override the defaults and send no emails.

qsub -hold_jid 121144,121145 (or qsub -hold_jid get_src.sh,get_tgt.sh) The current job is not executed before all the specified jobs are completed.

qsub -now y Start the job immediately or not at all, i.e. don't put it as pending to the queue. This is the default for qrsh, but you can change it with -now n (which is the default for qsub).

qsub -N my-name By default the name of a job (which you can see e.g. in qstat) is the name of the script.sh. This way you can override it.

qsub -S /bin/bash The hashbang (#!/bin/bash) in your script.sh is ignored, but you can change the interpreter with -S. The default interpreter is /bin/bash.

qsub -v PATH[=value] Export a given environment variable from the current shell to the job.

qsub -V Export all environment variables. (This is less necessary now that bash is the default interpreter and your ~/.bashrc seems to be always sourced.)

qsub -soft -l ... -hard -l ... -q ... By default, all the resource requirements (specified with -l) and queue requirements (specified with -q) are hard, i.e. your job won't be scheduled unless they can be fulfilled. You can use -soft to mark all following requirements as nice-to-have. And with -hard you can switch back to hard requirements.

qsub -sync y This causes qsub to wait for the job to complete before exiting (with the same exit code as the job). Useful in scripts.

Memory

  • There are three commonly used options for specifying memory requirements: mem_free, act_mem_free and h_vmem. Each has a different purpose.
  • mem_free=1G means 1024×1024×1024 bytes, i.e. one gibibyte (GiB). mem_free=1g means 1000×1000×1000 bytes, i.e. one gigabyte. Similarly for the other options and other prefixes (k, K, m, M).
  • mem_free (or mf) specifies a consumable resource tracked by SGE and it affects job scheduling. Each machine has an initial value assigned (slightly lower than the real total physical RAM capacity). When you specify qsub -l mem_free=4G, SGE finds a machine with mem_free >= 4GB, and subtracts 4GB from it. This limit is not enforced, so if a job exceeds this limit, it is not automatically killed and thus the SGE value of mem_free may not represent the real free memory. The default value is 1G. By not using this option and eating more than 1 GiB, you are breaking the rules.
  • act_mem_free (or amf) is a ÚFAL-specific option, which specifies the real amount of free memory (at the time of scheduling). You can specify it when submitting a job and it will be scheduled to a machine with at least this amount of memory free. In an ideal world, where no jobs exceed their mem_free requirements, we would not need this option. However, in the real world, it is recommended to use this option with the same value as mem_free to protect your job from failing with an out-of-memory error (because of naughty jobs of other users).
  • h_vmem is equivalent to setting ulimit -v, i.e. it is a hard limit on the size of virtual memory (see RLIMIT_AS in man setrlimit). If your job exceeds this limit, memory allocation fails (i.e., malloc returns NULL and mmap returns MAP_FAILED), and your job will probably crash with SIGSEGV. TODO: according to man queue_conf, the job is killed with SIGKILL, not with SIGSEGV. Note that h_vmem specifies the maximal size of allocated memory, not used memory; in other words, it is the VIRT column in top, not the RES column. SGE does not use this parameter in any other way. Notably, job scheduling is not affected by this parameter and therefore there is no guarantee that there will be this amount of memory on the chosen machine. The problem is that some programs (e.g. Java with the default setting) allocate much more (virtual) memory than they actually use in the end. If we want to be ultra conservative, we should set h_vmem to the same value as mem_free. If we want to be only moderately conservative, we should specify something like h_vmem=1.5*mem_free, because some jobs will not use the whole mem_free requested, but our job will still be killed if it allocates much more than declared. The default effectively means that your job has no limits.
  • For GPU jobs, it is usually better to use h_data instead of h_vmem. The CUDA driver allocates a lot of unused virtual memory (tens of GB per card), which is counted in h_vmem, but not in h_data. All usual allocations (malloc, new, Python allocations) seem to be included in h_data.
  • It is recommended to profile your task first (see the Profiling section below), so you can estimate reasonable memory requirements before submitting many jobs with the same task (varying in parameters which do not affect memory consumption). So for the first run, declare mem_free with much more memory than expected, ssh to the given machine and check htop (sum all processes of your job) or (if the job finishes quickly) check the epilog. When running other jobs of this type, set mem_free (and act_mem_free and h_vmem) so that you are not wasting resources, but still have some reserve.
  • s_vmem is similar to h_vmem, but instead of SIGSEGV/SIGKILL, the job is sent a SIGXCPU signal, which the job can catch in order to exit gracefully before it is killed. So if you need it, set s_vmem to a lower value than h_vmem and implement SIGXCPU handling and cleanup.
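The SIGXCPU handling mentioned in the last item can be sketched in a shell job script as follows. This is only an outline under assumptions: the scratch file stands in for whatever your job needs to clean up, the handler name is hypothetical, and the signal is simulated locally with kill so the handler can be checked without SGE.

```shell
#!/bin/sh
# Sketch of a job script that reacts to SIGXCPU (sent when s_vmem is
# exceeded) by cleaning up. The scratch file is a hypothetical stand-in
# for real temporary data.
scratch=$(mktemp)
cleaned_up=no

on_soft_limit() {
    echo "soft limit hit, cleaning up" >&2
    rm -f "$scratch"
    cleaned_up=yes
    # In a real job you would typically end with: exit 1
}
trap on_soft_limit XCPU

# Simulate the signal locally to check that the handler is wired up.
kill -s XCPU $$
echo "cleaned_up=$cleaned_up"
```

Note that a shell runs its trap handler only after the current foreground command returns, so a long-running child process may need its own signal handling.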

Profiling

As stated above, you should always specify the exact memory limits when running your tasks, so that you neither waste RAM nor starve others of memory by using more than you requested. However, memory requirements can be difficult to estimate in advance. That's why you should profile your tasks first.

A simple method is to run the task and observe the memory usage reported in the epilog, but SGE may not record transient allocations. As documented in man 5 accounting and observed in qconf -sconf, SGE only collects stats every accounting_flush_time. If this is not set, it defaults to flush_time, which is preset to 15 seconds. But the kernel records all info immediately without polling, and you can view these exact stats by looking into /proc/$PID/status while the task is running.
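A minimal sketch of reading these kernel counters: VmPeak is the peak virtual memory size and VmHWM the peak resident set size. The helper name proc_peaks is hypothetical, and the example inspects the current shell only because its PID is always available; for a real job, pass the status file of the job's process.

```shell
# Print peak virtual (VmPeak) and peak resident (VmHWM) memory from a
# /proc/PID/status file. Linux only; proc_peaks is a hypothetical helper.
proc_peaks() {
    grep -E '^Vm(Peak|HWM):' "$1"
}

# Example: inspect the current shell; use /proc/<PID>/status for a job.
if [ -r "/proc/$$/status" ]; then
    proc_peaks "/proc/$$/status"
fi
```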

You can still miss allocations made shortly before the program exits – which often happens when trying to debug why your program gets killed by SGE after exhausting the reserved space. To record these, use /usr/bin/time -v (the actual binary, not the shell-builtin command time). Be aware that unlike the builtin, it cannot measure shell functions and behaves differently on pipelines.
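The report printed by /usr/bin/time -v is long, and the line of interest here is "Maximum resident set size (kbytes)". A small sketch for extracting it from a saved report follows; the helper name peak_rss_kb, the log file name and my_task.sh are placeholders.

```shell
# Extract the peak resident set size (in kB) from a /usr/bin/time -v report.
# GNU time writes the report to stderr; -o sends it to a file instead.
peak_rss_kb() {
    sed -n 's/^[[:space:]]*Maximum resident set size (kbytes): *//p' "$1"
}

# Usage on a machine with GNU time installed (my_task.sh is a placeholder):
#   /usr/bin/time -v -o time.log ./my_task.sh
#   peak_rss_kb time.log
```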

Obtaining the peak usage of multiprocess applications is trickier. Detached and backgrounded processes are ignored completely by time -v, and you get the maximum footprint of any single child, not the sum of all maximal footprints nor the largest combined footprint at any instant.

If you program in C and need to know the peak memory usage of your children, you can also use the wait4() syscall and calculate the stats yourself.

If your job is the only one on a given machine, you can also look at how much free memory is left while the job is running (e.g. with htop, if you know when the peak occurs).