Submitting CPU Jobs

From UFAL AIC
Revision as of 10:05, 12 June 2019 by Vodrazka (talk | contribs)

Resource specification

Monitoring and interaction

Job monitoring

We should be able to see what is going on when we run a job. The following examples show typical usage of the qstat command:

  • qstat - this way we inspect all our jobs (both waiting in the queue and scheduled, i.e. running).
  • qstat -u '*' | less - this shows the jobs of all users.
  • qstat -j 121144 - this shows detailed info about the job with this number (if it is still running).
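When many jobs are submitted, a per-state summary can be easier to read than the full listing. The following is a minimal sketch, assuming the usual layout of plain qstat output (two header lines, with the job state in the fifth column); the function name count_job_states is made up for this example.

```shell
# Count jobs per state (e.g. r = running, qw = queued and waiting).
# Assumption: plain `qstat` output starts with two header lines and has
# the job state in the fifth whitespace-separated column.
count_job_states() {
    awk 'NR > 2 && NF >= 5 { count[$5]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# On the cluster: qstat | count_job_states
```

The function only reads standard input, so it can also be tried on a saved copy of a qstat listing.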

Output monitoring

If we need to see output produced by our job (suppose the ID is 121144), we can inspect the job's output (in our case stored in job_script.sh.o121144) with:
less job_script.sh.o*
Hint: if the job is still running, press F in less to simulate tail -f.

How to read output epilog

The epilog section contains some interesting pieces of information. However, it can be confusing at times.

======= EPILOG: Tue Jun 4 12:41:07 CEST 2019
== Limits:   
== Usage:    cpu=00:00:00, mem=0.00000 GB s, io=0.00000 GB, vmem=N/A, maxvmem=N/A
== Duration: 00:00:00 (0 s)
== Server name: cpu-node13
  • Limits - on this line you can see job limits specified through qsub options
  • Usage - resource usage during computation
    • cpu=HH:MM:SS - the accumulated CPU time usage
    • mem=XY GB s - gigabytes of RAM used multiplied by the duration of the job in seconds, so do not be alarmed: XY is usually a very large number (unlike in this toy example)
    • io=XY GB - the amount of data transferred in input/output operations in GB
    • vmem=XY - actual virtual memory consumption when the job finished
    • maxvmem=XY - peak virtual memory consumption
  • Duration - total execution time
  • Server name - name of the executing server
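Because mem is reported in GB s, a rough average RAM usage can be recovered by dividing it by the duration in seconds. The sketch below follows the description above and assumes the epilog looks exactly like the sample (a "== Usage:" line with mem=... and a "== Duration:" line ending in "(N s)"); the function name epilog_avg_mem is hypothetical.

```shell
# Estimate average RAM usage in GB from a job's epilog:
# average = mem (GB * s) / duration (s).
# Assumption: the epilog lines are formatted as in the sample above.
epilog_avg_mem() {
    awk '
        /^== Usage:/ {
            # " mem=" with a leading space, so vmem= and maxvmem= are not picked up
            if (match($0, / mem=[0-9.]+/))
                mem = substr($0, RSTART + 5, RLENGTH - 5)
        }
        /^== Duration:/ {
            # the number of seconds is given in parentheses, e.g. "(3600 s)"
            if (match($0, /\([0-9]+ s\)/))
                secs = substr($0, RSTART + 1, RLENGTH - 4)
        }
        END {
            if (secs + 0 > 0) printf "%.3f GB\n", mem / secs
            else              print "n/a (zero duration)"
        }'
}

# Usage: epilog_avg_mem < job_script.sh.o121144
```

For the toy epilog above the duration is 0 s, so the function reports that no average can be computed instead of dividing by zero.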

Job interaction

qalter You can change some properties of already submitted jobs (both waiting in the queue and running). Changeable properties are listed in man qsub.


Advanced usage

qsub -q cpu.q This way your job is submitted to the CPU queue, which is the default. If you need a GPU, use gpu.q instead.

qsub -l ... See man complex (run it on aic) for a list of possible resources you may require (in addition to mem_free etc. discussed above).

qsub -p -200 Define the priority of your job as a number between -1024 and 0. Only SGE admins may use a number higher than 0. Default is set to TODO. You should ask for a lower priority (-1024..-101) if you submit many jobs at once or if the jobs are not urgent. SGE uses the priority to decide which pending job in the queue to start next (it computes a real number called prior, which is reported in qstat and grows while the job waits in the queue). Note that once a job has started, you cannot unschedule it, so from that moment on its priority is irrelevant.

qsub -o LOG.stdout -e LOG.stderr Redirect stdout and stderr to separate files with the given names, instead of the defaults $JOB_NAME.o$JOB_ID and $JOB_NAME.e$JOB_ID.

qsub -@ optionfile Instead of specifying all the qsub options on the command line, you can store them in a file (you can use # comments in the file).
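As an illustration, an option file can be created with a here-document. This is only a sketch: the file name my_qsub_options is made up, and the options inside are the ones described on this page, written one per line.

```shell
# Write a small option file; lines starting with # are comments.
cat > my_qsub_options <<'EOF'
# submit to the default CPU queue with a lowered priority
-q cpu.q
-p -200
# keep stdout and stderr in separate, predictably named files
-o LOG.stdout
-e LOG.stderr
EOF

# On the cluster: qsub -@ my_qsub_options script.sh
```

Keeping the options in a file like this makes it easier to reuse the same settings for a whole batch of jobs.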

qsub -a 12312359 Execute your job no sooner than at the given time (in [YY]MMDDhhmm format). An alternative to sleep 3600 && qsub ... &.
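The timestamp does not have to be typed by hand. A minimal sketch, assuming GNU date (as found on typical Linux machines) and using a hypothetical variable name start_at, computes the MMDDhhmm value for one hour from now:

```shell
# Build an MMDDhhmm timestamp one hour from now (GNU date syntax).
start_at=$(date -d '+1 hour' +%m%d%H%M)

# Print the command that would be run on the cluster.
echo "qsub -a $start_at script.sh"
```

Unlike the sleep variant, the job is submitted right away and waits in the queue, so nothing has to keep running in your shell.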

qsub -b y Treat script.sh (or whatever the name of the command you execute is) as a binary, i.e. don't search for in-script options within the file and don't transfer it to the qmaster and then to the execution node. This makes the execution a bit faster and may prevent some rare but hard-to-detect errors caused by SGE interpreting the script. The script must be available on the execution node, e.g. via Lustre (which is our case). With -b y (shortcut for -b yes), script.sh can be a script or a binary. With -b n (which is the default for qsub), script.sh must be a script (text file).

qsub -M person1@email.somewhere.cz,person2@email.somewhere.cz -m beas Specify the email addresses to notify when the job has begun (b), ended (e), been aborted or rescheduled (a), or been suspended (s). The default is now -m a, and the default email address forwards to you (so there is no need to use -M). You can use -m n to override the defaults and send no emails.

qsub -hold_jid 121144,121145 (or qsub -hold_jid get_src.sh,get_tgt.sh) The current job is not executed before all the specified jobs are completed.

qsub -now y Start the job immediately or not at all, i.e. don't leave it pending in the queue. This is the default for qrsh, but you can change it with -now n (which is the default for qsub).

qsub -N my-name By default the name of a job (which you can see e.g. in qstat) is the name of the script.sh. This way you can override it.

qsub -S /bin/bash The hashbang (#!/bin/bash) in your script.sh is ignored, but you can change the interpreter with -S. The default interpreter is /bin/bash.

qsub -v PATH[=value] Export a given environment variable from the current shell to the job.

qsub -V Export all environment variables. (This is rarely needed now that bash is the default interpreter and your ~/.bashrc seems to be always sourced.)

qsub -soft -l ... -hard -l ... -q ... By default, all the resource requirements (specified with -l) and queue requirements (specified with -q) are hard, i.e. your job won't be scheduled unless they can be fulfilled. You can use -soft to mark all following requirements as nice-to-have. And with -hard you can switch back to hard requirements.

qsub -sync y This causes qsub to wait for the job to complete before exiting (with the same exit code as the job). Useful in scripts.
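A script can then branch on the job's result. The following is a sketch only: the function name submit_and_wait and the QSUB variable are assumptions made for this example (QSUB lets you substitute another command for a dry run outside the cluster).

```shell
# Submit a job, wait for it to finish, and report how it ended.
# QSUB is a hypothetical override; by default the real qsub is used.
submit_and_wait() {
    if "${QSUB:-qsub}" -sync y "$@"; then
        echo "job succeeded"
    else
        # with -sync y, the exit code of qsub is the exit code of the job
        echo "job failed with exit code $?"
    fi
}

# On the cluster: submit_and_wait script.sh
```

This keeps a multi-step pipeline in a single script, at the cost of the submitting shell staying blocked until the job finishes.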