CPU jobs should be submitted to the <code>cpu</code> partition.
You can submit a non-interactive job using the '''sbatch''' command.
To submit an interactive job, use the '''srun''' command:
 srun --pty bash
== Resource specification ==
Specify your memory and CPU requirements if they are higher than the defaults, and do not exceed them.
If your job needs more than one CPU thread (on a single machine) for most of its runtime, reserve the required number of threads with the <code>--cpus-per-task</code> option and the memory with the <code>--mem</code> option:
| + | srun -p cpu --cpus-per-task=4 --mem=8G --pty bash |
This will give you an interactive shell with 4 CPU threads and 8 GB of RAM on the ''cpu'' partition.
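From inside the allocated shell you can check what you actually got: SLURM exports the allocation through <code>SLURM_*</code> environment variables (the <code>:-unset</code> fallbacks below are only there so the snippet also runs outside a job):

```shell
# Inspect the current allocation from inside an interactive job.
# SLURM_CPUS_PER_TASK reflects --cpus-per-task; SLURM_MEM_PER_NODE
# reflects --mem (in MB). Outside a job these variables are not set,
# so we fall back to "unset" instead of printing nothing.
echo "threads: ${SLURM_CPUS_PER_TASK:-unset}"
echo "memory:  ${SLURM_MEM_PER_NODE:-unset}"
```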
== Monitoring and interaction ==
=== Job monitoring ===
We should be able to see what is going on when we run a job. The following examples show typical usage of the monitoring commands:
* <code>squeue -a</code> - show the jobs in all partitions
* <code>squeue -u user</code> - list the running/waiting jobs of the given user
* <code>squeue -j <JOB_ID></code> - show info about the job with the given JOB_ID (if it is still in the queue or running)
* <code>sinfo</code> - print available/total resources
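The output of <code>squeue</code> is line-oriented, which makes it easy to post-process in the shell. A small sketch, counting your running vs. pending jobs; the hard-coded sample output stands in for the real command, so the snippet can be tried anywhere:

```shell
# Count running (R) vs. pending (PD) jobs.
# On a real cluster you would obtain the state codes with:
#   states=$(squeue -u "$USER" -h -o %t)    # -h: no header, %t: state only
# Here we substitute an illustrative sample: two running, one pending.
states="R
R
PD"

running=$(printf '%s\n' "$states" | grep -c '^R$')
pending=$(printf '%s\n' "$states" | grep -c '^PD$')
echo "running=$running pending=$pending"   # prints: running=2 pending=1
```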
=== Job interaction ===
* <code>scontrol show job JOBID</code> - show details of the running job with the given JOBID
* <code>scancel JOBID</code> - delete the job from the queue
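Combined with <code>squeue</code>, <code>scancel</code> can act on many jobs at once. A sketch; the job IDs are illustrative, and <code>echo</code> serves as a dry-run guard so nothing is actually cancelled until you remove it:

```shell
# Cancel all of your pending jobs at once.
# Real pipeline: squeue -u "$USER" -h -t PD -o %i | xargs -r scancel
# Below, a hard-coded sample of job IDs replaces the squeue call, and
# echo prints the command instead of running it (a dry run):
printf '%s\n' 121144 121145 | xargs -r echo scancel
# prints: scancel 121144 121145
```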
=== Selected submit options ===
The complete list of available options for the <code>srun</code> and <code>sbatch</code> commands can be found in the [https://slurm.schedmd.com/man_index.html SLURM documentation]. Most of the options listed here can be given either as command-line parameters or as <code>#SBATCH</code> directives inside a script.
 -J helloWorld    # name of the job
 -p gpu           # name of the partition or queue (the default partition is used if not specified)
 -q normal        # QOS level (sets the priority of the job)
 -c 4             # reserve 4 CPU threads
 --gres=gpu:1     # reserve 1 GPU card
 -o script.out    # name of the output file of the job
 -e script.err    # name of the error file of the job
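These options are typically collected in a batch script as <code>#SBATCH</code> directives. A minimal sketch; the job name, file names, and resource values are illustrative:

```shell
#!/bin/bash
#SBATCH -J helloWorld     # name of the job (illustrative)
#SBATCH -p cpu            # cpu partition; for GPU jobs use -p gpu together with --gres=gpu:1
#SBATCH -c 4              # reserve 4 CPU threads
#SBATCH --mem=8G          # reserve 8 GB of RAM
#SBATCH -o script.out     # standard output of the job
#SBATCH -e script.err     # standard error of the job

echo "Job started on $(hostname)"
```

Submit it with <code>sbatch job_script.sh</code>. The <code>#SBATCH</code> lines are ordinary comments to bash, so SLURM reads them while the shell ignores them.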