|
|
(16 intermediate revisions by 2 users not shown) |
Line 1: |
Line 1: |
− | == Resource specification ==
| + | CPU jobs should be submitted to the <code>cpu</code> partition. |
| | | |
− | You should specify the memory and CPU requirements (if higher than the defaults) and don't exceed them. | + | You can submit a non-interactive job using the '''sbatch''' command. |
− | If your job needs more than one CPU (on a single machine) for most of the time, reserve the given number of CPU cores (and SGE slots) with
| + | To submit an interactive job, use the '''srun''' command: |
| | | |
− | <code>qsub -pe smp <number-of-CPU-cores></code>
| + | srun --pty bash |
| | | |
− | The maximum for AIC cluster is 4 cores. If your job needs e.g. up to 110% CPU most of the time and just occasionally 200%, it is OK to reserve just one core (so you don't waste).
| + | == Resource specification == |
| | | |
− | If you are sure your job needs less than 1GB RAM, then you can skip this. Otherwise, if you need e.g. 8 GiB, you must always use <code>qsub</code> (or <code>qrsh</code>) with <code>-l mem_free=8G</code>. You should specify also <code>act_mem_free</code> with the same value and <code>h_vmem</code> with possibly a slightly bigger value. See [[#Memory]] for details. | + | You should specify the memory and CPU requirements (if higher than the defaults) and don't exceed them. |
| + | If your job needs more than one CPU thread (on a single machine) for most of its run time, reserve the required number of CPU threads with the <code>--cpus-per-task</code> option and the required memory with the <code>--mem</code> option. |
| | | |
− | <code>qsub -l mem_free=8G,act_mem_free=8G,h_vmem=12G</code>
| + | srun -p cpu --cpus-per-task=4 --mem=8G --pty bash |
| + | |
| + | This will give you an interactive shell with 4 threads and 8G RAM on the ''cpu'' partition. |
| | | |
| == Monitoring and interaction == | | == Monitoring and interaction == |
Line 16: |
Line 19: |
| === Job monitoring === | | === Job monitoring === |
| We should be able to see what is going on when we run a job. The following examples show the usage of some typical commands: | | We should be able to see what is going on when we run a job. The following examples show the usage of some typical commands: |
− | * <code>qstat</code> - this way we inspect all our jobs (both waiting in the queue and scheduled, i.e. running). | + | * <code>squeue -a</code> - this shows the jobs in all partitions. |
− | * <code>qstat [-u user]</code> - print a list of running/waiting jobs of a given user | + | * <code>squeue -u user</code> - print a list of running/waiting jobs of a given user |
− | * <code>qstat -u '*' | less</code> - this shows the jobs of all users. | + | * <code>squeue -j<JOB_ID></code> - this shows detailed info about the job with given JOB_ID (if it is still running). |
− | * <code>qstat -j 121144</code> - this shows detailed info about the job with this number (if it is still running).
| + | * <code>sinfo</code> - print available/total resources |
− | * <code>qhost</code> - print available/total resources | |
− | * <code>qacct -j job_id</code> - print info even for ended job (for which ''qstat -j job_id'' does not work). See <code>man qacct</code> for more.
| |
− | | |
− | === Output monitoring ===
| |
− | If we need to see output produced by our job (suppose the ID is 121144), we can inspect the job's output (in our case stored in <code>job_script.sh.o121144</code>) with:<br>
| |
− | <code>less job_script.sh.o*</code><br>
| |
− | ''Hint:'' if the job is still running, press '''F''' in <code>less</code> to simulate <code>tail -f</code>.
| |
− | | |
− | ==== How to read output epilog ====
| |
− | The epilog section contains some interesting pieces of information. However this it can get confusing sometimes.
| |
− | | |
− | ======= EPILOG: Tue Jun 4 12:41:07 CEST 2019
| |
− | == Limits:
| |
− | == Usage: cpu=00:00:00, mem=0.00000 GB s, io=0.00000 GB, vmem=N/A, maxvmem=N/A
| |
− | == Duration: 00:00:00 (0 s)
| |
− | == Server name: cpu-node13
| |
− | | |
− | * ''Limits'' - on this line you can see job limits specified through <code>qsub</code> options
| |
− | * ''Usage'' - resource usage during computation
| |
− | ** ''cpu=HH:MM:SS'' - the accumulated CPU time usage
| |
− | ** ''mem=XY GB'' - gigabytes of RAM used times the duration of the job in seconds, so don't be afraid XY is usually a very high number (unlike in this toy example)
| |
− | ** ''io=XY GB'' - the amount of data transferred in input/output operations in GB
| |
− | ** ''vmem=XY'' - actual virtual memory consumption when the job finished
| |
− | ** ''maxvmem=XY'' - peak virtual memory consumption
| |
− | * ''Duration'' - total execution time
| |
− | * ''Server name'' - name of the executing server
| |
| | | |
| === Job interaction === | | === Job interaction === |
| + | * <code>scontrol show job JOBID</code> - this shows the details of the running job with the given JOBID |
| + | * <code>scancel JOBID</code> - cancel the job with the given JOBID (this removes it from the queue, or stops it if it is already running) |
| | | |
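The monitoring and interaction commands above are often chained in a small script. The following is a minimal sketch, not an official recipe: the helper name <code>extract_job_id</code> and the script name <code>job_script.sh</code> are hypothetical, and it assumes that <code>sbatch</code> prints its usual single-line confirmation <code>Submitted batch job JOBID</code>.

```shell
#!/bin/bash
# Hypothetical helper: read sbatch's confirmation line from stdin and print
# the job ID, assuming the ID is the last word on that line.
extract_job_id() {
    local line
    read -r line
    echo "${line##* }"
}

# Only talk to SLURM when it is installed and the (hypothetical) script exists.
if command -v sbatch >/dev/null 2>&1 && [ -f job_script.sh ]; then
    job_id="$(sbatch -p cpu job_script.sh | extract_job_id)"
    if [ -n "$job_id" ]; then
        squeue -j "$job_id"            # is the job pending or running?
        scontrol show job "$job_id"    # full details of the job
        # scancel "$job_id"            # uncomment to cancel the job
    fi
fi
```

Keeping the ID in a variable avoids copying it by hand between <code>squeue</code>, <code>scontrol</code> and <code>scancel</code>.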
− | <code>qdel 121144</code>
| + | === Selected submit options === |
− | This way you can delete (''kill'') a job with a given number, or comma-or-space separated list of job numbers.
| + | The complete list of available options for the commands <code>srun</code> and <code>sbatch</code> can be found in the [https://slurm.schedmd.com/man_index.html SLURM documentation]. Most of the options listed here can be given either as command-line parameters or as <code>#SBATCH</code> directives inside a script. |
− | | |
− | <code>qdel \*</code>
| |
− | This way you can delete all your jobs. Don't be afraid - you cannot delete others jobs.
| |
− | | |
− | <code>qalter</code>
| |
− | You can change some properties of already submitted jobs (both waiting in the queue and running). Changeable properties are listed in <code>man qsub</code>.
| |
− | | |
− | == Advanced usage == | |
− | <code>qsub '''-q''' cpu.q</code>
| |
− | This way your job is submitted to the CPU queue which is the default. If you need GPU use <code>gpu.q</code> instead.
| |
− | | |
− | <code>qsub '''-l''' ...</code>
| |
− | See <code>man complex</code> (run it on aic) for a list of possible resources you may require (in addition to <code>mem_free</code> etc. discussed above).
| |
− | | |
− | <code>qsub '''-p''' -200</code>
| |
− | Define a priority of your job as a number between -1024 and 0. Only SGE admins may use a number higher than 0. Default is set to TODO. You should ask for lower priority (-1024..-101) if you submit many jobs at once or if the jobs are not urgent. SGE uses the priority to decide when to start which pending job in the queue (it computes a real number called <code>prior</code>, which is reported in <code>qstat</code>, which grows as the job is waiting in the queue). Note that once a job is started, you cannot ''unschedule'' it, so from that moment on, it is irrelevant what was its priority.
| |
− | | |
− | <code>qsub '''-o''' LOG.stdout '''-e''' LOG.stderr</code>
| |
− | redirect std{out,err} to separate files with given names, instead of the defaults <code>$JOB_NAME.o$JOB_ID</code> and <code>$JOB_NAME.e$JOB_ID</code>.
| |
− | | |
− | <code>qsub '''-@''' optionfile</code>
| |
− | Instead of specifying all the <code>qsub</code> options on the command line, you can store them in a file (you can use # comments in the file).
| |
− | | |
− | <code>qsub '''-a''' 12312359</code>
| |
− | Execute your job no sooner than at the given time (in <code>[YY]MMDDhhmm</code> format). An alternative to <code>sleep 3600 && qsub ... &</code>.
| |
− | | |
− | <code>qsub '''-b''' y</code>
| |
− | Treat <code>script.sh</code> (or whatever is the name of the command you execute) as a binary, i.e. don't search for in-script options within the file, don't transfer it to the ''qmaster'' and then to the execution node. This makes the execution a bit faster and it may prevent some rare but hard-to-detect errors caused SGE interpreting the script. The script must be available on the execution node via Lustre (which is our case), etc. With <code>-b y</code> (shortcut for <code>-b yes</code>), <code>script.sh</code> can be a script or a binary. With <code>-b n</code> (which is the default for <code>qsub</code>), <code>script.sh</code> must be a script (text file).
| |
− | | |
− | <code>qsub '''-M''' person1@email.somewhere.cz,person2@email.somewhere.cz '''-m''' beas</code>
| |
− | Specify the emails where you want to be notified when the job has been '''b''' started, '''e''' ended, '''a''' aborted, rescheduled or '''s''' suspended.
| |
− | The default is now <code>-m a</code> and the default email address is forwarded to you (so there is no need to use '''-M'''). You can use <code>-m n</code> to override the defaults and send no emails.
| |
− | | |
− | <code>qsub '''-hold_jid''' 121144,121145</code> (or <code>qsub '''-hold_jid''' get_src.sh,get_tgt.sh</code>)
| |
− | The current job is not executed before all the specified jobs are completed.
| |
− | | |
− | <code>qsub '''-now''' y</code>
| |
− | Start the job immediately or not at all, i.e. don't put it as pending to the queue. This is the default for <code>qrsh</code>, but you can change it with <code>-now n</code> (which is the default for <code>qsub</code>).
| |
− | | |
− | <code>qsub '''-N''' my-name</code>
| |
− | By default the name of a job (which you can see e.g. in <code>qstat</code>) is the name of the <code>script.sh</code>. This way you can override it.
| |
− | | |
− | <code>qsub '''-S''' /bin/bash</code>
| |
− | The hashbang (<code>!#/bin/bash</code>) in your <code>script.sh</code> is ignored, but you can change the interpreter with ''-S''. The default interpreter is <code>/bin/bash</code>.
| |
− | | |
− | <code>qsub '''-v''' PATH[=value]</code>
| |
− | Export a given environment variable from the current shell to the job.
| |
− | | |
− | <code>qsub '''-V'''</code>
| |
− | Export all environment variables. (This is not so needed now, when bash is the default interpreter and it seems your <code>~/.bashrc</code> is always sourced.)
| |
− | | |
− | <code>qsub '''-soft''' -l ... '''-hard''' -l ... -q ...</code>
| |
− | By default, all the resource requirements (specified with <code>-l</code>) and queue requirements (specified with ''-q'') are '''hard''', i.e. your job won't be scheduled unless they can be fulfilled. You can use <code>-soft</code> to mark all following requirements as nice-to-have. And with <code>-hard</code> you can switch back to hard requirements.
| |
− | | |
− | <code>qsub '''-sync''' y</code>
| |
− | This causes qsub to wait for the job to complete before exiting (with the same exit code as the job). Useful in scripts.
| |
− | | |
− | == Memory ==
| |
− | | |
− | * There are three commonly used options for specifying memory requirements: '''mem_free''', '''act_mem_free''' and '''h_vmem'''. Each has a different purpose.
| |
− | * '''mem_free=1G''' means 1024×1024×1024 bytes, i.e. one [[https://en.wikipedia.org/wiki/Gibibyte|GiB (gibibyte)]]. '''mem_free=1g''' means 1000×1000×1000 bytes, i.e. one gigabyte. Similarly for the other options and other prefixes (k, K, m, M).
| |
− | * '''mem_free''' (or '''mf''') specifies a ''consumable resource'' tracked by SGE and it affects job scheduling. Each machine has an initial value assigned (slightly lower than the real total physical RAM capacity). When you specify <code>qsub -l mem_free=4G</code>, SGE finds a machine with '''mem_free''' >= 4GB, and subtracts 4GB from it. This limit is not enforced, so if a job exceeds this limit, ''it is not automatically killed'' and thus the SGE value of '''mem_free''' may not represent the real free memory. The default value is 1G. By not using this option and eating more than 1 GiB, you are breaking the rules.
| |
− | * '''act_mem_free''' (or '''amf''') is a ÚFAL-specific option, which specifies the real amount of free memory (at the time of scheduling). You can specify it when submitting a job and it will be scheduled to a machine with at least this amount of memory free. In an ideal world, where no jobs are exceeding their ''mem_free'' requirements, we would not need this option. However, in the real world, it is recommended to use this option with the same value as ''mem_free'' to protect your job from failing with out-of-memory error (because of naughty jobs of other users).
| |
− | * '''h_vmem''' is equivalent to setting '''ulimit -v''', i.e. it is a hard limit on the size of virtual memory (see RLIMIT_AS in <code>man setrlimit</code>). If your job exceeds this limit, memory allocation fails (i.e., malloc or mmap will return NULL), and your job will probably crash on SIGSEGV. TODO: according to <code>man queue_conf</code>, the job is killed with SIGKILL, not with SIGSEGV. Note that '''h_vmem''' specifies the maximal size of ''allocated_memory'', not ''used_memory'', in other words it is the VIRT column in <code>top</code>, not the RES column. SGE does not use this parameter in any other way. Notably, job scheduling is not affected by this parameter and therefore there is no guarantee that there will be this amount of memory on the chosen machine. The problem is that some programs (e.g. Java with the default setting) allocate much more (virtual) memory than they actually use in the end. If we want to be ultra conservative, we should set '''h_vmem''' to the same value as '''mem_free'''. If we want to be only moderately conservative, we should specify something like '''h_vmem=1.5*mem_free''', because some jobs will not use the whole mem_free requested, but still our job will be killed if it allocated much more than declared. The default effectively means that your job has no limits.
| |
− | * For GPU jobs, it is usually better to use '''h_data''' instead of '''h_vmem'''. CUDA driver allocates a lot of ''unused'' virtual memory (tens of GB per card), which is counted in '''h_vmem''', but not in '''h_data'''. All usual allocations (''malloc'', ''new'', Python allocations) seem to be included in '''h_data'''.
| |
− | * It is recommended to ''profile your task first'' (see [[#Profiling]] below, so you can estimate reasonable memory requirements before submitting many jobs with the same task (varying in parameters which do not affect memory consumption). So for the first time, declare mem_free with much more memory than expected and ssh to a given machine and check <code>htop</code> (sum all processes of your job) or (if the job is done quickly) check the epilog. When running other jobs of this type, set '''mem_free''' (and '''act_mem_free''' and '''h_vmem''') so you are not wasting resources, but still have some reserve.
| |
− | * '''s_vmem''' is similar to '''h_vmem''', but instead of SIGSEGV/SIGKILL, the job is sent a SIGXCPU signal which can be caught by the job and exit gracefully before it is killed. So if you need it, set '''s_vmem''' to a lower value than '''h_vmem''' and implement SIGXCPU handling and cleanup.
| |
− | | |
− | == GPU ==
| |
− | | |
− | The GPU part of the cluster consists of the following nodes:
| |
− | | |
− | {| class="wikitable"
| |
− | |-
| |
− | ! machine !! GPU type !! GPU driver version !! [[https://en.wikipedia.org/wiki/CUDA#GPUs_supported|cc]] !! GPU cnt !! GPU RAM (GB) !! machine RAM (GB) !! remarks
| |
− | |-
| |
− | | gpu-node1 || GeForce GTX 1080 || 418.39 || 6.1 || 2 || 8.0 || 64.0
| |
− | |-
| |
− | | gpu-node2 || GeForce GTX 1080 || 418.39 || 6.1 || 2 || 8.0 || 64.0
| |
− | |-
| |
− | | gpu-node3 || GeForce GTX 1080 || 418.39 || 6.1 || 2 || 8.0 || 64.0
| |
− | |-
| |
− | | gpu-node4 || GeForce GTX 1080 || 418.39 || 6.1 || 2 || 8.0 || 64.0
| |
− | |-
| |
− | | gpu-node5 || GeForce GTX 1080 || 418.39 || 6.1 || 2 || 8.0 || 64.0
| |
− | |-
| |
− | | gpu-node6 || GeForce GTX 1080 || 418.39 || 6.1 || 2 || 8.0 || 64.0
| |
− | |-
| |
− | | gpu-node7 || GeForce GTX 1080 || 418.39 || 6.1 || 2 || 8.0 || 64.0 || only for group '''research'''
| |
− | |-
| |
− | | gpu-node8 || GeForce GTX 1080 || 418.39 || 6.1 || 2 || 8.0 || 64.0 || only for group '''research'''
| |
− | |}
| |
− | | |
− | === Rules ===
| |
− | | |
− | * Always use GPUs via ''qsub'' (or ''qrsh''), never via ''ssh''. You can ssh to any machine e.g. to run ''nvidia-smi'' or ''htop'', but not to start computing on GPU. Don't forget to specify you RAM requirements with e.g. ''-l mem_free=8G,act_mem_free=8G,h_data=12G''.
| |
− | ** Note that you need to use ''h_data'' instead of ''h_vmem'' for GPU jobs. CUDA driver allocates a lot of "unused" virtual memory (tens of GB per card), which is counted in ''h_vmem'', but not in ''h_data''. All usual allocations (''malloc'', ''new'', Python allocations) seem to be included in ''h_data''.
| |
− | * Always specify the number of GPU cards (e.g. ''gpu=1''). Thus e.g. <code>qsub -q 'gpu*' -l gpu=1</code>
| |
− | * If you need more than one GPU card (on a single machine), always require as many CPU cores (''-pe smp X'') as many GPU cards you need. E.g. <code>qsub -q 'gpu*' -l gpu=2 -pe smp 4</code>
| |
− | * For interactive jobs, you can use ''qrsh'', but make sure to end your job as soon as you don't need the GPU (so don't use qrsh for long training). '''Warning: <code>-pty yes bash -l</code> is necessary''', otherwise the variable ''$CUDA_VISIBLE_DEVICES'' will not be set correctly. E.g. <code>qrsh -q 'gpu*' -l gpu=1 -pty yes bash -l</code>
| |
− | * In general: don't reserve a GPU (as described above) without actually using it for longer time, e.g., try separating steps which need GPU and steps which do not and execute those separately on our GPU resp. CPU cluster.
| |
− | * If you know an approximate runtime of your job, please specify it with ''-l s_rt=hh:mm:ss'' - this is a soft constraint so your job won't be killed if it runs longer than specified.
| |
| | | |
− | === CUDA and CUDNN === | + | -J helloWorld # name of job |
| + | --chdir /job/path/ # path where the job will be executed |
| + | -p gpu # name of partition or queue (if not specified default partition is used) |
| + | -q normal # QOS level (sets priority of the job) |
| + | -c 4 # reserve 4 CPU threads |
| + | --gres=gpu:1 # reserve 1 GPU card |
| + | -o script.out # name of output file for the job |
| + | -e script.err # name of error file for the job |
| | | |
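The options listed above can also be written as <code>#SBATCH</code> directives at the top of a batch script. The following is a minimal sketch, assuming a hypothetical script saved as <code>hello.sh</code> and submitted with <code>sbatch hello.sh</code>. The job body and the fallback thread count are illustrative only; the fallback exists so that the script can be dry-run outside SLURM.

```shell
#!/bin/bash
#SBATCH -J helloWorld
#SBATCH -p cpu
#SBATCH -c 4
#SBATCH --mem=8G
#SBATCH -o script.out
#SBATCH -e script.err

# SLURM exports SLURM_CPUS_PER_TASK when -c/--cpus-per-task is given.
# Outside SLURM the variable is unset, so fall back to a single thread.
threads="${SLURM_CPUS_PER_TASK:-1}"

echo "job ${SLURM_JOB_ID:-(not under SLURM)} running with ${threads} thread(s)"
```

Options given on the command line take precedence over the directives in the script, so a single run can be adjusted without editing the file.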
− | Default CUDA (currently 10.1 as of Nov 2019) is available in
| + | == Array jobs == |
− | /opt/cuda
| + | If you need to submit a large number of similar jobs (e.g. to process a large number of input files), you should consider launching an ''array job''. |
− | Specific version can be found in
| |
− | /lnet/aic/opt/cuda/cuda-{9.0,9.2,10.0,10.1,...}
| |
− | Depending on what version you need, you should add <code>LD_LIBRARY_PATH="/lnet/aic/opt/cuda/cuda-X.Y/lib64:$LD_LIBRARY_PATH"</code> to your configuration.
| |
| | | |
− | CUDNN is available directly in ''lib64'' directory of the respective CUDA, so no need to configure it specifically.
| + | For example, one might need to process 1000 files named <code>file_N.txt</code> (where N is a number between 1 and 1000). |
| + | A program that can process one file is called <code>crunchFile</code> and it takes only one argument: the name of the file to process. Instead of calling the following command 1000 times: |
| + | sbatch crunchFile file_N.txt |
| | | |
− | == Profiling ==
| + | we can write a wrapper script <code>crunchScript.sh</code> referring to the SLURM variable <code>SLURM_ARRAY_TASK_ID</code>: |
− | As stated above, you should always specify the exact memory limits when running your tasks, so that you neither waste RAM nor starve others of memory by using more than you requested. However, memory requirements can be difficult to estimate in advance. That's why you should profile your tasks first.
| |
| | | |
− | A simple method is to run the task and observe the memory usage reported in the epilog, but SGE may not record transient allocations. As documented in <code>man 5 accounting</code> and observed in <code>qconf -sconf</code>, SGE only collects stats every '''accounting_flush_time'''. If this is not set, it defaults to '''flush_time''', which is preset to 15 seconds. But the kernel records all info immediately without polling, and you can view these exact stats by looking into <code>/proc/$PID/status</code> while the task is running.
| + | #!/bin/bash |
| + | #SBATCH -p cpu |
| + | #SBATCH --mem 2G |
| + | |
| + | crunchFile file_${SLURM_ARRAY_TASK_ID}.txt |
| | | |
− | You can still miss allocations made shortly before the program exits – which often happens when trying to debug why your program gets killed by SGE after exhausting the reserved space. To record these, use <code>/usr/bin/time -v</code> (the actual binary, not the shell-builtin command <code>time</code>). Be aware that unlike the builtin, it cannot measure shell functions and behaves differently on pipelines.
| + | and submit all the jobs at once as an ''array job'': |
| | | |
− | Obtaining peak usage of multiprocess applications is trickier. Detached and backgrounded processes are ignored completely by <code>time -v</code> and you get the maximum footprint of any children, not the sum of all maximal footprints nor the largest footprint in any instant.
| + | sbatch --array=1-1000%20 crunchScript.sh |
| | | |
− | If you program in C and need to know the peak memory usage of your children, you can also use the '''wait4()''' syscall and calculate the stats yourself.
| + | Here the option <code>--array=1-1000%20</code> tells SLURM to: |
| + | * launch 1000 instances of <code>crunchScript.sh</code> |
| + | * launch each instance with <code>SLURM_ARRAY_TASK_ID</code> set to a different number in the specified range |
| + | * run at most 20 tasks in parallel at any time. This is useful for a larger number of tasks, because it ensures that we do not flood the cluster with requests. |
| | | |
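A slightly more defensive variant of the wrapper script can be dry-run outside the cluster before the whole array is submitted. This is a sketch under stated assumptions: the helper <code>input_for_task</code> is hypothetical, and the fallback task ID of 1 is only used when <code>SLURM_ARRAY_TASK_ID</code> is not set (i.e. outside an array job).

```shell
#!/bin/bash
#SBATCH -p cpu
#SBATCH --mem=2G

# Hypothetical helper: map an array task ID to the name of its input file.
input_for_task() {
    printf 'file_%s.txt' "$1"
}

# SLURM sets SLURM_ARRAY_TASK_ID for every task of an array job.
# The fallback to 1 only allows a dry run outside the cluster.
task_id="${SLURM_ARRAY_TASK_ID:-1}"
input="$(input_for_task "$task_id")"

if [ -f "$input" ]; then
    crunchFile "$input"
else
    echo "task ${task_id}: input ${input} not found, skipping" >&2
fi
```

Checking for the input file first means that a gap in the numbering fails only the affected task, with a readable message in its error file.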
− | If your job is the only one on a given machine, you can also look how much free memory is left when running the job (e.g. with <code>htop</code> if you know when is the peak moment).
| + | You can read more about ''array jobs'' from the [https://slurm.schedmd.com/job_array.html SLURM documentation]. |