Difference between revisions of "Submitting GPU Jobs"

From UFAL AIC
Line 1: Line 1:
 
Start by reading [[Submitting CPU Jobs]] page.
 
Start by reading [[Submitting CPU Jobs]] page.
  
The GPU jobs are submitted to <code>gpu.q</code> queue.
+
The GPU jobs are submitted to <code>gpu</code> partition.
  
To ask for a GPU card, use <code>-l gpu=NUMBER_OF_REQUIRED_GPUS</code>. The submitted job has <code>CUDA_VISIBLE_DEVICES</code> set appropriately, so all CUDA applications should use only the allocated GPUs.
+
To ask for one GPU card, use <code>#SBATCH --gres=gpu:1</code> directive or <code>--gres=gpu:1</code> option on the command line. The submitted job has <code>CUDA_VISIBLE_DEVICES</code> set appropriately, so all CUDA applications should use only the allocated GPUs.
 
 
TL;DR: You can submit a non-interactive job requiring <code>%M%</code> GB RAM, <code>%C%</code> CPUs (at most 2) and <code>%G%</code> GPUs (at most 2, but see [[Quotas]]) by running
 
<pre>qsub -q gpu.q -cwd -b y -pe smp %C% -l gpu=%G%,mem_free=%M%G,act_mem_free=%M%G,h_data=%M%G path_to_binary arguments
 
</pre>
 
To submit an interactive terminal, use
 
<pre>qrsh -q gpu.q -cwd -b y -pe smp %C% -l gpu=%G%,mem_free=%M%G,act_mem_free=%M%G,h_data=%M%G -pty yes bash -l
 
</pre>
 
  
 
== Rules ==
 
== Rules ==
  
* Always use GPUs via ''qsub'' (or ''qrsh''), never via ''ssh''. You can ssh to any machine e.g. to run ''nvidia-smi'' or ''htop'', but not to start computing on GPU.
+
* Always use GPUs via ''sbatch'' (or ''srun''), never via ''ssh''. You can ssh to any machine e.g. to run ''nvidia-smi'' or ''htop'', but not to start computing on GPU.
* Don't forget to specify you RAM requirements with e.g. ''-l mem_free=8G,act_mem_free=8G,h_data=12G''.
+
* Don't forget to specify you RAM requirements with e.g. ''--mem=10G''.
** Note that you need to use ''h_data'' instead of ''h_vmem'' for GPU jobs. CUDA driver allocates a lot of "unused" virtual memory (tens of GB per card), which is counted in ''h_vmem'', but not in ''h_data''. All usual allocations (''malloc'', ''new'', Python allocations) seem to be included in ''h_data''.
+
* Always specify the number of GPU cards (e.g. ''--gres=gpu:1''). Thus e.g. <code>srun -p gpu --mem=64G --gres=gpu:2 --pty bash</code>
* Always specify the number of GPU cards (e.g. ''gpu=1''). Thus e.g. <code>qsub -q gpu.q -l gpu=1</code>
+
* For interactive jobs, you can use ''srun'', but make sure to end your job as soon as you don't need the GPU (so don't use srun for long training).
* For interactive jobs, you can use ''qrsh'', but make sure to end your job as soon as you don't need the GPU (so don't use qrsh for long training). '''Warning: <code>-pty yes bash -l</code> is necessary''', otherwise the variable ''$CUDA_VISIBLE_DEVICES'' will not be set correctly. E.g. <code>qrsh -q gpu.q -l gpu=1 -pty yes bash -l</code>
 
 
* In general: don't reserve a GPU (as described above) without actually using it for longer time, e.g., try separating steps which need GPU and steps which do not and execute those separately on our GPU resp. CPU cluster.
 
* In general: don't reserve a GPU (as described above) without actually using it for longer time, e.g., try separating steps which need GPU and steps which do not and execute those separately on our GPU resp. CPU cluster.
* If you know an approximate runtime of your job, please specify it with ''-l s_rt=hh:mm:ss'' - this is a soft constraint so your job won't be killed if it runs longer than specified.
+
* If you know an approximate runtime of your job, please specify it with ''-t <time>''. Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds".
  
 
== CUDA and cuDNN ==
 
== CUDA and cuDNN ==
Line 33: Line 25:
 
* for CUDA 9.0, 9.2, 10.0 and 10.1, cuDNN is available directly in ''lib64'' directory of the respective CUDA, so no need to configure it specifically;
 
* for CUDA 9.0, 9.2, 10.0 and 10.1, cuDNN is available directly in ''lib64'' directory of the respective CUDA, so no need to configure it specifically;
 
* for CUDA 10.1 and later, cuDNN is available in ''cudnn/VERSION/lib64'' subdirectory of the respective CUDA, so you need to add <code>LD_LIBRARY_PATH="/lnet/aic/opt/cuda/cuda-X.Y/cudnn/VERSION/lib64:$LD_LIBRARY_PATH"</code> to your configuration.
 
* for CUDA 10.1 and later, cuDNN is available in ''cudnn/VERSION/lib64'' subdirectory of the respective CUDA, so you need to add <code>LD_LIBRARY_PATH="/lnet/aic/opt/cuda/cuda-X.Y/cudnn/VERSION/lib64:$LD_LIBRARY_PATH"</code> to your configuration.
 
== Available GPU Cards ==
 
 
The GPU part of the cluster consists of the following nodes:
 
{| class="wikitable"
 
|-
 
! machine !! GPU type !! GPU driver version !! [https://en.wikipedia.org/wiki/CUDA#GPUs_supported CC] !! GPU count !! GPU RAM (GB) !! CPU cores !! machine RAM (GB) !! remarks
 
|-
 
| gpu-node1 || GeForce GTX 1080 Ti ||  455.23.05 ||  6.1 ||  2 || 11.0 ||  4 ||  64.0
 
|-
 
| gpu-node2 || GeForce GTX 1080 Ti ||  455.23.05 ||  6.1 ||  2 || 11.0 ||  4 ||  64.0
 
|-
 
| gpu-node3 || GeForce GTX 1080 Ti ||  455.23.05 ||  6.1 ||  2 || 11.0 ||  4 ||  64.0
 
|-
 
| gpu-node4 || GeForce GTX 1080 Ti ||  455.23.05 ||  6.1 ||  2 || 11.0 ||  4 ||  64.0
 
|-
 
| gpu-node5 || GeForce GTX 1080 ||  455.23.05 ||  6.1 ||  2 ||  8.0 ||  4 ||  64.0
 
|-
 
| gpu-node6 || GeForce GTX 1080 ||  455.23.05 ||  6.1 ||  2 ||  8.0 ||  4 ||  64.0
 
|-
 
| gpu-node7 || GeForce GTX 1080 ||  455.23.05 ||  6.1 ||  2 ||  8.0 ||  4 ||  64.0  || only for group '''research'''
 
|-
 
| gpu-node8 || GeForce GTX 1080 ||  455.23.05 ||  6.1 ||  2 ||  8.0 ||  4 ||  64.0  || only for group '''research'''
 
|}
 

Revision as of 16:23, 16 November 2022

Start by reading the Submitting CPU Jobs page.

GPU jobs are submitted to the gpu partition.

To ask for one GPU card, use the #SBATCH --gres=gpu:1 directive in a job script or the --gres=gpu:1 option on the command line. The submitted job has CUDA_VISIBLE_DEVICES set appropriately, so all CUDA applications should use only the allocated GPUs.
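A minimal sketch of a batch script that requests one GPU, assuming it is saved under the hypothetical name gpu_job.sh. The variable gpus and the commented-out workload are illustrative only; outside a Slurm job CUDA_VISIBLE_DEVICES is usually unset, so the script then reports "none".

```shell
#!/bin/bash
#SBATCH --partition=gpu     # the gpu partition
#SBATCH --gres=gpu:1        # ask for one GPU card

# Slurm sets CUDA_VISIBLE_DEVICES for the job; fall back to "none" if it is unset.
gpus="${CUDA_VISIBLE_DEVICES:-none}"
echo "Allocated GPUs: ${gpus}"

# Replace with the real workload; train.py is a hypothetical placeholder.
# python train.py
```

Such a script would be submitted with sbatch gpu_job.sh.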

Rules

  • Always use GPUs via sbatch (or srun), never via ssh. You can ssh to any machine, e.g. to run nvidia-smi or htop, but not to start computations on a GPU.
  • Don't forget to specify your RAM requirements, e.g. with --mem=10G.
  • Always specify the number of GPU cards (e.g. --gres=gpu:1), for example: srun -p gpu --mem=64G --gres=gpu:2 --pty bash
  • For interactive jobs, you can use srun, but make sure to end your job as soon as you no longer need the GPU (so don't use srun for long training).
  • In general: don't reserve a GPU (as described above) for a long time without actually using it. For example, try to separate the steps that need a GPU from those that do not, and run them separately on the GPU and CPU clusters, respectively.
  • If you know the approximate runtime of your job, please specify it with -t <time>. Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds".
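The rules above can be combined in a single batch script. The following is a sketch only: the memory and time values are example figures, and started, elapsed and the commented-out workload are hypothetical names introduced for illustration.

```shell
#!/bin/bash
#SBATCH -p gpu              # GPU partition
#SBATCH --gres=gpu:1        # number of GPU cards
#SBATCH --mem=10G           # RAM requirement (example value)
#SBATCH -t 1-12:00:00       # days-hours:minutes:seconds = 1 day 12 hours (example value)

# Other accepted forms of the same option, for comparison:
#   -t 90         (90 minutes)
#   -t 12:00:00   (12 hours)
#   -t 2-12       (2 days 12 hours)

started=$(date +%s)
echo "GPUs visible to this job: ${CUDA_VISIBLE_DEVICES:-none}"

# Replace with the real GPU workload; train.py is a hypothetical placeholder.
# python train.py

elapsed=$(( $(date +%s) - started ))
echo "Elapsed: ${elapsed}s"
```

Logging the elapsed time is one way to learn a realistic value for -t in later submissions.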

CUDA and cuDNN

Default CUDA (currently 11.2 as of Nov 2021) is available in

 /opt/cuda

Specific versions can be found in

 /lnet/aic/opt/cuda/cuda-{9.0,9.2,10.0,10.1,10.2,11.2,...}

Depending on which version you need, add LD_LIBRARY_PATH="/lnet/aic/opt/cuda/cuda-X.Y/lib64:$LD_LIBRARY_PATH" to your configuration.
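As a sketch, the setting could be placed in a shell startup file or at the top of a job script as follows. The version 11.2 is only an example, and the variable names CUDA_VERSION and CUDA_HOME are assumptions made here for readability, not names required by the cluster.

```shell
# Example version; pick one that exists under /lnet/aic/opt/cuda.
CUDA_VERSION=11.2
CUDA_HOME="/lnet/aic/opt/cuda/cuda-${CUDA_VERSION}"

# Prepend the CUDA libraries. The ${VAR:+...} form adds the colon only when
# LD_LIBRARY_PATH is already non-empty, which avoids a trailing empty entry.
export LD_LIBRARY_PATH="${CUDA_HOME}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH}"
```

An empty entry in LD_LIBRARY_PATH is interpreted by the dynamic linker as the current directory, which is why the conditional colon is used here.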

Regarding cuDNN:

  • for CUDA 9.0, 9.2, 10.0 and 10.1, cuDNN is available directly in the lib64 directory of the respective CUDA, so no specific configuration is needed;
  • for CUDA 10.1 and later, cuDNN is available in cudnn/VERSION/lib64 subdirectory of the respective CUDA, so you need to add LD_LIBRARY_PATH="/lnet/aic/opt/cuda/cuda-X.Y/cudnn/VERSION/lib64:$LD_LIBRARY_PATH" to your configuration.
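For the second case, a hedged sketch of the extra step is shown below. The cuDNN version is deliberately left as the placeholder VERSION, because the installed versions are not listed here; CUDA_ROOT, CUDNN_VERSION and CUDNN_LIB are hypothetical variable names, and 11.2 is only an example CUDA version.

```shell
# Example CUDA installation; adjust X.Y to the version you use.
CUDA_ROOT="/lnet/aic/opt/cuda/cuda-11.2"

# Set CUDNN_VERSION to one of the directories under "$CUDA_ROOT/cudnn".
# "VERSION" is a placeholder, not a real directory name.
CUDNN_VERSION="${CUDNN_VERSION:-VERSION}"
CUDNN_LIB="${CUDA_ROOT}/cudnn/${CUDNN_VERSION}/lib64"

if [ -d "$CUDNN_LIB" ]; then
  # Prepend cuDNN; add the colon only if LD_LIBRARY_PATH is already set.
  export LD_LIBRARY_PATH="${CUDNN_LIB}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
else
  echo "cuDNN directory not found: ${CUDNN_LIB}" >&2
fi
```

Checking that the directory exists before exporting makes a mistyped version visible immediately instead of at library load time.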