Difference between revisions of "Submitting GPU Jobs"

Latest revision as of 13:04, 24 January 2025

Start by reading the Submitting CPU Jobs page.

GPU jobs are submitted to the gpu partition.

To ask for one GPU card, use the #SBATCH -G 1 directive in your job script or the -G 1 option on the command line. The submitted job has CUDA_VISIBLE_DEVICES set appropriately, so all CUDA applications should use only the allocated GPUs.
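
As a minimal sketch, a batch script requesting a single GPU might look like the following. The helper count_visible_gpus is hypothetical and only added for illustration; it is not part of Slurm. It assumes that CUDA_VISIBLE_DEVICES holds a comma-separated list of device indices.

```shell
#!/bin/bash
# Submit to the gpu partition and ask for one GPU card.
#SBATCH -p gpu
#SBATCH -G 1

# Hypothetical helper: count the GPUs listed in CUDA_VISIBLE_DEVICES.
count_visible_gpus() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo 0
        return
    fi
    # The variable is assumed to be a comma-separated list, e.g. "0" or "0,1".
    echo "${CUDA_VISIBLE_DEVICES}" | tr ',' '\n' | wc -l | tr -d ' '
}

echo "Visible GPUs: $(count_visible_gpus) (CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset})"
```

Saved under a name of your choice (say gpu_test.sh, a placeholder), the script would be submitted with sbatch gpu_test.sh. Inside a job started with -G 1 it should report one visible GPU.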

Rules

Always use GPUs via sbatch (or srun), never via ssh. You can ssh to any machine, e.g. to run nvidia-smi or htop, but not to start computing on a GPU.
Don't forget to specify your RAM requirements, e.g. with --mem=10G.
Always specify the number of GPU cards (e.g. -G 1). For example: srun -p gpu --mem=64G -G 2 --pty bash
For interactive jobs, you can use srun, but make sure to end your job as soon as you no longer need the GPU (so don't use srun for long training).
In general, don't reserve a GPU (as described above) for a long time without actually using it. For example, try to separate the steps that need a GPU from those that do not, and run them separately on our GPU and CPU clusters, respectively.
If you know the approximate runtime of your job, please specify it with -t <time>. Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds".
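
If you estimate the runtime in seconds, a small helper can turn it into the "days-hours:minutes:seconds" form listed above. The function slurm_time below is a hypothetical convenience for illustration, not something provided by Slurm or by this cluster.

```shell
#!/bin/bash
# Hypothetical helper: format a runtime given in seconds as
# days-hours:minutes:seconds, one of the formats accepted by -t.
slurm_time() {
    local total=$1
    local days=$(( total / 86400 ))
    local hours=$(( (total % 86400) / 3600 ))
    local minutes=$(( (total % 3600) / 60 ))
    local seconds=$(( total % 60 ))
    printf '%d-%02d:%02d:%02d\n' "$days" "$hours" "$minutes" "$seconds"
}

slurm_time 5400    # 90 minutes, prints 0-01:30:00
```

A submission that follows all the rules above could then look like sbatch -p gpu -G 1 --mem=10G -t "$(slurm_time 5400)" train.sh, where train.sh stands in for your own job script.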

CUDA and cuDNN

Available CUDA versions are in

/lnet/aic/opt/cuda/

and as of Apr 2023, the available versions are 10.1, 10.2, 11.2, 11.7 and 11.8.

The cuDNN library is also available in the subdirectory cudnn/VERSION/lib64 of the respective CUDA directories.

Therefore, to use CUDA 11.2 with cuDNN 8.1.1, you should add the following to your .profile:

export PATH="/lnet/aic/opt/cuda/cuda-11.2/bin:$PATH"
export LD_LIBRARY_PATH="/lnet/aic/opt/cuda/cuda-11.2/lib64:/lnet/aic/opt/cuda/cuda-11.2/cudnn/8.1.1/lib64:/lnet/aic/opt/cuda/cuda-11.2/extras/CUPTI/lib64:$LD_LIBRARY_PATH"
export XLA_FLAGS=--xla_gpu_cuda_data_dir=/lnet/aic/opt/cuda/cuda-11.2 # XLA configuration if you are using TensorFlow
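
If you switch between versions, the three exports can be wrapped in a function. This is only a sketch: use_cuda is a hypothetical name, it assumes bash, and it assumes that every installed version follows the cuda-X.Y and cudnn/VERSION/lib64 layout described above.

```shell
#!/bin/bash
# Hypothetical helper: point the environment at one CUDA/cuDNN pair.
# Assumes the /lnet/aic/opt/cuda/cuda-X.Y layout holds for the chosen version.
use_cuda() {
    local cuda_version=$1 cudnn_version=$2
    local root="/lnet/aic/opt/cuda/cuda-${cuda_version}"
    export PATH="${root}/bin:${PATH}"
    # Append the previous LD_LIBRARY_PATH only if it was non-empty.
    export LD_LIBRARY_PATH="${root}/lib64:${root}/cudnn/${cudnn_version}/lib64:${root}/extras/CUPTI/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
    # XLA configuration, only relevant if you are using TensorFlow.
    export XLA_FLAGS="--xla_gpu_cuda_data_dir=${root}"
}

use_cuda 11.2 8.1.1
```

The function does not check that the directories exist, so a mistyped version goes unnoticed until a library fails to load.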

CUDA modules

CUDA 11.2 and later can also be loaded as modules. Loading a module sets various environment variables for you, so you should be able to use CUDA easily.

On a GPU node, you can do the following:

List the available modules with: module avail
Load the version you need (possibly specifying the version of cuDNN): module load <modulename>
Unload the module with: module unload <modulename>

As of Apr 2023, the available modules are:

cuda/11.2
cuda/11.2-cudnn8.1
cuda/11.7
cuda/11.7-cudnn8.5
cuda/11.8
cuda/11.8-cudnn8.5
cuda/11.8-cudnn8.6
cuda/11.8-cudnn8.9
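
The module names above follow the pattern cuda/X.Y or cuda/X.Y-cudnnA.B. As a sketch, a job script could build the name from the two versions and load it. The helper cuda_module_name is hypothetical, and the script assumes that the module command is available in the job's shell on the GPU node.

```shell
#!/bin/bash
#SBATCH -p gpu
#SBATCH -G 1

# Hypothetical helper: build a module name such as cuda/11.8-cudnn8.9.
# The cuDNN version is optional.
cuda_module_name() {
    local cuda_version=$1 cudnn_version=${2:-}
    if [ -n "$cudnn_version" ]; then
        echo "cuda/${cuda_version}-cudnn${cudnn_version}"
    else
        echo "cuda/${cuda_version}"
    fi
}

# Load the module only where the module command exists (e.g. on a GPU node).
if command -v module >/dev/null 2>&1; then
    module load "$(cuda_module_name 11.8 8.9)"
fi
```

Only the combinations in the list above exist as modules, so check module avail before relying on a generated name.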

List of installed GPUs

GPU types and memory size

RTX 2080 Ti - 11 GB GPU RAM
RTX A4000 - 16 GB GPU RAM
RTX 3090 - 24 GB GPU RAM

root@gpu-node1:~# nvidia-smi -L
GPU 0: NVIDIA RTX A4000 (UUID: GPU-5b111b2e-ff0d-25f7-2e08-f4065c510832)
GPU 1: NVIDIA RTX A4000 (UUID: GPU-9e4fa6ca-e3fa-d404-eac2-026295fbd076)
GPU 2: NVIDIA RTX A4000 (UUID: GPU-189c4e93-0ebe-2c7b-aa61-270d08db5a9c)
GPU 3: NVIDIA RTX A4000 (UUID: GPU-2f06bc8b-0ef4-6bd9-4385-69c76d73daae)
GPU 4: NVIDIA RTX A4000 (UUID: GPU-818b6a31-6d23-39a5-2139-c2c6c8a1174e)
GPU 5: NVIDIA GeForce RTX 3090 (UUID: GPU-ba293e60-32f9-6907-705b-e053d1bf453b)
GPU 6: NVIDIA RTX A4000 (UUID: GPU-edbbd8a2-f618-070b-8fce-b9a5fa10ccb2)
GPU 7: NVIDIA RTX A4000 (UUID: GPU-82956c50-ec17-6fb7-7898-d3c920c1b1f7)
GPU 8: NVIDIA RTX A4000 (UUID: GPU-ae27887e-5198-ebc6-fa9c-1f4b66d91b46)

root@gpu-node2:~# nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-15b17780-d818-bcd2-566c-564aa1dfc38e)
GPU 1: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-e184b0d4-7147-af43-041b-caa7f597363a)
GPU 2: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-ac1a453e-1c30-3fe0-e246-dd07c7645066)
GPU 3: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-4d19d859-d044-fdc8-17e0-e84fef4a8a13)
GPU 4: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-8035e3f3-76c9-124f-c5ea-d1dd4369f2a8)
GPU 5: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-670d0788-a048-8eef-ad1b-1eb77b18980b)
GPU 6: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-18d030c6-5956-f45f-7d15-ab53cffa813e)
GPU 7: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-f7940219-84a7-8c9c-386f-14e4043c9884)
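
To get a quick overview of a node, the nvidia-smi -L output can be condensed into a count per GPU model. The function summarize_gpus is a hypothetical helper that assumes the "GPU N: NAME (UUID: ...)" line format shown in the listings above.

```shell
#!/bin/bash
# Hypothetical helper: read `nvidia-smi -L` output on stdin and
# print one line per GPU model with the number of cards.
summarize_gpus() {
    sed -n 's/^ *GPU [0-9][0-9]*: \(.*\) (UUID: .*)$/\1/p' \
        | sort \
        | uniq -c \
        | sed 's/^ *\([0-9][0-9]*\) /\1 x /'
}

# Run only where nvidia-smi is installed (e.g. on a GPU node).
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L | summarize_gpus
fi
```

For the gpu-node1 listing above, this would print "1 x NVIDIA GeForce RTX 3090" and "8 x NVIDIA RTX A4000".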

@@ Line 1: / Line 1: @@
 Start by reading [[Submitting CPU Jobs]] page.
-The GPU jobs are submitted to <code>gpu.q</code> queue.
+The GPU jobs are submitted to <code>gpu</code> partition.
-To ask for a GPU card, use <code>-l gpu=NUMBER_OF_REQUIRED_GPUS</code>. The submitted job has <code>CUDA_VISIBLE_DEVICES</code> set appropriately, so all CUDA applications should use only the allocated GPUs.
+To ask for one GPU card, use <code>#SBATCH -G 1</code> directive or <code>-G 1</code> option on the command line. The submitted job has <code>CUDA_VISIBLE_DEVICES</code> set appropriately, so all CUDA applications should use only the allocated GPUs.
-Note that you can ask a different number of CPU cores (<code>-pe smp CPU_CORES</code>) and GPUs (<code>-l gpu=GPUS</code>).
 == Rules ==
-* Always use GPUs via ''qsub'' (or ''qrsh''), never via ''ssh''. You can ssh to any machine e.g. to run ''nvidia-smi'' or ''htop'', but not to start computing on GPU.
+* Always use GPUs via ''sbatch'' (or ''srun''), never via ''ssh''. You can ssh to any machine e.g. to run ''nvidia-smi'' or ''htop'', but not to start computing on GPU.
-* Don't forget to specify you RAM requirements with e.g. ''-l mem_free=8G,act_mem_free=8G,h_data=12G''.
+* Don't forget to specify you RAM requirements with e.g. ''--mem=10G''.
-** Note that you need to use ''h_data'' instead of ''h_vmem'' for GPU jobs. CUDA driver allocates a lot of "unused" virtual memory (tens of GB per card), which is counted in ''h_vmem'', but not in ''h_data''. All usual allocations (''malloc'', ''new'', Python allocations) seem to be included in ''h_data''.
+* Always specify the number of GPU cards (e.g. ''-G 1''). Thus e.g. <code>srun -p gpu --mem=64G -G 2 --pty bash</code>
-* Always specify the number of GPU cards (e.g. ''gpu=1''). Thus e.g. <code>qsub -q 'gpu*' -l gpu=1</code>
+* For interactive jobs, you can use ''srun'', but make sure to end your job as soon as you don't need the GPU (so don't use srun for long training).
-* For interactive jobs, you can use ''qrsh'', but make sure to end your job as soon as you don't need the GPU (so don't use qrsh for long training). '''Warning: <code>-pty yes bash -l</code> is necessary''', otherwise the variable ''$CUDA_VISIBLE_DEVICES'' will not be set correctly. E.g. <code>qrsh -q 'gpu*' -l gpu=1 -pty yes bash -l</code>
 * In general: don't reserve a GPU (as described above) without actually using it for longer time, e.g., try separating steps which need GPU and steps which do not and execute those separately on our GPU resp. CPU cluster.
-* If you know an approximate runtime of your job, please specify it with ''-l s_rt=hh:mm:ss'' - this is a soft constraint so your job won't be killed if it runs longer than specified.
+* If you know an approximate runtime of your job, please specify it with ''-t <time>''. Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds".
+== CUDA and cuDNN ==
+Available CUDA versions are in
+ /lnet/aic/opt/cuda/
+and as of Apr 2023, available versions as 10.1, 10.2, 11.2, 11.7, 11.8.
+The cuDNN library is also available in the subdirectory <code>cudnn/VERSION/lib64</code> of the respective CUDA directories.
+Therefore, to use CUDA 11.2 with cuDNN 8.1.1, you should add the following to your <code>.profile</code>:
+ export PATH="/lnet/aic/opt/cuda/cuda-11.2/bin:$PATH"
+ export LD_LIBRARY_PATH="/lnet/aic/opt/cuda/cuda-11.2/lib64:/lnet/aic/opt/cuda/cuda-11.2/cudnn/8.1.1/lib64:/lnet/aic/opt/cuda/cuda-11.2/extras/CUPTI/lib64:$LD_LIBRARY_PATH"
+ export XLA_FLAGS=--xla_gpu_cuda_data_dir=/lnet/aic/opt/cuda/cuda-11.2 # XLA configuration if you are using TensorFlow
+=== CUDA modules ===
+CUDA 11.2 and later can be also loaded as modules. This will set various environment variables for you so you should be able to use CUDA easily.
+On a GPU node, you can do the following:
+# list available modules with: <code>module avail</code>
+# load the version you need (possibly specifying the version of CuDNN): <code>module load <modulename></code>
+# you can unload the module with: <code>module unload <modulename></code>
-== CUDA and CUDNN ==
+As of Apr 2023, the available modules are
+ cuda/11.2
+ cuda/11.2-cudnn8.1
+ cuda/11.7
+ cuda/11.7-cudnn8.5
+ cuda/11.8
+ cuda/11.8-cudnn8.5
+ cuda/11.8-cudnn8.6
+ cuda/11.8-cudnn8.9
-Default CUDA (currently 10.1 as of Nov 2019) is available in
+=== List of installed GPUs ===
-  /opt/cuda
-Specific version can be found in
-  /lnet/aic/opt/cuda/cuda-{9.0,9.2,10.0,10.1,...}
-Depending on what version you need, you should add <code>LD_LIBRARY_PATH="/lnet/aic/opt/cuda/cuda-X.Y/lib64:$LD_LIBRARY_PATH"</code> to your configuration.
-CUDNN is available directly in ''lib64'' directory of the respective CUDA, so no need to configure it specifically.
+===== GPU types and memory size =====
+* 2080 - 11G GPU RAM
+* A4000 - 16G GPU RAM
+* 3090 - 24G GPU RAM
-== Available GPU Cards ==
+ root@gpu-node1:~# nvidia-smi -L
+ GPU 0: NVIDIA RTX A4000 (UUID: GPU-5b111b2e-ff0d-25f7-2e08-f4065c510832)
+ GPU 1: NVIDIA RTX A4000 (UUID: GPU-9e4fa6ca-e3fa-d404-eac2-026295fbd076)
+ GPU 2: NVIDIA RTX A4000 (UUID: GPU-189c4e93-0ebe-2c7b-aa61-270d08db5a9c)
+ GPU 3: NVIDIA RTX A4000 (UUID: GPU-2f06bc8b-0ef4-6bd9-4385-69c76d73daae)
+ GPU 4: NVIDIA RTX A4000 (UUID: GPU-818b6a31-6d23-39a5-2139-c2c6c8a1174e)
+ GPU 5: NVIDIA GeForce RTX 3090 (UUID: GPU-ba293e60-32f9-6907-705b-e053d1bf453b)
+ GPU 6: NVIDIA RTX A4000 (UUID: GPU-edbbd8a2-f618-070b-8fce-b9a5fa10ccb2)
+ GPU 7: NVIDIA RTX A4000 (UUID: GPU-82956c50-ec17-6fb7-7898-d3c920c1b1f7)
+ GPU 8: NVIDIA RTX A4000 (UUID: GPU-ae27887e-5198-ebc6-fa9c-1f4b66d91b46)
-The GPU part of the cluster consists of the following nodes:
+ root@gpu-node2:~# nvidia-smi -L
-{| class="wikitable"
+ GPU 0: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-15b17780-d818-bcd2-566c-564aa1dfc38e)
-|-
+ GPU 1: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-e184b0d4-7147-af43-041b-caa7f597363a)
-! machine !! GPU type !! GPU driver version !! [https://en.wikipedia.org/wiki/CUDA#GPUs_supported CC] !! GPU count !! GPU RAM (GB) !! CPU cores !! machine RAM (GB) !! remarks
+ GPU 2: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-ac1a453e-1c30-3fe0-e246-dd07c7645066)
-|-
+ GPU 3: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-4d19d859-d044-fdc8-17e0-e84fef4a8a13)
-| gpu-node1 || GeForce GTX 1080 ||  418.39 ||  6.1 ||  2 ||  8.0 ||  4 ||  64.0
+  GPU 4: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-8035e3f3-76c9-124f-c5ea-d1dd4369f2a8)
-|-
+  GPU 5: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-670d0788-a048-8eef-ad1b-1eb77b18980b)
-| gpu-node2 || GeForce GTX 1080 ||  418.39 ||  6.1 ||  2 ||  8.0 ||  4 ||  64.0
+  GPU 6: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-18d030c6-5956-f45f-7d15-ab53cffa813e)
-|-
+ GPU 7: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-f7940219-84a7-8c9c-386f-14e4043c9884)
-| gpu-node3 || GeForce GTX 1080 ||  418.39 ||  6.1 ||  2 ||  8.0 ||  4 ||  64.0
-|-
-| gpu-node4 || GeForce GTX 1080 ||  418.39 ||  6.1 ||  2 ||  8.0 ||  4 ||  64.0
-|-
-| gpu-node5 || GeForce GTX 1080 ||  418.39 ||  6.1 ||  2 ||  8.0 ||  4 ||  64.0
-|-
-| gpu-node6 || GeForce GTX 1080 ||  418.39 ||  6.1 ||  2 ||  8.0 ||  4 ||  64.0
-|-
-| gpu-node7 || GeForce GTX 1080 ||  418.39 ||  6.1 ||  2 ||  8.0 ||  4 ||  64.0  || only for group '''research'''
-|-
-| gpu-node8 || GeForce GTX 1080 ||  418.39 ||  6.1 ||  2 ||  8.0 ||  4 ||  64.0  || only for group '''research'''
-|}
