Submitting GPU Jobs
Latest revision as of 11:37, 7 December 2023
Start by reading the Submitting CPU Jobs page.
GPU jobs are submitted to the gpu partition.
To request one GPU card, use the #SBATCH -G 1 directive in a batch script or the -G 1 option on the command line. The submitted job has CUDA_VISIBLE_DEVICES set appropriately, so all CUDA applications should use only the allocated GPUs.
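As a minimal sketch of the directive form, a batch script could look like the following. The report_gpus function is a hypothetical helper added only for illustration; it prints whatever Slurm placed in CUDA_VISIBLE_DEVICES, or "none" when the script is run outside a job.

```shell
#!/bin/bash
# Submit to the gpu partition and request one GPU card.
#SBATCH -p gpu
#SBATCH -G 1

# Hypothetical helper: Slurm sets CUDA_VISIBLE_DEVICES inside the job;
# outside a job the variable is normally unset, so "none" is printed.
report_gpus() {
    echo "Allocated GPUs: ${CUDA_VISIBLE_DEVICES:-none}"
}

report_gpus
```

Assuming the script is saved as job.sh (a placeholder name), it would be submitted with sbatch job.sh.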
Rules
- Always use GPUs via sbatch (or srun), never via ssh. You can ssh to any machine, e.g. to run nvidia-smi or htop, but not to start computing on a GPU.
- Don't forget to specify your RAM requirements with e.g. --mem=10G.
- Always specify the number of GPU cards (e.g. -G 1). For example, to start an interactive shell with two GPUs and 64 GB of RAM:
srun -p gpu --mem=64G -G 2 --pty bash
- For interactive jobs, you can use srun, but make sure to end your job as soon as you don't need the GPU (so don't use srun for long training).
- In general, don't reserve a GPU (as described above) for a long time without actually using it. For example, separate the steps that need a GPU from those that do not, and run them separately on our GPU and CPU clusters, respectively.
- If you know the approximate runtime of your job, please specify it with -t <time>. Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds".
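To make the time formats in the last rule concrete, here is a rough sketch of how each of the six forms maps to a duration. The function slurm_time_to_seconds is a hypothetical helper written for this page, not a Slurm command, and it does no input validation.

```shell
#!/bin/bash
# Hypothetical helper: convert a -t value in one of the six listed formats to seconds.
slurm_time_to_seconds() {
    local spec=$1 days=0 hours=0 minutes=0 seconds=0
    if [[ $spec == *-* ]]; then
        # Forms with a day part: days-hours[:minutes[:seconds]]
        days=${spec%%-*}
        IFS=: read -r hours minutes seconds <<< "${spec#*-}"
    else
        # Forms without days: minutes, minutes:seconds, hours:minutes:seconds
        local a b c
        IFS=: read -r a b c <<< "$spec"
        if [[ -n $c ]]; then
            hours=$a; minutes=$b; seconds=$c
        elif [[ -n $b ]]; then
            minutes=$a; seconds=$b
        else
            minutes=$a
        fi
    fi
    # 10# forces base 10 so that values such as "08" are not read as octal.
    echo $(( 10#${days:-0} * 86400 + 10#${hours:-0} * 3600 \
           + 10#${minutes:-0} * 60 + 10#${seconds:-0} ))
}

slurm_time_to_seconds "90"        # 90 minutes
slurm_time_to_seconds "2:30:00"   # 2 hours 30 minutes
slurm_time_to_seconds "1-12"      # 1 day 12 hours
```

Note that a bare number is read as minutes, so -t 90 requests an hour and a half, not 90 seconds.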
CUDA and cuDNN
CUDA is installed under
/lnet/aic/opt/cuda/
and as of Apr 2023, the available versions are 10.1, 10.2, 11.2, 11.7, 11.8.
The cuDNN library is also available in the subdirectory cudnn/VERSION/lib64 of the respective CUDA directories.
Therefore, to use CUDA 11.2 with cuDNN 8.1.1, you should add the following to your .profile:
export PATH="/lnet/aic/opt/cuda/cuda-11.2/bin:$PATH"
export LD_LIBRARY_PATH="/lnet/aic/opt/cuda/cuda-11.2/lib64:/lnet/aic/opt/cuda/cuda-11.2/cudnn/8.1.1/lib64:/lnet/aic/opt/cuda/cuda-11.2/extras/CUPTI/lib64:$LD_LIBRARY_PATH"
export XLA_FLAGS=--xla_gpu_cuda_data_dir=/lnet/aic/opt/cuda/cuda-11.2 # XLA configuration if you are using TensorFlow
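If you switch between versions, the same three exports can be wrapped in a small function. This is only a sketch: use_cuda is a hypothetical name, and it assumes that the other CUDA versions follow the same cuda-VERSION directory layout as the 11.2 example, which you should check under /lnet/aic/opt/cuda/ first.

```shell
#!/bin/bash
# Hypothetical helper: set the variables from the example above for a given
# CUDA version ($1) and cuDNN version ($2).
# Assumption: every version lives in /lnet/aic/opt/cuda/cuda-VERSION.
use_cuda() {
    local cuda_root="/lnet/aic/opt/cuda/cuda-$1"
    export PATH="$cuda_root/bin:$PATH"
    export LD_LIBRARY_PATH="$cuda_root/lib64:$cuda_root/cudnn/$2/lib64:$cuda_root/extras/CUPTI/lib64:${LD_LIBRARY_PATH:-}"
    # XLA configuration, only needed if you are using TensorFlow.
    export XLA_FLAGS="--xla_gpu_cuda_data_dir=$cuda_root"
}

use_cuda 11.2 8.1.1
echo "$XLA_FLAGS"
```

Calling the function repeatedly keeps prepending directories, so it is best called once per shell session.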
CUDA modules
CUDA 11.2 and later can also be loaded as modules. Loading a module sets the required environment variables for you, so no manual path setup is needed.
On a GPU node, you can do the following:
- list available modules with:
module avail
- load the version you need (possibly specifying the version of cuDNN):
module load <modulename>
- you can unload the module with:
module unload <modulename>
As of Apr 2023, the available modules are:
cuda/11.2
cuda/11.2-cudnn8.1
cuda/11.7
cuda/11.7-cudnn8.5
cuda/11.8
cuda/11.8-cudnn8.5
cuda/11.8-cudnn8.6
cuda/11.8-cudnn8.9
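Inside a job script, loading a module might be wrapped as in the sketch below, so that the script fails with a clear message on a machine where the module command is missing. The function name load_cuda_module is hypothetical, and the default module name is simply one entry from the list above.

```shell
#!/bin/bash
# Hypothetical helper: load a CUDA module if the module command exists in this shell.
load_cuda_module() {
    local name=${1:-cuda/11.8-cudnn8.9}
    if command -v module >/dev/null 2>&1; then
        module load "$name"
    else
        echo "module command not found; run this on a GPU node" >&2
        return 1
    fi
}

load_cuda_module cuda/11.8-cudnn8.9 || echo "CUDA module not loaded"
```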
List of installed GPUs
root@gpu-node1:~# nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-ba293e60-32f9-6907-705b-e053d1bf453b)
GPU 1: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-b29fe79f-6192-5ece-6f91-e59d97ab304e)
GPU 2: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-fbcba5d0-61bd-cc4c-810e-c80cbd9cd563)
GPU 3: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-76ae3ae7-0a2d-ea68-3070-94c919f40169)
GPU 4: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-18174817-13c3-b930-1d68-37c47b41dc0b)
GPU 5: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-3af8a5c5-9e07-9468-e9dc-e1259f3e7890)
GPU 6: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-51e376db-189b-b11d-bd27-bbbb6470ff26)

root@gpu-node2:~# nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-15b17780-d818-bcd2-566c-564aa1dfc38e)
GPU 1: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-e184b0d4-7147-af43-041b-caa7f597363a)
GPU 2: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-ac1a453e-1c30-3fe0-e246-dd07c7645066)
GPU 3: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-4d19d859-d044-fdc8-17e0-e84fef4a8a13)
GPU 4: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-8035e3f3-76c9-124f-c5ea-d1dd4369f2a8)
GPU 5: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-670d0788-a048-8eef-ad1b-1eb77b18980b)
GPU 6: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-18d030c6-5956-f45f-7d15-ab53cffa813e)
GPU 7: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-f7940219-84a7-8c9c-386f-14e4043c9884)
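To see at a glance how many cards of each model a node has, the nvidia-smi -L output can be summarized with a short filter. The sketch below uses a hypothetical function, summarize_gpus, and feeds it two sample lines copied from the gpu-node1 listing; on a real node you would pipe nvidia-smi -L into it instead.

```shell
#!/bin/bash
# Hypothetical helper: count GPUs per model from `nvidia-smi -L` output on stdin.
summarize_gpus() {
    awk '{
        sub(/^GPU [0-9]+: /, "")      # drop the "GPU N: " prefix
        sub(/ \(UUID: .*\)$/, "")     # drop the " (UUID: ...)" suffix
        count[$0]++
    }
    END { for (model in count) print count[model] " x " model }' | sort
}

# Two sample lines copied from the gpu-node1 listing.
printf '%s\n' \
    'GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-ba293e60-32f9-6907-705b-e053d1bf453b)' \
    'GPU 1: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-b29fe79f-6192-5ece-6f91-e59d97ab304e)' |
    summarize_gpus
```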