Difference between revisions of "Submitting GPU Jobs"

Revision as of 11:55, 2 December 2022

Start by reading the Submitting CPU Jobs page.

GPU jobs are submitted to the gpu partition (-p gpu).

To request one GPU card, use the #SBATCH --gres=gpu:1 directive in your batch script, or the --gres=gpu:1 option on the command line. Slurm sets CUDA_VISIBLE_DEVICES for the job, so CUDA applications see only the allocated GPUs.
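A minimal batch script combining these directives could look like the sketch below. The file name gpu_job.sh and the workload python train.py are hypothetical placeholders, and the memory and time values are only examples; adjust them to your job.

```shell
#!/bin/bash
# Write a minimal GPU batch script to gpu_job.sh (hypothetical file name).
# The quoted 'EOF' keeps $CUDA_VISIBLE_DEVICES unexpanded here; it is
# expanded later, when the job runs on the allocated node.
cat > gpu_job.sh <<'EOF'
#!/bin/bash
# Run in the GPU partition.
#SBATCH -p gpu
# Request one GPU card.
#SBATCH --gres=gpu:1
# RAM for the job (example value).
#SBATCH --mem=10G
# Expected runtime, hours:minutes:seconds (example value).
#SBATCH -t 2:00:00

# Slurm sets CUDA_VISIBLE_DEVICES to the allocated GPU(s).
echo "Allocated GPUs: $CUDA_VISIBLE_DEVICES"
nvidia-smi

# Hypothetical placeholder for your own GPU workload.
python train.py
EOF
```

Submit the script with sbatch gpu_job.sh. Keeping the options in #SBATCH lines means the resource request is stored together with the job it describes.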

Rules

Always use GPUs via sbatch (or srun), never via ssh. You can ssh to any machine, e.g. to run nvidia-smi or htop, but not to start computations on a GPU.
Don't forget to specify your RAM requirements, e.g. --mem=10G.
Always specify the number of GPU cards (e.g. --gres=gpu:1). For example, an interactive shell with two GPUs: srun -p gpu --mem=64G --gres=gpu:2 --pty bash
For interactive jobs you can use srun, but end the job as soon as you no longer need the GPU (so don't use srun for long training runs).
In general, don't keep a GPU reserved for a long time without actually using it. For example, separate the steps that need a GPU from those that don't, and run them on our GPU and CPU cluster, respectively.
If you know the approximate runtime of your job, please specify it with -t (or --time). Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds".
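To make these formats concrete, the hypothetical helper slurm_time_to_seconds below converts a -t value into seconds, following the interpretations listed above: a single number means minutes, and a value containing a dash starts with days. It only illustrates how the formats are read; Slurm parses the value itself and the helper is not part of Slurm.

```shell
#!/bin/bash
# slurm_time_to_seconds is a hypothetical helper, not a Slurm command.
# It converts a -t / --time value into seconds to illustrate the formats.
slurm_time_to_seconds() {
    local t=$1 days=0 h=0 m=0 s=0
    if [[ $t == *-* ]]; then
        # "days-hours", "days-hours:minutes", "days-hours:minutes:seconds"
        days=${t%%-*}
        IFS=: read -r h m s <<< "${t#*-}"
    else
        # "minutes", "minutes:seconds", "hours:minutes:seconds"
        local -a parts
        IFS=: read -r -a parts <<< "$t"
        case ${#parts[@]} in
            1) m=${parts[0]} ;;
            2) m=${parts[0]}; s=${parts[1]} ;;
            3) h=${parts[0]}; m=${parts[1]}; s=${parts[2]} ;;
            *) return 1 ;;
        esac
    fi
    # Missing fields count as zero; 10# avoids octal parsing of e.g. "08".
    echo $(( ((10#${days:-0} * 24 + 10#${h:-0}) * 60 + 10#${m:-0}) * 60 + 10#${s:-0} ))
}

slurm_time_to_seconds 90         # 90 minutes, prints 5400
slurm_time_to_seconds 2:00:00    # 2 hours, prints 7200
slurm_time_to_seconds 1-12       # 1 day and 12 hours, prints 129600
```

Note that the same digits mean different things depending on the format: -t 90 is 90 minutes, while -t 1-12 is one day and twelve hours.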

CUDA and cuDNN

Available CUDA versions are in

 /opt/cuda

CUDA modules

You can load recent versions of CUDA as modules. Loading a module sets the required environment variables for you, so CUDA should then work without further configuration.

List the available modules with: module avail
Load the version you need (the module name may also specify the cuDNN version): module load <modulename>
Unload the module when you no longer need it with: module unload <modulename>
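Inside a batch job, the module is loaded in the job script before the GPU command runs. The sketch below writes such a script, assuming the module command is available inside batch jobs as it is in interactive shells. CUDA_MODULE_NAME is a hypothetical placeholder for a name printed by module avail, and gpu_module_job.sh and train.py are hypothetical names as well.

```shell
#!/bin/bash
# Write a batch script that loads a CUDA module before running GPU code.
cat > gpu_module_job.sh <<'EOF'
#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu:1
#SBATCH --mem=10G

# CUDA_MODULE_NAME is a hypothetical placeholder: replace it with a
# module name listed by `module avail` (e.g. one including cuDNN).
module load CUDA_MODULE_NAME
# Record the loaded modules in the job log.
module list

# Hypothetical placeholder for your own GPU workload.
python train.py
EOF
```

Loading the module in the script itself, rather than relying on the environment you submit from, makes the job reproducible when it is rerun later.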
