Submitting GPU Jobs

From UFAL AIC
Revision as of 00:43, 4 November 2021 by Straka

Start by reading the Submitting CPU Jobs page.

GPU jobs are submitted to the gpu.q queue.

To ask for a GPU card, use -l gpu=NUMBER_OF_REQUIRED_GPUS. The submitted job has CUDA_VISIBLE_DEVICES set appropriately, so all CUDA applications should use only the allocated GPUs.
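As a quick sanity check inside a job, the number of allocated GPUs can be read back from CUDA_VISIBLE_DEVICES. The following is only a minimal sketch: the helper name count_visible_gpus is hypothetical, and it assumes the variable holds a comma-separated list of device identifiers (the usual CUDA convention).

```shell
# Hypothetical helper: count the GPUs listed in CUDA_VISIBLE_DEVICES.
# Assumes a comma-separated list such as "0" or "0,1".
count_visible_gpus() {
    devices="${CUDA_VISIBLE_DEVICES:-}"
    if [ -z "$devices" ]; then
        # Variable unset or empty: no GPU was allocated.
        echo 0
        return
    fi
    # The number of devices is the number of commas plus one.
    commas=$(printf '%s' "$devices" | tr -cd ',' | wc -c)
    echo $((commas + 1))
}

echo "Allocated GPUs: $(count_visible_gpus)"
```

If the printed count does not match the value passed to -l gpu=, the job was probably not started through qsub or qrsh as described on this page.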

TL;DR: You can submit a non-interactive job requiring %M% GB RAM, %C% CPUs (at most 2) and %G% GPUs (at most 2, but see Quotas) by running

qsub -q gpu.q -cwd -b y -pe smp %C% -l gpu=%G%,mem_free=%M%G,act_mem_free=%M%G,h_data=%M%G path_to_binary arguments

To start an interactive terminal session, use

qrsh -q gpu.q -cwd -b y -pe smp %C% -l gpu=%G%,mem_free=%M%G,act_mem_free=%M%G,h_data=%M%G -pty yes bash -l
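To make the placeholders concrete, the sketch below fills %M%, %C% and %G% into the non-interactive command and prints the result instead of submitting it. The helper name gpu_qsub_cmd and the ./train.py program with its flag are hypothetical and serve only as an illustration.

```shell
# Hypothetical helper: print the qsub command for the given resources
# instead of running it, so it can be inspected before submission.
# Usage: gpu_qsub_cmd MEM_GB CPUS GPUS COMMAND [ARGS...]
gpu_qsub_cmd() {
    mem_gb="$1"
    cpus="$2"
    gpus="$3"
    shift 3
    # The same memory value is used for mem_free, act_mem_free and h_data,
    # exactly as in the template command.
    printf '%s\n' "qsub -q gpu.q -cwd -b y -pe smp ${cpus} -l gpu=${gpus},mem_free=${mem_gb}G,act_mem_free=${mem_gb}G,h_data=${mem_gb}G $*"
}

# Example: 8 GB RAM, 2 CPUs, 1 GPU; ./train.py and --epochs are placeholders.
gpu_qsub_cmd 8 2 1 ./train.py --epochs 10
# prints: qsub -q gpu.q -cwd -b y -pe smp 2 -l gpu=1,mem_free=8G,act_mem_free=8G,h_data=8G ./train.py --epochs 10
```

Printing the command first is a deliberate choice in this sketch: it allows checking the resource string before anything is queued.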

Rules

  • Always use GPUs via qsub (or qrsh), never via ssh. You can ssh to any machine e.g. to run nvidia-smi or htop, but not to start computing on GPU.
  • Don't forget to specify your RAM requirements with e.g. -l mem_free=8G,act_mem_free=8G,h_data=12G.
    • Note that you need to use h_data instead of h_vmem for GPU jobs. The CUDA driver allocates a lot of "unused" virtual memory (tens of GB per card), which is counted in h_vmem, but not in h_data. All usual allocations (malloc, new, Python allocations) seem to be included in h_data.
  • Always specify the number of GPU cards (e.g. gpu=1). For example: qsub -q gpu.q -l gpu=1
  • For interactive jobs, you can use qrsh, but make sure to end your job as soon as you don't need the GPU (so don't use qrsh for long training). Warning: -pty yes bash -l is necessary, otherwise the variable $CUDA_VISIBLE_DEVICES will not be set correctly. E.g. qrsh -q gpu.q -l gpu=1 -pty yes bash -l
  • In general: don't reserve a GPU (as described above) without actually using it for a long time. For example, try to separate the steps which need a GPU from the steps which do not, and run them separately on our GPU and CPU clusters, respectively.
  • If you know an approximate runtime of your job, please specify it with -l s_rt=hh:mm:ss - this is a soft constraint so your job won't be killed if it runs longer than specified.
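The hh:mm:ss value for -l s_rt can be derived from a runtime estimate in seconds. This is a small sketch only; the helper name format_s_rt is hypothetical.

```shell
# Hypothetical helper: turn a runtime estimate in seconds into hh:mm:ss.
format_s_rt() {
    total="$1"
    printf '%02d:%02d:%02d\n' $((total / 3600)) $((total % 3600 / 60)) $((total % 60))
}

# Example: a job expected to take about 90 minutes.
printf '%s\n' "-l s_rt=$(format_s_rt 5400)"
# prints: -l s_rt=01:30:00
```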

CUDA and cuDNN

Default CUDA (10.1 as of Nov 2019) is available in

 /opt/cuda

Specific versions can be found in

 /lnet/aic/opt/cuda/cuda-{9.0,9.2,10.0,10.1,10.2,11.2,...}

Depending on what version you need, you should add LD_LIBRARY_PATH="/lnet/aic/opt/cuda/cuda-X.Y/lib64:$LD_LIBRARY_PATH" to your configuration.

Regarding cuDNN:

  • for CUDA 9.0, 9.2, 10.0 and 10.1, cuDNN is available directly in the lib64 directory of the respective CUDA, so there is no need to configure it separately;
  • for CUDA 10.2 and later, cuDNN is available in the cudnn/VERSION/lib64 subdirectory of the respective CUDA, so you need to add LD_LIBRARY_PATH="/lnet/aic/opt/cuda/cuda-X.Y/cudnn/VERSION/lib64:$LD_LIBRARY_PATH" to your configuration.
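The two cases above can be combined into one helper that prints the library directories for a chosen CUDA version. This is a sketch under stated assumptions: the function name cuda_lib_dirs is hypothetical, and the cuDNN version "8.1" in the example is a placeholder, not a statement about which versions are installed.

```shell
# Hypothetical helper: print the library directories for a CUDA version.
# Usage: cuda_lib_dirs CUDA_VERSION [CUDNN_VERSION]
# Pass CUDNN_VERSION only for CUDA 10.2 and later, where cuDNN lives in
# its own cudnn/VERSION/lib64 subdirectory.
cuda_lib_dirs() {
    cuda_root="/lnet/aic/opt/cuda/cuda-$1"
    dirs="${cuda_root}/lib64"
    if [ -n "${2:-}" ]; then
        dirs="${cuda_root}/cudnn/$2/lib64:${dirs}"
    fi
    printf '%s\n' "$dirs"
}

# Example for CUDA 11.2 with a placeholder cuDNN version; check the cudnn/
# subdirectory of the chosen CUDA for the versions that actually exist.
# The ${VAR:+...} form avoids a trailing colon when the variable is empty.
export LD_LIBRARY_PATH="$(cuda_lib_dirs 11.2 8.1)${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

For CUDA 10.1 and earlier, calling cuda_lib_dirs with only the CUDA version yields just the lib64 directory, which already contains cuDNN.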

Available GPU Cards

The GPU part of the cluster consists of the following nodes:

machine    GPU type          GPU driver version  CC   GPU count  GPU RAM (GB)  CPU cores  machine RAM (GB)  remarks
gpu-node1  GeForce GTX 1080  418.39              6.1  2          8.0           4          64.0
gpu-node2  GeForce GTX 1080  418.39              6.1  2          8.0           4          64.0
gpu-node3  GeForce GTX 1080  418.39              6.1  2          8.0           4          64.0
gpu-node4  GeForce GTX 1080  418.39              6.1  2          8.0           4          64.0
gpu-node5  GeForce GTX 1080  418.39              6.1  2          8.0           4          64.0
gpu-node6  GeForce GTX 1080  418.39              6.1  2          8.0           4          64.0
gpu-node7  GeForce GTX 1080  418.39              6.1  2          8.0           4          64.0              only for group research
gpu-node8  GeForce GTX 1080  418.39              6.1  2          8.0           4          64.0              only for group research