Difference between revisions of "Submitting GPU Jobs"

Revision as of 11:06, 29 January 2020

Start by reading the Submitting CPU Jobs page.

GPU jobs are submitted to the gpu.q queue.

To ask for a GPU card, use -l gpu=NUMBER_OF_REQUIRED_GPUS.

Note that the number of CPU cores (-pe smp CPU_CORES) can differ from the number of GPUs (-l gpu=GPUS).
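As an illustration of the options above, here is a minimal sketch of a job script that asks for one GPU and four CPU cores. It assumes the cluster runs a Grid Engine scheduler that reads lines starting with "#$" as qsub options and sets NSLOTS for -pe jobs. The file name job.sh, the helper describe_allocation, the requested numbers, and the commented-out workload are illustrative only and not part of this cluster's setup.

```shell
#!/bin/bash
# Sketch of a GPU job script; submit it with: qsub job.sh  (job.sh is a hypothetical name).
# Lines starting with "#$" are read by qsub as if given on the command line.
#$ -q gpu.q
#$ -l gpu=1
#$ -pe smp 4
#$ -l mem_free=8G,act_mem_free=8G,h_data=12G

# Hypothetical helper: print what the scheduler granted to this job.
# NSLOTS is set by Grid Engine for -pe jobs; CUDA_VISIBLE_DEVICES is
# assumed to be set by the cluster for GPU jobs.
describe_allocation() {
    echo "cpu_cores=${NSLOTS:-unknown} gpus=${CUDA_VISIBLE_DEVICES:-none}"
}

describe_allocation

# The actual workload would follow here (placeholder, not a real script):
# python train.py
```

Logging the allocation at the start of a job makes it easier to check later whether the job really received the number of cores and GPUs it asked for.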

Rules

Always use GPUs via qsub (or qrsh), never via ssh. You can ssh to any machine, e.g. to run nvidia-smi or htop, but not to start computing on a GPU. Don't forget to specify your RAM requirements, e.g. with -l mem_free=8G,act_mem_free=8G,h_data=12G.
- Note that you need to use h_data instead of h_vmem for GPU jobs. The CUDA driver allocates a lot of "unused" virtual memory (tens of GB per card), which is counted in h_vmem but not in h_data. All usual allocations (malloc, new, Python allocations) seem to be included in h_data.
Always specify the number of GPU cards (e.g. gpu=1), for example: qsub -q 'gpu*' -l gpu=1
For interactive jobs, you can use qrsh, but make sure to end your job as soon as you no longer need the GPU (so don't use qrsh for long training). Warning: -pty yes bash -l is necessary, otherwise the variable $CUDA_VISIBLE_DEVICES will not be set correctly. E.g. qrsh -q 'gpu*' -l gpu=1 -pty yes bash -l
In general, don't reserve a GPU (as described above) for a long time without actually using it. For example, try to separate the steps that need a GPU from those that do not, and run them separately on our GPU and CPU clusters, respectively.
If you know the approximate runtime of your job, please specify it with -l s_rt=hh:mm:ss. This is a soft constraint, so your job won't be killed if it runs longer than specified.
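One way to follow the first rule in practice is to have a script refuse to start GPU work when no GPU was assigned to it. The sketch below assumes that $CUDA_VISIBLE_DEVICES is set only inside jobs started through qsub or qrsh with -l gpu=N, which the rules above suggest but do not guarantee. The function name require_gpu_allocation is hypothetical.

```shell
#!/bin/bash
# Hypothetical guard: stop early when no GPU was assigned by the scheduler.
# Assumption: CUDA_VISIBLE_DEVICES is set only inside qsub/qrsh GPU jobs.
require_gpu_allocation() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo "No GPU assigned: start this via qsub or qrsh with -l gpu=N, not ssh." >&2
        return 1
    fi
    echo "Using GPU(s): ${CUDA_VISIBLE_DEVICES}"
}

# Typical use at the top of a job script:
# require_gpu_allocation || exit 1
```

The guard only checks an environment variable, so it cannot detect a variable that was set by hand in an ssh session. It is a reminder, not an enforcement mechanism.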

CUDA and CUDNN

The default CUDA (10.1 as of Nov 2019) is available in

 /opt/cuda

Specific versions can be found in

 /lnet/aic/opt/cuda/cuda-{9.0,9.2,10.0,10.1,...}

Depending on which version you need, add LD_LIBRARY_PATH="/lnet/aic/opt/cuda/cuda-X.Y/lib64:$LD_LIBRARY_PATH" to your configuration.

CUDNN is available directly in the lib64 directory of the respective CUDA installation, so there is no need to configure it separately.
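The LD_LIBRARY_PATH setting above can be wrapped in a small shell function so that switching versions is a single call. This is only a sketch: the names CUDA_VERSIONS_ROOT, cuda_lib_dir and use_cuda are hypothetical, and the only path it relies on is the versioned directory layout listed above.

```shell
#!/bin/bash
# Root of the versioned CUDA installations, as listed above.
CUDA_VERSIONS_ROOT="/lnet/aic/opt/cuda"

# Print the lib64 directory for a CUDA version such as 10.1.
cuda_lib_dir() {
    echo "${CUDA_VERSIONS_ROOT}/cuda-$1/lib64"
}

# Hypothetical helper: prepend that directory to LD_LIBRARY_PATH.
use_cuda() {
    local dir
    dir="$(cuda_lib_dir "$1")"
    # Warn, but continue, when the directory is missing on this machine.
    [ -d "$dir" ] || echo "warning: $dir not found on this machine" >&2
    # ${VAR:+...} avoids a trailing ':' when LD_LIBRARY_PATH was empty.
    export LD_LIBRARY_PATH="${dir}${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}

# Example: select CUDA 10.1 before starting a program.
# use_cuda 10.1
```

Omitting the trailing colon matters because an empty entry in LD_LIBRARY_PATH makes the dynamic linker search the current directory as well.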

Available GPU Cards

The GPU part of the cluster consists of the following nodes:

machine	GPU type	GPU driver version	compute capability	GPU count	GPU RAM (GB)	machine RAM (GB)	remarks
gpu-node1	GeForce GTX 1080	418.39	6.1	2	8.0	64.0
gpu-node2	GeForce GTX 1080	418.39	6.1	2	8.0	64.0
gpu-node3	GeForce GTX 1080	418.39	6.1	2	8.0	64.0
gpu-node4	GeForce GTX 1080	418.39	6.1	2	8.0	64.0
gpu-node5	GeForce GTX 1080	418.39	6.1	2	8.0	64.0
gpu-node6	GeForce GTX 1080	418.39	6.1	2	8.0	64.0
gpu-node7	GeForce GTX 1080	418.39	6.1	2	8.0	64.0	only for group research
gpu-node8	GeForce GTX 1080	418.39	6.1	2	8.0	64.0	only for group research

@@ Line 2: / Line 2: @@
 The GPU jobs are submitted to <code>gpu.q</code> queue.
+To ask for a GPU card, use <code>-l gpu=NUMBER_OF_REQUIRED_GPUS</code>.
+Note that you can ask a different number of CPU cores (<code>-pe smp CPU_CORES</code>) and GPUs (<code>-l gpu=GPUS</code>).
+== Rules ==
+* Always use GPUs via ''qsub'' (or ''qrsh''), never via ''ssh''. You can ssh to any machine e.g. to run ''nvidia-smi'' or ''htop'', but not to start computing on GPU. Don't forget to specify you RAM requirements with e.g. ''-l mem_free=8G,act_mem_free=8G,h_data=12G''.
+** Note that you need to use ''h_data'' instead of ''h_vmem'' for GPU jobs. CUDA driver allocates a lot of "unused" virtual memory (tens of GB per card), which is counted in ''h_vmem'', but not in ''h_data''. All usual allocations (''malloc'', ''new'', Python allocations) seem to be included in ''h_data''.
+* Always specify the number of GPU cards (e.g. ''gpu=1''). Thus e.g. <code>qsub -q 'gpu*' -l gpu=1</code>
+* For interactive jobs, you can use ''qrsh'', but make sure to end your job as soon as you don't need the GPU (so don't use qrsh for long training). '''Warning: <code>-pty yes bash -l</code> is necessary''', otherwise the variable ''$CUDA_VISIBLE_DEVICES'' will not be set correctly. E.g. <code>qrsh -q 'gpu*' -l gpu=1 -pty yes bash -l</code>
+* In general: don't reserve a GPU (as described above) without actually using it for longer time, e.g., try separating steps which need GPU and steps which do not and execute those separately on our GPU resp. CPU cluster.
+* If you know an approximate runtime of your job, please specify it with ''-l s_rt=hh:mm:ss'' - this is a soft constraint so your job won't be killed if it runs longer than specified.
+== CUDA and CUDNN ==
+Default CUDA (currently 10.1 as of Nov 2019) is available in
+  /opt/cuda
+Specific version can be found in
+  /lnet/aic/opt/cuda/cuda-{9.0,9.2,10.0,10.1,...}
+Depending on what version you need, you should add <code>LD_LIBRARY_PATH="/lnet/aic/opt/cuda/cuda-X.Y/lib64:$LD_LIBRARY_PATH"</code> to your configuration.
+CUDNN is available directly in ''lib64'' directory of the respective CUDA, so no need to configure it specifically.
 == Available GPU Cards ==
@@ Line 27: / Line 50: @@
 | gpu-node8 || GeForce GTX 1080 ||  418.39 ||  6.1 ||  2 ||  8.0 ||  64.0  || only for group '''research'''
 |}
-== Rules ==
-* Always use GPUs via ''qsub'' (or ''qrsh''), never via ''ssh''. You can ssh to any machine e.g. to run ''nvidia-smi'' or ''htop'', but not to start computing on GPU. Don't forget to specify you RAM requirements with e.g. ''-l mem_free=8G,act_mem_free=8G,h_data=12G''.
-** Note that you need to use ''h_data'' instead of ''h_vmem'' for GPU jobs. CUDA driver allocates a lot of "unused" virtual memory (tens of GB per card), which is counted in ''h_vmem'', but not in ''h_data''. All usual allocations (''malloc'', ''new'', Python allocations) seem to be included in ''h_data''.
-* Always specify the number of GPU cards (e.g. ''gpu=1''). Thus e.g. <code>qsub -q 'gpu*' -l gpu=1</code>
-* If you need more than one GPU card (on a single machine), always require as many CPU cores (''-pe smp X'') as many GPU cards you need. E.g. <code>qsub -q 'gpu*' -l gpu=2 -pe smp 4</code>
-* For interactive jobs, you can use ''qrsh'', but make sure to end your job as soon as you don't need the GPU (so don't use qrsh for long training). '''Warning: <code>-pty yes bash -l</code> is necessary''', otherwise the variable ''$CUDA_VISIBLE_DEVICES'' will not be set correctly. E.g. <code>qrsh -q 'gpu*' -l gpu=1 -pty yes bash -l</code>
-* In general: don't reserve a GPU (as described above) without actually using it for longer time, e.g., try separating steps which need GPU and steps which do not and execute those separately on our GPU resp. CPU cluster.
-* If you know an approximate runtime of your job, please specify it with ''-l s_rt=hh:mm:ss'' - this is a soft constraint so your job won't be killed if it runs longer than specified.
-== CUDA and CUDNN ==
-Default CUDA (currently 10.1 as of Nov 2019) is available in
-  /opt/cuda
-Specific version can be found in
-  /lnet/aic/opt/cuda/cuda-{9.0,9.2,10.0,10.1,...}
-Depending on what version you need, you should add <code>LD_LIBRARY_PATH="/lnet/aic/opt/cuda/cuda-X.Y/lib64:$LD_LIBRARY_PATH"</code> to your configuration.
-CUDNN is available directly in ''lib64'' directory of the respective CUDA, so no need to configure it specifically.
