[NLPL Task Force (A)] gpu utilization on Saga

Mon Nov 18 13:25:16 UTC 2019

dear colleagues,

some of our NLPL users point out that for the past several days
'vanilla' single-gpu jobs have been very slow to get scheduled on Saga.

just now, it appears that one user has effectively saturated the gpu
queue, but their jobs currently seem to make hardly any use of the
gpus.  please see the attached results of the following commands:

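# snapshot of the accel (gpu) partition queue
squeue -p accel > /tmp/accel
# collect the unique node names (last column), then query each node once
for i in $(squeue -p accel | grep -E 'c[0-9]-[0-9]$' | awk '{print $NF}' | sort -u); do
  ssh "$i" nvidia-smi | grep Default
done > ~/nvidia-smi.log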
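
as a rough sketch of how such a log could be summarized: each 'Default'
line from nvidia-smi carries a 'NN%' utilization field, so counting the
lines that report 0% gives a quick idle-versus-total tally.  the helper
name count_idle_gpus and the sample lines are made up for illustration,
and the field layout is assumed to match the classic nvidia-smi table,
which may differ across driver versions.

```shell
# count_idle_gpus: hypothetical helper; reads nvidia-smi 'Default' lines on
# stdin and prints "<idle> <total>", where idle means GPU-Util shows 0%.
count_idle_gpus() {
  awk '
    {
      total++
      util = ""
      # the last "NN%" token on the line is GPU-Util (an earlier one, if
      # present, would be the fan speed)
      for (f = 1; f <= NF; f++) {
        if ($f ~ /^[0-9]+%$/) util = $f
      }
      if (util == "0%") idle++
    }
    END { printf "%d %d\n", idle, total }
  '
}

# illustrative input only, not taken from the attached log
sample='| N/A   34C    P0    25W / 250W |      0MiB / 16280MiB |      0%      Default |
| N/A   51C    P0   180W / 250W |  15021MiB / 16280MiB |     97%      Default |
| N/A   33C    P0    26W / 250W |    305MiB / 16280MiB |      0%      Default |'

printf '%s\n' "$sample" | count_idle_gpus   # prints: 2 3
```

on a real log this would be run as: count_idle_gpus < ~/nvidia-smi.log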

i realize it is difficult to 'police' users, but in this specific case
i feel this colleague might benefit from some feedback on 'good' usage
patterns.  more generally, i have been wondering whether the scheduler
could seek to maintain some fairness across users, i.e. prohibit a
single account from being granted the vast bulk of available resources
while jobs by other users are pending?
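
for what it is worth, slurm appears to offer two mechanisms that may be
relevant here: fair-share priority (via the multifactor priority plugin)
and per-user TRES limits on a QOS.  the sketch below wraps the relevant
commands in two hypothetical functions; the qos name and the gpu limit
are placeholders and not Saga's actual configuration, and such a limit
would presumably only take effect where accounting enforcement and
gres/gpu tracking are enabled.

```shell
# show_fairness: hypothetical wrapper; read-only look at fair-share state
show_fairness() {
  sshare -a    # fair-share usage per account and user
  sprio -l     # priority factors of pending jobs
}

# cap_gpus_per_user: hypothetical wrapper; admin-only.  limits how many
# gpus a single user may hold at once under the given qos.
#   $1 = qos name (placeholder), $2 = max gpus per user (placeholder)
cap_gpus_per_user() {
  sacctmgr modify qos where name="$1" set MaxTRESPerUser=gres/gpu="$2"
}
```

an administrator might then call, for example, cap_gpus_per_user normal 4,
where both values are purely illustrative.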

with thanks in advance, oe
-------------- next part --------------
A non-text attachment was scrubbed...
Name: accel.log
Type: text/x-log
Size: 55005 bytes
Desc: not available
URL: <http://lists.nlpl.eu/archives/infrastructure/attachments/20191118/9beed890/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nvidia-smi.log
Type: text/x-log
Size: 7680 bytes
Desc: not available
URL: <http://lists.nlpl.eu/archives/infrastructure/attachments/20191118/9beed890/attachment-0001.bin>