[NLPL Task Force (A)] [uninett.no #198531] gpu utilization on Saga

Tue Nov 19 08:14:14 UTC 2019

thanks, nikolay!  oe

On Tue, 19 Nov 2019 at 09:12 nikolaiv at uio.no via RT <support at metacenter.no>
wrote:

>
> Shall be OK now.
>
> On Mon Nov 18 14:25:38 2019, oe at ifi.uio.no wrote:
>
>     dear colleagues,
>
>     some of our NLPL users point out that for the past several days it has
>     been very slow to see 'vanilla' single-gpu jobs scheduled on Saga.
>
>     just now, it appears that one user has effectively saturated the gpu
>     queue, but their jobs hardly seem to utilize the gpus at present.
>     please see the attached results of the following commands:
>
>     squeue -p accel > /tmp/accel
>     for i in $(squeue -p accel | egrep 'c[0-9]-[0-9]$' | sort -u \
>         | awk '{print $NF}'); do \
>       ssh $i nvidia-smi | grep Default; \
>     done > ~/nvidia-smi.log
>
>     i realize it is difficult to 'police' users, but in this specific case
>     i feel this colleague might benefit from some feedback on 'good' usage
>     patterns, and more generally i have been wondering whether the
>     scheduler could seek to maintain some fairness across users, i.e.
>     prohibit a single account from being granted the vast bulk of
>     available resources (while there are pending jobs by other users)?
>
>     with thanks in advance, oe
>