[NLPL Task Force (A)] [uninett.no #198531] gpu utilization on Saga

Tue Nov 19 08:11:38 UTC 2019

Should be OK now.

On Mon Nov 18 14:25:38 2019, oe at ifi.uio.no wrote:

    dear colleagues,

    some of our NLPL users point out that for the past several days it has
    been very slow to see 'vanilla' single-gpu jobs scheduled on Saga.

    just now, it appears that one user has effectively saturated the gpu
    queue, but their jobs actually hardly seem to utilize the gpus
    currently.  please see the attached results of the following commands

    squeue -p accel > /tmp/accel
    for i in $(squeue -p accel | egrep 'c[0-9]-[0-9]$' \
                 | awk '{print $NF}' | sort -u); do \
      ssh $i nvidia-smi | grep Default; \
    done > ~/nvidia-smi.log

    i realize it is difficult to 'police' users, but in this specific case
    i feel this colleague might benefit from some feedback on 'good' usage
    patterns, and more generally i have been wondering whether the
    scheduler could seek to maintain some fairness across users, i.e.
    prohibit a single account from being granted the vast bulk of
    available resources (while there are pending jobs by other users)?

    with thanks in advance, oe
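
For reference, the utilization check quoted above could also be written as a small script that reports per-GPU numbers rather than grepping the human-readable table. This is only a minimal sketch under stated assumptions: the function names `nodes_from_squeue` and `gpu_report` are made up for illustration, the node-name pattern c<digits>-<digits> is a guess at the naming scheme, and jobs are assumed to run on a single node each, so that the last squeue column is a plain node name.

```shell
#!/bin/sh
# Hypothetical helper: read `squeue` output on stdin and print the unique
# node names found in the last column. The pattern c<digits>-<digits> is an
# assumption about node naming; pending jobs show a reason such as
# "(Priority)" in that column and are filtered out by the pattern.
nodes_from_squeue() {
  awk '{print $NF}' | grep -E '^c[0-9]+-[0-9]+$' | sort -u
}

# Hypothetical helper: for every node with a running job in the accel
# partition, print one CSV line per GPU, prefixed with the node name:
#   node, gpu index, utilization, memory used
# Assumes password-less ssh to the compute nodes, as in the original loop.
gpu_report() {
  for node in $(squeue -h -p accel -t RUNNING | nodes_from_squeue); do
    ssh "$node" nvidia-smi \
      --query-gpu=index,utilization.gpu,memory.used \
      --format=csv,noheader | sed "s/^/$node, /"
  done
}
```

After sourcing the file, `gpu_report > ~/nvidia-smi.csv` would collect the report in one place. Deduplicating after the node name has been extracted means each node is queried only once, even when it hosts several jobs.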