[NLPL Task Force (A)] [uninett.no #200768] very inefficient GPU utilization on Saga by one user
Andrei Kutuzov
andreku at ifi.uio.no
Fri Dec 20 20:59:22 UTC 2019
Thanks a lot Anders!
20.12.2019 20:53, Anders Vaage via RT wrote:
> Hi Andrei,
>
> Thanks for the feedback. First of all, I will get in touch with the user and send him a warning; however, I probably won't hear back from him until after the weekend.
>
> You're right that we need to monitor GPU usage more closely (which we are currently working on) and that we need a strategy for ensuring fair usage. I will forward this through the appropriate channels.
>
> Thanks,
>
> Anders Vaage
>
> On Fri Dec 20 18:46:55 2019, andreku at ifi.uio.no wrote:
>> Hi,
>>
>> This issue has been raised before, but the situation has not
>> changed.
>>
>> Right now, again, one particular user (josece,
>> https://www.nhm.uio.no/english/about/organization/research-collections/people/josece/index.html)
>> is occupying all Saga GPUs. There are now 24 active GPU jobs running
>> under this user (some for several days already) and 9 more jobs
>> pending.
>>
>> Even worse, it seems that these jobs do not actually make use of GPUs.
>> The GPU utilization on all the nodes used by josece is 0% (in fact,
>> the CPU utilization is not much higher). But this still effectively
>> prevents other users from getting access to Saga GPUs.
>>
>> I attach the queue listing for the GPU partition (produced by squeue
>> --partition=accel) and an overview of the actual GPU usage by
>> josece's jobs. The overview was produced by running the following
>> commands:
>>
>> for i in $(squeue -p accel | egrep 'c[0-9]-[0-9]$' | sort -u | awk '{print $NF}')
>> do
>>     ssh $i nvidia-smi | grep Default
>> done
>>
>> I do not think this is a good use of Saga's precious GPU
>> resources.
>>
>> Could it be that josece's jobs do not actually need GPUs and could
>> run on the CPU nodes at the same speed? Is it possible to give
>> josece some feedback on that?
>>
>> Also, all Saga users would probably benefit if a limit were imposed
>> on the number of GPU jobs a single user can run (at least while GPU
>> jobs from other users are pending).
>>
>> Thanks in advance!
>
>
>
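P.S. In case it helps with monitoring, here is a possible variant of the loop quoted above. It is only a sketch: the function names unique_nodes, idle_gpus and report_idle are made up for illustration, and it assumes that the node name is the last column of the default squeue output and that ssh to the compute nodes works as in the original loop. It extracts the node column before de-duplicating, so each node is visited once, and it asks nvidia-smi for machine-readable utilization figures instead of grepping the human-readable table.

```shell
#!/usr/bin/env bash
# Sketch only: the helper names below are hypothetical, not Slurm or Saga tools.

# Read squeue output on stdin and print each distinct node name once.
# Assumption: the node name is the last column and has the same shape as in
# the original egrep pattern (c<digits>-<digits>); other entries are skipped.
unique_nodes() {
    awk '$NF ~ /^c[0-9]+-[0-9]+$/ {print $NF}' | sort -u
}

# Read "index, utilization" CSV lines on stdin, as printed by
#   nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits
# and print the index of every GPU whose utilization is 0.
idle_gpus() {
    awk -F', *' '$2 == 0 {print $1}'
}

# Visit every node with running jobs in a partition (default: accel)
# and print one "<node> gpu <index>" line per idle GPU.
report_idle() {
    local partition="${1:-accel}" node
    for node in $(squeue -h -t RUNNING -p "$partition" | unique_nodes); do
        ssh "$node" nvidia-smi --query-gpu=index,utilization.gpu \
            --format=csv,noheader,nounits | idle_gpus | sed "s/^/$node gpu /"
    done
}
```

Running report_idle accel on a login node should then list the idle GPUs per node. A single snapshot can be misleading, so repeating the check a few times before drawing conclusions is probably wise.
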
--
Andrei
PhD Candidate at Language Technology Group (LTG)
University of Oslo