[NLPL Task Force (A)] [uninett.no #192231] GPU usage patterns
buzh@uio.no via RT
support at metacenter.no
Wed Aug 14 12:41:59 UTC 2019
Hi Vinit,
Thanks for your feedback. I will forward this through the appropriate channels
for consideration. Queue system tuning is an ongoing process, and we will
always try to optimize for the maximum benefit to our users.
In this case I suspect that many of the GPUs happened to become available at the
same time, so a single user was able to start jobs on a large number of them at
once. The only remedy for such a situation is to limit the maximum number of GPU
jobs a user can run at any one time. That is worth considering, but it is
obviously not optimal when GPUs are available and only one user is requesting
them.
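For reference, in Slurm (the scheduler used on Saga) such a cap is typically
expressed as a QOS limit. The QOS name and the limit of 4 below are purely
illustrative assumptions, not Saga's actual configuration:

```shell
# Illustrative sketch only -- the QOS name "normal" and the limit of 4
# are assumptions, not Saga's real settings.

# Cap the total number of GPUs a single user may hold at once:
sacctmgr modify qos normal set MaxTRESPerUser=gres/gpu=4

# Or, alternatively, cap the number of running jobs per user in that QOS:
sacctmgr modify qos normal set MaxJobsPerUser=4
```

The first form is usually preferable for this problem, since it limits the
resource itself rather than the job count.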
Again, I'll bring this to the attention of the right people so we can consider
the options; however, we will likely need to see what the situation is like
under production loads before deciding on a final strategy.
Best regards,
Andreas
On Wed Aug 14 14:11:11 2019, vinitr at ifi.uio.no wrote:
Hi all,
I had three jobs queued up for GPU time on Saga yesterday, and I had been waiting quite a while (~22 hours) before investigating. It turns out a single user had been allocated virtually all of the free GPUs for two days. This is extremely inconvenient: I submitted my jobs on a Tuesday, which means I would have to wait until Thursday for their jobs to terminate, and would therefore only get my results over the weekend.
Now, this obviously isn't the user's fault; it is quite convenient to queue a lot of jobs and leave them to run, and I often do this myself. However, given how drastically people's GPU usage patterns have changed over the past few years, and given that Saga has a relatively limited number of GPUs, would it be possible to have a fairer queuing and scheduling system that wouldn't instantly allocate all free resources for long stretches of time? Thanks!
Best
– Vinit