[NLPL Task Force (A)] [rt.uio.no #3552285] gpu usage on Abel for teaching in october
Stephan Oepen
oe at ifi.uio.no
Fri Aug 30 06:24:47 UTC 2019
hi again,
one mechanism that i believe Taito users in finland appreciate is its
separate gpu testing partition: jobs in this partition are severely limited
(a maximum run time of half an hour or so) but heavily prioritized by the
scheduler. because gpu jobs typically cannot be tested on login nodes,
this makes system development and debugging a lot easier.
seeing as we already anticipate fierce demand for gpu capacity, maybe Saga
could implement a similar mechanism?
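for concreteness, such a partition is typically realized directly in
slurm.conf: a small pool of gpu nodes with a strict walltime cap and a
higher priority tier than the regular gpu partition. a minimal sketch
(partition names, node ranges, and limits here are hypothetical, not
Saga's actual configuration):

```
# hypothetical slurm.conf fragment: a "gputest" partition with a strict
# 30-minute walltime cap and a higher PriorityTier, so short debugging
# jobs are scheduled ahead of jobs in the regular gpu partition
PartitionName=gputest Nodes=gpu[01-02] MaxTime=00:30:00 PriorityTier=3 State=UP
PartitionName=gpu     Nodes=gpu[01-16] MaxTime=7-00:00:00 PriorityTier=1 State=UP
```

because the scheduler always prefers the higher tier, the walltime cap
is what keeps the test partition from being abused for production runs.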
cheers, oe
On Fri, 30 Aug 2019 at 07:38 Thomas Röblitz via RT <hpc-drift at usit.uio.no>
wrote:
>
> > On 29. Aug 2019, at 21:47, Stephan Oepen via RT <hpc-drift at usit.uio.no>
> wrote:
> >
> >
> > i
> > am optimistically assuming that most users will have migrated to Saga
> > by mid-september,
>
> Nope. More likely “mid-december”.
>
> > and that Abel remains operational until at least
> > sometime into november.
>
> Yes. More likely Abel will enter a new decade.
>
> BUT, no guarantee that nn9447k is still a project on Abel and/or would
> have access to Saga & Abel. That depends on Sigma2.
>
> >
> > do these assumptions sound plausible?
>
> See above ;)
>
> > if need be, do we have
> > mechanisms in place to prevent other Abel users from saturating the
> > gpu queue for days into the future, or otherwise making sure that
> > shorter, one-gpu jobs get scheduled in between?
>
> Nah, then course users wouldn’t get the full cluster experience. Usually
> we don’t like to make such special arrangements, particularly not for such
> a long time.
>
> > this challenge will
> > likely also be relevant on Saga more or less from the beginning: at
> > least during the trial period, andrey and vinit felt that at times it
>
> Yeah, it was a pilot phase.
>
> > was near-impossible to get gpu jobs running within a couple of days,
> > because other users had put dozens of multi-gpu jobs into the queue.
>
> I think that was done at our request, because GPUs were idling. But
> sure, I’d expect very long queues for GPUs. They are newer, more performant,
> and easier to use with the latest software packages.
>
> > is there any principle of fairness across users built into the
> > scheduling decisions, i.e. making it hard for a single user to run on an
> > overwhelmingly large proportion of a specific partition while there
> > are pending jobs (even if submitted more recently) by other users?
>
> Currently, I think, there is no such policy in place. It might be possible
> to limit the number of submitted/running jobs per account or per user for a
> partition. Since many projects will likely want to use the GPUs, a fair
> policy would probably need to implement limitations across projects, i.e.,
> max 4 submitted/running jobs per project account.
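> Such a per-project cap could presumably be expressed as a Slurm QOS
> attached to the GPU partition. A sketch with hypothetical names (the
> limit values would be a policy decision, and AccountingStorageEnforce
> would need to include "limits,qos" for the caps to take effect):

```
# hypothetical: a QOS capping each project account at 4 running and
# 4 pending jobs, created via the accounting database
sacctmgr add qos gpu_fair
sacctmgr modify qos gpu_fair set MaxJobsPerAccount=4 MaxSubmitJobsPerAccount=4

# slurm.conf: attach the QOS to the gpu partition
# PartitionName=gpu ... QOS=gpu_fair
```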
>
> Thomas
>
> >
> > with thanks in advance, oe
> >
>
>