[NLPL Task Force (A)] [rt.uio.no #3552285] gpu usage on Abel for teaching in october

Thomas Röblitz via RT hpc-drift at usit.uio.no
Mon Sep 2 22:39:45 UTC 2019


> seeing as we already anticipate fierce demand for gpu capacity, maybe Saga could implement a similar mechanism?

Maybe, maybe not. Traditionally, the GPUs on Abel (and Saga replaces Abel) have been used as accelerators for existing applications, and for such uses there is not much need for development and testing. If you are developing your own code, you might be better off on another platform/resource.

Questions about Saga should be directed to support at metacenter.no

Closing case

Thomas

> 
> cheers, oe
> 
> 
> On Fri, 30 Aug 2019 at 07:38 Thomas Röblitz via RT <hpc-drift at usit.uio.no> wrote:
> 
> >
> > > On 29. Aug 2019, at 21:47, Stephan Oepen via RT <hpc-drift at usit.uio.no> wrote:
> > >
> > >
> > > i am optimistically assuming that most users will have migrated to Saga by mid-september,
> >
> > Nope. More likely “mid-December”.
> >
> > > and that Abel remains operational until at least sometime into november.
> >
> > Yes. More likely Abel will enter a new decade.
> >
> > BUT, there is no guarantee that nn9447k will still be a project on Abel and/or will have access to Saga & Abel. That depends on Sigma2.
> >
> > >
> > > do these assumptions sound plausible?
> >
> > See above ;)
> >
> > > if need be, do we have mechanisms in place to prevent other Abel users from saturating the gpu queue for days into the future, or otherwise making sure that shorter, one-gpu jobs get scheduled in between?
> >
> > Nah, then course users wouldn’t get the full cluster experience. Usually we don’t like to make such special arrangements, particularly for such a long time.
> >
> > > this challenge will likely also be relevant on Saga more or less from the beginning: at least during the trial period, andrey and vinit felt that at times it
> >
> > Yeah, it was a pilot phase.
> >
> > > was near-impossible to get gpu jobs running within a couple of days, because other users had put dozens of multi-gpu jobs into the queue.
> >
> > I think that was done at our request, because the GPUs were idling. But sure, I’d expect very long queues for the GPUs: they are newer, more performant, and easier to use with the latest software packages.
> >
> > > is there any principle of fairness across users built into the
> > > scheduling decisions, i.e. make it hard for a single user to run on
> > > an
> > > overwhelmingly large proportion of a specific partition while there
> > > are pending jobs (even if submitted more recently) by other users?
> >
> > Currently, I think, there is no such policy in place. It might be possible to limit the number of submitted/running jobs per account or per user for a partition. Since many projects will likely want to use the GPUs, a fair policy would probably need to impose limits across projects, e.g., at most 4 submitted/running jobs per project account.
> >
> > Thomas
> >
> > >
> > > with thanks in advance, oe
> > >
> >
> >
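[Editorial sketch] A per-project cap like the one described in the quoted reply could be expressed as a Slurm QOS attached to the GPU partition. This is only a hedged sketch, not an existing Abel or Saga configuration: the QOS name (gpu-fair), the partition name (accel), and the concrete limits are assumptions.

```shell
# Hypothetical sketch of a per-project GPU job cap in Slurm.
# QOS name, partition name, and limits below are illustrative assumptions.

# Create the QOS, then cap jobs per project account:
#   - at most 4 jobs submitted (pending + running) per account
#   - at most 2 of them running at any one time
sacctmgr -i add qos gpu-fair
sacctmgr -i modify qos gpu-fair set MaxSubmitJobsPerAccount=4 MaxJobsPerAccount=2

# Attach the QOS to the GPU partition in slurm.conf, e.g.:
#   PartitionName=accel ... QOS=gpu-fair
```

With such a setup, jobs in the partition count against the per-account limits no matter which user in the project submits them, which is the cross-project fairness discussed above.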



