[NLPL Task Force (A)] [rt.uio.no #3552285] gpu usage on Abel for teaching in october

Sabry Razick via RT hpc-drift at usit.uio.no
Fri Aug 30 06:31:20 UTC 2019


I am transferring the ticket to Thomas.

Regards,
Sabry.

P.S. OE, we need to do something about your software that depends on /cluster/../NLPL being available and on CentOS 6.8.

 

On 2019-08-30 08:25:02, oe wrote:
> hi again,
> 
> one mechanism that i believe Taito users in finland appreciate is its
> separate gpu testing partition: jobs in this partition are severely
> limited (a maximum run time of half an hour or so) but heavily
> prioritized by the scheduler.  because gpu jobs typically cannot be
> tested on login nodes, this makes system development and debugging a
> lot easier.
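> 
> for concreteness, on a slurm-based cluster such a partition could look
> roughly like the following (the partition name, node list, and limits
> are made up for illustration and not the actual Taito or Saga setup):
> 
>   # slurm.conf: small, high-priority partition for short gpu test jobs
>   PartitionName=gputest Nodes=gpu[01-02] MaxTime=00:30:00 PriorityTier=10 Default=NO State=UP
> 
>   # submit a quick one-gpu test job to it
>   sbatch --partition=gputest --gres=gpu:1 --time=00:15:00 test-job.sh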
> 
> seeing as we already anticipate fierce demand for gpu capacity, maybe
> Saga could implement a similar mechanism?
> 
> cheers, oe
> 
> 
> On Fri, 30 Aug 2019 at 07:38 Thomas Röblitz via RT <hpc-drift at usit.uio.no>
> wrote:
> 
> >
> > On 29. Aug 2019, at 21:47, Stephan Oepen via RT <hpc-drift at usit.uio.no>
> > wrote:
> > >
> > > i am optimistically assuming that most users will have migrated to
> > > Saga by mid-september,
> >
> > Nope. More likely “mid-december”.
> >
> > > and that Abel remains operational until at least
> > > sometime into november.
> >
> > Yes. More likely Abel will enter a new decade.
> >
> > BUT, no guarantee that nn9447k is still a project on Abel and/or would
> > have access to Saga & Abel. That depends on Sigma2.
> >
> > >
> > > do these assumptions sound plausible?
> >
> > See above ;)
> >
> > > if need be, do we have
> > > mechanisms in place to prevent other Abel users from saturating the
> > > gpu queue for days into the future, or otherwise making sure that
> > > shorter, one-gpu jobs get scheduled in between?
> >
> > Nah, then course users wouldn’t get the full cluster experience.
> > Usually we don’t like to make such special arrangements, particularly
> > for such a long time.
> >
> > > this challenge will likely also be relevant on Saga more or less
> > > from the beginning: at least during the trial period, andrey and
> > > vinit felt that at times it
> >
> > Yeah, it was a pilot phase.
> >
> > > was near-impossible to get gpu jobs running within a couple of days,
> > > because other users had put dozens of multi-gpu jobs into the queue.
> >
> > I think that was done at our request, because the GPUs were idling.
> > But sure, I’d expect very long queues for the GPUs. They are newer,
> > more performant, and easier to use with the latest software packages.
> >
> > > is there any principle of fairness across users built into the
> > > scheduling decisions, i.e. making it hard for a single user to run
> > > on an overwhelmingly large proportion of a specific partition while
> > > there are pending jobs (even if submitted more recently) by other
> > > users?
> >
> > Currently, I think, there is no such policy in place. It might be
> > possible to limit the number of submitted/running jobs per account or
> > per user for a partition. Since many projects will likely want to use
> > the GPUs, a fair policy would probably need to impose limits across
> > projects, e.g., a maximum of 4 submitted/running jobs per project
> > account.
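> > 
> > Off the top of my head, such a cap could be expressed as a Slurm QOS
> > attached to the GPU partition, roughly along these lines (the QOS name
> > and numbers are only illustrative; nothing like this is configured
> > yet):
> > 
> >   # create a QOS that caps running/submitted jobs per project account
> >   sacctmgr add qos gpu-fair
> >   sacctmgr modify qos gpu-fair set MaxJobsPerAccount=4 MaxSubmitJobsPerAccount=8
> > 
> >   # the QOS would then be attached to the GPU partition via its QOS=
> >   # parameter in slurm.conf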
> >
> > Thomas
> >
> > >
> > > with thanks in advance, oe
> > >
> >
> >



