[NLPL Task Force (A)] [rt.uio.no #3552285] gpu usage on Abel for teaching in october

Sabry Razick via RT hpc-drift at usit.uio.no
Thu Aug 29 20:27:49 UTC 2019


Hello,

On 2019-08-29 21:47:36, oe wrote:
> colleagues,
> 
> under the NLPL umbrella, we are planning a series of lab assignments
> on neural machine translation (NMT) between late september and
> mid-november.  there will likely be around ten student teams who each
> will want to run frequent multi-hour jobs on one gpu for those weeks.
> in principle, the Abel hardware would be fully sufficient for this
> purpose, but we would somehow need to make sure that a non-trivial
> fraction of the available gpu capacity will actually be available.  i
> am optimistically assuming that most users will have migrated to Saga
> by mid-september, and that Abel remains operational until at least
> sometime into november.

If the migration is complete, there is a possibility that Abel will
be shut down sooner than that. So I would recommend not planning to use
Abel after October.

> 
> do these assumptions sound plausible?  if need be, do we have
> mechanisms in place to prevent other Abel users from saturating the
> gpu queue for days into the future, or otherwise making sure that
> shorter, one-gpu jobs get scheduled in between?  this challenge will
> likely also be relevant on Saga more or less from the beginning: at
> least during the trial period, andrey and vinit felt that at times it
> was near-impossible to get gpu jobs running within a couple of days,
> because other users had put dozens of multi-gpu jobs into the queue.
> is there any principle of fairness across users built into the
> scheduling decisions, i.e. make it hard for a single user to run on an
> overwhelmingly large proportion of a specific partition while there
> are pending jobs (even if submitted more recently) by other users?

A reservation (which allows only users of a given project access to a set of
nodes for a time period) could be arranged to facilitate what you require.
However, whether Abel will still be operational by then is the question. I am
not sure how much we can influence the Saga queue (as Abel is a UiO machine,
BHM can arrange this there).

> 
> with thanks in advance, oe
> 

May I recommend using one of the ML machines for this? If that is a
possibility, we can arrange a meeting with Thomas about it. If not, I will
forward the request to Jon and Gard and find a solution.

Regards,
Sabry



