[NLPL Task Force (A)] gpu usage on Abel for teaching in october

Thu Aug 29 19:46:57 UTC 2019

colleagues,

under the NLPL umbrella, we are planning a series of lab assignments
on neural machine translation (NMT) between late september and
mid-november.  there will likely be around ten student teams who each
will want to run frequent multi-hour jobs on one gpu for those weeks.
in principle, the Abel hardware would be fully sufficient for this
purpose, but we would somehow need to make sure that a non-trivial
fraction of its gpu capacity will actually be free for the students.  i
am optimistically assuming that most users will have migrated to Saga
by mid-september, and that Abel remains operational until at least
sometime into november.

do these assumptions sound plausible?  if need be, do we have
mechanisms in place to prevent other Abel users from saturating the
gpu queue for days into the future, or to otherwise make sure that
shorter, one-gpu jobs get scheduled in between?  this challenge will
likely also be relevant on Saga more or less from the beginning: at
least during the trial period, andrey and vinit felt that at times it
was near-impossible to get gpu jobs running within a couple of days,
because other users had put dozens of multi-gpu jobs into the queue.
is there any principle of fairness across users built into the
scheduling decisions, i.e. something that makes it hard for a single
user to run on an overwhelmingly large proportion of a specific
partition while there are pending jobs (even if submitted more
recently) by other users?

with thanks in advance, oe