<div><div dir="auto">hi again,</div><div dir="auto"><br></div><div dir="auto">one mechanism that i believe Taito users in finland appreciate is its dedicated gpu testing partition: jobs in this partition are strictly limited (a maximum run time of around half an hour) but heavily prioritized by the scheduler. because gpu jobs typically cannot be tested on login nodes, this makes development and debugging a lot easier.</div></div><div dir="auto"><br></div><div dir="auto">seeing as we already anticipate fierce demand for gpu capacity, maybe Saga could implement a similar mechanism?</div><div dir="auto"><br></div><div dir="auto">cheers, oe</div><div dir="auto"><br></div><div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, 30 Aug 2019 at 07:38 Thomas Röblitz via RT <<a href="mailto:hpc-drift@usit.uio.no">hpc-drift@usit.uio.no</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
> On 29. Aug 2019, at 21:47, Stephan Oepen via RT <<a href="mailto:hpc-drift@usit.uio.no" target="_blank">hpc-drift@usit.uio.no</a>> wrote:<br>
> <br>
> <br>
> i<br>
> am optimistically assuming that most users will have migrated to Saga<br>
> by mid-september,<br>
<br>
Nope. More likely “mid-december”.<br>
<br>
> and that Abel remains operational until at least<br>
> sometime into november.<br>
<br>
Yes. More likely Abel will enter a new decade.<br>
<br>
BUT, there is no guarantee that nn9447k will still be a project on Abel and/or have access to Saga & Abel. That depends on Sigma2.<br>
<br>
> <br>
> do these assumptions sound plausible? <br>
<br>
See above ;)<br>
<br>
> if need be, do we have<br>
> mechanisms in place to prevent other Abel users from saturating the<br>
> gpu queue for days into the future, or otherwise making sure that<br>
> shorter, one-gpu jobs get scheduled in between?<br>
<br>
Nah, then of course users wouldn’t get the full cluster experience. Usually we don’t like to make such special arrangements, particularly for such a long time.<br>
<br>
> this challenge will<br>
> likely also be relevant on Saga more or less from the beginning: at<br>
> least during the trial period, andrey and vinit felt that at times it<br>
<br>
Yeah, it was a pilot phase.<br>
<br>
> was near-impossible to get gpu jobs running within a couple of days,<br>
> because other users had put dozens of multi-gpu jobs into the queue.<br>
<br>
I think that was done at our request, because the GPUs were idling. But sure, I’d expect very long queues for the GPUs: they are newer, more performant, and easier to use with the latest software packages.<br>
<br>
> is there any principle of fairness across users built into the<br>
> scheduling decisions, i.e. make it hard for a single user to run on an<br>
> overwhelmingly large proportion of a specific partition while there<br>
> are pending jobs (even if submitted more recently) by other users?<br>
<br>
Currently, I think, there is no such policy in place. It might be possible to limit the number of submitted or running jobs per account or per user for a partition. Since many projects will likely want to use the GPUs, a fair policy would probably need to impose limits across projects, e.g., at most 4 submitted or running jobs per project account.<br>
<br>
Thomas<br>
<br>
> <br>
> with thanks in advance, oe<br>
> <br>
<br>
</blockquote></div></div>
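ps — for concreteness, the combination discussed above (a short, high-priority gpu test partition plus per-project limits) might look roughly like this in Slurm; all partition names, node names, QOS names, and limit values here are illustrative assumptions on my part, not Saga's actual configuration:

```ini
# slurm.conf sketch -- names and values are illustrative, not Saga's real setup.

# a small gpu test partition: strict 30-minute time limit, but a higher
# priority tier, so short debugging jobs schedule ahead of the regular queue.
PartitionName=accel_test Nodes=gpu[01-02] MaxTime=00:30:00 PriorityTier=10 State=UP

# the regular gpu partition, bound to a QOS that can cap per-project usage.
PartitionName=accel Nodes=gpu[01-08] MaxTime=7-00:00:00 PriorityTier=1 QOS=gpu-fair State=UP
```

the fairness limit thomas mentions (e.g. at most 4 running jobs per project account) could then be set on that QOS with something like `sacctmgr modify qos gpu-fair set MaxJobsPerAccount=4` — again just a sketch of one possible arrangement.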