[NLPL Task Force (A)] big array job
Stephan Oepen
oe at ifi.uio.no
Sat Feb 16 14:06:23 UTC 2019
hi asad,
i am glad you are about to put some load on the system :-). as long
as you stay within the published limit of a maximum of 400 jobs (and
your typical job behaves itself, e.g. does not put undue strain on the
file system), i see no reason for you to be overly cautious. i
would recommend you keep an eye on your jobs, at least in the
beginning, and monitor your mailbox ... in case the system
administrators have something to remark on.
are all of these jobs single-threaded? i am no big fan of the Abel
arrayrun(1) facility. what i usually do is create a large file with
one command line per job i want to run, each with whatever
parameters that job requires. a silly example of such a master job
file could be generated like this:
for i in 0 1 2 3 4 5 6 7 8 9; do
  for j in 0 1 2 3 4 5 6 7 8 9; do
    echo "sbatch ${HOME}/echo.slurm ${i} ${j}";
  done;
done > ~/echo.jobs
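for reference, a minimal echo.slurm for the above could look something
like the following sketch (the account name and resource requests are
placeholders; sbatch hands the two extra command-line arguments to the
script as $1 and $2):

#!/bin/bash
#SBATCH --job-name=echo
#SBATCH --account=nn9999k        # placeholder project account
#SBATCH --time=00:05:00          # placeholder resource requests
#SBATCH --mem-per-cpu=1G

# the two positional arguments are the ${i} and ${j} values from
# the master job file
echo "task ($1, $2) running on $(hostname)"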
assuming such a file, i have a script that ‘trickles’ through the
sequence of jobs, keeping up to some maximum number of queue entries
at any point in time, and filling the queue back up to the limit as
jobs terminate. my idiom for setting this process in motion goes as
follows:
/projects/nlpl/operation/tools/trickle --start --limit 20 ~/echo.jobs
while true; do
  /projects/nlpl/operation/tools/trickle --limit 20 ~/echo.jobs;
  sleep 30;
done
[19-02-16 15:00:37] trickle[20]: 20 jobs; 3 running; 0 new.
[19-02-16 15:01:07] trickle[20]: 17 jobs; 0 running; 3 new.
[19-02-16 15:01:38] trickle[23]: 17 jobs; 0 running; 3 new.
[19-02-16 15:02:10] trickle[26]: 20 jobs; 3 running; 0 new.
the first integer (in square brackets) is the pointer into the job
sequence: 20 initially, then advancing at each step by the number of
new jobs submitted in that call.
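in case that script is not accessible from where you work, the core
idea can be approximated in a few lines of shell (this is a rough
sketch, not the actual implementation; the pointer file name is made
up):

#!/bin/bash
# keep the queue topped up to a fixed limit, submitting the next
# unsubmitted lines from the master job file as slots free up.
jobs=$1; limit=${2:-20}
pointer_file="${jobs}.pointer"             # made-up bookkeeping file
pointer=$(cat "${pointer_file}" 2>/dev/null || echo 0)
queued=$(squeue -h -u "${USER}" | wc -l)   # our current queue entries
slots=$(( limit - queued )); (( slots < 0 )) && slots=0
batch=$(tail -n +$(( pointer + 1 )) "${jobs}" | head -n "${slots}")
new=$(grep -c . <<< "${batch}")            # lines actually available
if (( new > 0 )); then
  sh <<< "${batch}"                        # each line is an sbatch call
  pointer=$(( pointer + new ))
  echo "${pointer}" > "${pointer_file}"
fi
echo "trickle[${pointer}]: ${queued} jobs; ${new} new."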
just in case you might find this useful ... for all i know, this
script provides functionality similar to arrayrun(1), but i find it
more convenient to be able to pass each job its full command line
directly, without having to dispatch on the job indices under
arrayrun(1) control.
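for contrast, under a plain sbatch job array (and, i believe, under
arrayrun(1) as well) the job script only receives a numeric index and
has to map it back to actual parameters itself, e.g. (a sketch; the
parameter file is made up):

#!/bin/bash
#SBATCH --array=1-100
# recover this task's parameters by looking up the line in a
# parameter file that corresponds to the array index
parameters=$(sed -n "${SLURM_ARRAY_TASK_ID}p" ~/echo.parameters)
echo "task ${SLURM_ARRAY_TASK_ID}: ${parameters}"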
i will be curious to know how these jobs turn out for you :-)! oe
On Sat, Feb 16, 2019 at 1:42 PM Asad Sayeed <asad.sayeed at gu.se> wrote:
>
> Hi,
>
> The abel documentation says users are allowed to run up to 400 jobs
> simultaneously. If I run arrayrun 4x on different segments of the
> corpus, will I get myself into trouble with the authorities or
> something? 400 at a time is a significant time saving for me, obviously
> (2.5 days for the whole thing).
>
> Thanks.
>
> Yours,
> --Asad.
>
>
> On 2019-02-16 01:29 PM, Asad Sayeed wrote:
> > Hi Stephan,
> >
> > I am now trying to scale up my SRL task "for real" over 70M sentences,
> > divided into 3500 segments/tasks, each taking about 12G of memory
> > and about 7 hours. I am trying to use arrayrun on abel on my
> > script. However, it seems like arrayrun will only activate 100 jobs
> > at a time. This will take 10 days to run the entire job, which is
> > slower than the much smaller cluster I was running it on elsewhere
> > (where I can run about 300 at a time and take 14 hours, for about 7
> > days). I was hoping to gain a significantly better turnaround time for
> > experimentation on abel. Is there any way to run more at a time on abel, or is
> > that a hard limit?
> >
> > Thanks.
> >
> > Yours,
> > --Asad.
> >
>