[NLPL Task Force (A)] big array job
Asad Sayeed
asad.sayeed at gu.se
Sun Feb 17 01:21:18 UTC 2019
Thanks. Maybe I was making some kind of weird mistake, but arrayrun
seems to cancel any job submitted beyond the 100 already running if
you try to run multiple instances in parallel. I will try your script
instead.
...
Actually, running the existing arrayrun in progress and trickle
simultaneously has an interesting effect. We shall see.
Yours,
--Asad.
On 2019-02-16 03:06 PM, Stephan Oepen wrote:
> hi asad,
>
> i am glad you are about to put some load on the system :-). as long
> as you stick within the published limit of a maximum of 400 jobs (and
> your typical job behaves itself, e.g. does not put undue strain on the
> file system), i see no reason why you should be overly careful. i
> would recommend you keep an eye on your jobs, at least in the
> beginning, and monitor your mailbox ... in case the system
> administrators find something to remark.
>
> are all of these jobs single-threaded? i am no big fan of the Abel
> arrayrun(1) facility. what i usually do is create a large file with
> as many command lines as i want to run jobs, each with whatever
> parameters that job requires. a silly example of such a master job
> file could be something like
>
> for i in 0 1 2 3 4 5 6 7 8 9; do
>   for j in 0 1 2 3 4 5 6 7 8 9; do
>     echo "sbatch ${HOME}/echo.slurm ${i} ${j}";
>   done;
> done > ~/echo.jobs
>
> assuming such a file, i have a script that ‘trickles’ through the
> sequence of jobs, keeping up to some maximum limit of queue entries at
> any point in time, and filling up the queue to the limit again as jobs
> terminate. my idiom of setting into motion this process then goes as
> follows:
>
> /projects/nlpl/operation/tools/trickle --start --limit 20 ~/echo.jobs
> while true; do
>   /projects/nlpl/operation/tools/trickle --limit 20 ~/echo.jobs;
>   sleep 30;
> done
> [19-02-16 15:00:37] trickle[20]: 20 jobs; 3 running; 0 new.
> [19-02-16 15:01:07] trickle[20]: 17 jobs; 0 running; 3 new.
> [19-02-16 15:01:38] trickle[23]: 17 jobs; 0 running; 3 new.
> [19-02-16 15:02:10] trickle[26]: 20 jobs; 3 running; 0 new.
>
> the first integer is the pointer into the job sequence, 20 initially,
> then at each step advancing by the number of new jobs submitted for
> that call.
>
> —just in case you might find this useful ... for all i know, this
> script provides similar functionality to arrayrun(1), but i find it
> more convenient to be able to pass each job its full command line
> directly, without having to dispatch on the job indices under
> arrayrun(1) control.
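>
> for illustration, the top-up logic such a script might implement could
> be sketched roughly as follows (a hypothetical reconstruction, not the
> actual trickle code; the function names and the toy submit/count hooks
> are made up for the example):

```python
#!/usr/bin/env python3
# Hypothetical sketch of a trickle-style submitter.  It keeps a pointer
# into the master job file and, on each call, submits just enough new
# jobs to bring the number of active queue entries back up to the limit.

def trickle(job_lines, pointer, limit, count_active, submit):
    """Advance `pointer` through job_lines, topping the queue up to `limit`.

    count_active(): returns the number of this user's jobs still queued/running
    submit(line):   submits one job (e.g. by executing its `sbatch ...` line)
    """
    active = count_active()
    new = 0
    while active + new < limit and pointer < len(job_lines):
        submit(job_lines[pointer])
        pointer += 1
        new += 1
    return pointer, new

# toy demonstration: 7 jobs, limit 3, nothing active yet
jobs = [f"sbatch echo.slurm {i}" for i in range(7)]
submitted = []
ptr, new = trickle(jobs, 0, 3, lambda: 0, submitted.append)
# after this call, jobs[0..2] have been "submitted" and ptr == 3
```

> in a real version, count_active() might wrap something like
> `squeue -u $USER -h | wc -l`, and submit() would shell out to run the
> sbatch line; the pointer would be persisted between invocations.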
>
> i will be curious to know how these jobs turn out for you :-)! oe
>
> On Sat, Feb 16, 2019 at 1:42 PM Asad Sayeed <asad.sayeed at gu.se> wrote:
>> Hi,
>>
>> The abel documentation says users are allowed to run up to 400 jobs
>> simultaneously. If I run arrayrun 4x on different segments of the
>> corpus, will I get myself into trouble with the authorities or
>> something? 400 at a time is a significant time saving for me, obviously
>> (2.5 days for the whole thing).
>>
>> Thanks.
>>
>> Yours,
>> --Asad.
>>
>>
>> On 2019-02-16 01:29 PM, Asad Sayeed wrote:
>>> Hi Stephan,
>>>
>>> I am now trying to scale up my SRL task "for real" over 70M sentences,
>>> divided into 3500 segments/tasks, each taking about 12G of memory and
>>> about 7 hours. I am trying to use arrayrun on abel with my script.
>>> However, it seems that arrayrun will only activate 100 jobs at a
>>> time. At that rate the entire job will take 10 days, which is slower
>>> than the much smaller cluster I was running it on elsewhere (where I
>>> can run about 300 at a time, each taking 14 hours, for a total of
>>> about 7 days). I was hoping for a significant improvement in
>>> turnaround time on abel. Is there any way to get more jobs running
>>> on abel, or is that a hard limit?
>>>
>>> Thanks.
>>>
>>> Yours,
>>> --Asad.
>>>