[NLPL Task Force (A)] [rt.uio.no #3588329] Re: Jobs' output

Thomas Röblitz via RT hpc-drift at usit.uio.no
Sat Sep 28 06:32:16 UTC 2019


Hi guys,

a few quick comments.

First, the job script lacks the ‘source /cluster/bin/jobsetup’ line, which is why commands such as chkfile and cleanup are not available. Not sourcing it also means you don’t get a header and footer in your output file; these report a number of important stats about your job (start, end, resource usage). If you scroll up a bit on the URL you had in your email, you will see an example job script that includes this ‘source’ line.
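For reference, a minimal job script along the lines of the example in the guide might look like this (the account, partition, and training command below are placeholders for illustration, not taken from your actual script):

```shell
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=nn9447k
#SBATCH --partition=accel --gres=gpu:1
#SBATCH --time=01:00:00

## Without this line, chkfile/cleanup are undefined and the SLURM output
## file gets no header/footer with start/end times and resource usage:
source /cluster/bin/jobsetup

python train.py ...
```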

Second, contrary to what Stephan writes, working on $HOME is discouraged, even if YOUR job is not I/O intensive.

I hope that users under NLPL are taught otherwise, or I would be a bit worried ;)

BTW, the simple job example at the URL above also shows the standard pattern for setting up the job input data on $SCRATCH and then copying results back to the submission directory.
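In sketch form, the pattern looks like this (the input and model file names here are made up for illustration):

```shell
source /cluster/bin/jobsetup       # defines $SCRATCH, $SUBMITDIR, chkfile, ...

cp $SUBMITDIR/train.txt $SCRATCH/  # stage the input onto the scratch area
chkfile "model_word_step_*.pt"     # copy matching results back to $SUBMITDIR
                                   # when the job finishes
cd $SCRATCH
python train.py ...
```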

Also, if you run ‘sacct’, it shows that your job completed successfully:

# sacct -j 28250251 -o jobid,jobname,partition,user,account,alloccpus,state,exitcode,timelimit,elapsed
       JobID    JobName  Partition      User    Account  AllocCPUS      State ExitCode  Timelimit    Elapsed 
------------ ---------- ---------- --------- ---------- ---------- ---------- -------- ---------- ---------- 
28250251           test      accel     gtang    nn9447k          4  COMPLETED      0:0   01:00:00   00:13:42 
28250251.ba+      batch                         nn9447k          4  COMPLETED      0:0              00:13:42 

Finally, at least now you seem to have sufficient space under your $HOME directory, so copying results back from $SCRATCH should work unless your output data exceeds ~500 GiB.

I don’t know the exact reason the GPU is not being used. However, it might be good to add a command like ‘nvidia-smi’ to your job script to check whether you can see a GPU at all. Also, you might add ‘export’ in front of the CUDA_VISIBLE_DEVICES assignment, or put the setting directly in front of the command you run (CUDA_VISIBLE_DEVICES=0 python ...). Finally, you might check whether ‘-gpu_rank 1’ refers to the same GPU as the one chosen with the environment variable.
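A quick sanity check at the top of the job script could look like this (a sketch; the per-command form shown in the comment is equivalent to the export):

```shell
## 'export' makes the setting visible to child processes such as python:
export CUDA_VISIBLE_DEVICES=0

## List the GPUs the job can actually see (skipped where the tool is absent):
command -v nvidia-smi >/dev/null && nvidia-smi

## Equivalent per-command form, without export:
##   CUDA_VISIBLE_DEVICES=0 python train.py ...

## Verify that a child process inherits the variable:
sh -c 'echo "child sees CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'
```

Note that device indices are counted from 0, so if ‘-gpu_rank 1’ selects the second device, it would not match the GPU exposed by CUDA_VISIBLE_DEVICES=0.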

Have a nice weekend

Thomas

--
Research Infrastructure Services, USIT
University of Oslo, Norway

> On 27. Sep 2019, at 23:39, gongbo.tang at lingfil.uu.se via RT <hpc-drift at usit.uio.no> wrote:
> 
> 
> 2019-09-27 23:39:16: Request 3588329 was acted upon.
> Transaction: Ticket created by gongbo.tang at lingfil.uu.se
>      Queue: hpc-drift
>    Subject: Re: [NLPL Task Force (A)] Jobs' output
>      Owner: Nobody
> Requestors: gongbo.tang at lingfil.uu.se
>     Status: new
> Ticket <URL: https://rt.uio.no/Ticket/Display.html?id=3588329 >
> 
> 
> Hi Stephan,
> 
> 
> Thanks for your quick help.
> 
> 
> Yes, I mean the model files generated by OpenNMT. Actually, the job has finished and the slurm log file has all the log information.
> 
> 
> I just tried to run the job on the login node (CPU), and OpenNMT could generate models successfully. I think the problem happens when using GPUs. The log file should contain a line like this:
> 
> 
> "[2019-09-27 21:24:57,888 INFO] Saving checkpoint /usit/abel/u1/gtang/model_word_step_200.pt"
> 
> 
> However, there is no such log in the slurm file. I guess that OpenNMT does not save models at all. (have no access?)
> 
> 
> Best,
> 
> Gongbo
> 
> ________________________________
> From: Stephan Oepen <oe at ifi.uio.no>
> Sent: Friday, September 27, 2019 10:21:29 PM
> To: Gongbo Tang
> Cc: hpc at usit.uio.no; infrastructure
> Subject: Re: [NLPL Task Force (A)] Jobs' output
> 
> hi gongbo,
> 
> i cannot say that i have used OpenNMT much myself, but more generally:
> unless i run something that is very i/o-intensive, i do not take the
> trouble of copying input and output data back and forth between the
> $SCRATCH filesystem, i.e. i doubt you need to worry about chkfile and
> friends.  i would just work out of your home directory, i.e. read and
> write data there.
> 
> the SLURM log file you sent does not look as if the job actually has
> completed?  i assume by 'output files' you mean files generated during
> the OpenNMT run, i.e. the actual model file?  i might guess that the
> model is only serialized to disk upon completion of the training, so
> could it be the case that your job actually had not gotten to that
> point?
> 
> a general piece of advice: to debug it might help to reduce the
> problem to a tiny training file, possibly even something that can
> complete in a matter of a few minutes on a cpu node.  that should
> allow you to find out where the output file(s) end up, and once you
> have a working set-up, you can submit larger jobs (to the gpu nodes).
> 
> best wishes, oe
> 
>> On Fri, Sep 27, 2019 at 10:09 PM Gongbo Tang <gongbo.tang at lingfil.uu.se> wrote:
>> 
>> Hi,
>> 
>> 
>> I ran into a problem: I cannot find any output files/models after running a job, or perhaps the job did not generate any models while running.
>> 
>> 
>> I am using Open-NMT 0.2.1, maintained by NLPL. I did not find the "Saving checkpoint ..." line that should appear in the log file. I have attached the slurm file and the job script.
>> 
>> 
>> I tried to use "chkfile" or "cleanup" command to save the outputs, following the guide here (https://www.uio.no/english/services/it/research/hpc/abel/help/user-guide/job-scripts.html#Work_Directory), but I was told that "chkfile" and "cleanup" are not found.
>> 
>> 
>> I also tried to set the output directory to the home directory (~, /usit/abel/u1/gtang). I still got nothing.
>> 
>> 
>> Could you please tell me how I can get the job's outputs? Thanks a lot!
>> 
>> 
>> Best,
>> 
>> Gongbo
>> 
>> 
>> E-mailing Uppsala University means that we will process your personal data. For more information on how this is performed, please read here: http://www.uu.se/en/about-uu/data-protection-policy



