<html> <head> <meta http-equiv="Content-Type" content="text/html; charset=Windows-1252"> <meta name="Generator" content="Microsoft Exchange Server"> <style></style> </head> <body> <meta content="text/html; charset=UTF-8"> <style type="text/css" style="">  </style> <div dir="ltr"> <div id="x_divtagdefaultwrapper" dir="ltr" style="font-size:12pt; color:#000000; font-family:Calibri,Helvetica,sans-serif">
Hi,
Thanks so much for your great help! I have fixed it now! I added "source /cluster/bin/jobsetup" and the chkfile command to the script. The most important change is that I replaced "-gpu_rank 1" with "-gpuid 0". This is more related to how OpenNMT is used.
Have a nice weekend!
Best,
Gongbo
</div> <hr tabindex="-1" style="display:inline-block; width:98%"> <div id="x_divRplyFwdMsg" dir="ltr">
From: rt-user@ulrik.uio.no &lt;rt-user@ulrik.uio.no&gt; on behalf of Thomas Röblitz via RT &lt;hpc-drift@usit.uio.no&gt;
Sent: Saturday, September 28, 2019 8:32:16 AM
To: Gongbo Tang
Cc: infrastructure@nlpl.eu; oe@ifi.uio.no
Subject: Re: [rt.uio.no #3588329] Re: [NLPL Task Force (A)] Jobs' output
<div> </div> </div> </div> <div class="PlainText">
Hei guys,
a few quick comments.
First, the job script lacks the ‘source /cluster/bin/jobsetup’ line, which is why commands such as chkfile and cleanup are not available. Not sourcing it also means you don’t get a header and footer in your output file. These tell you a number of important stats on your job (start, end, resource usage). If you scroll up a bit on the URL you had in your email, you will see an example job script that includes this ‘source’ line.
Second, contrary to what Stephan writes, working on $HOME is discouraged, even if YOUR job is not I/O intensive. I hope that users under NLPL are taught otherwise or I would be a bit worried ;) BTW, the simple job example at that URL also shows a standard pattern for setting up the job input data on $SCRATCH and then copying results back to the submission dir.
Also, if you run ‘sacct’, it shows that your job completed successfully.

# sacct -j 28250251 -o jobid,jobname,partition,user,account,alloccpus,state,exitcode,timelimit,elapsed
       JobID    JobName  Partition      User    Account  AllocCPUS      State ExitCode  Timelimit    Elapsed
------------ ---------- ---------- --------- ---------- ---------- ---------- -------- ---------- ----------
    28250251       test      accel     gtang    nn9447k          4  COMPLETED      0:0   01:00:00   00:13:42
28250251.ba+      batch                         nn9447k          4  COMPLETED      0:0              00:13:42

Finally, at least now you seem to have sufficient space under your $HOME directory, so copying results back from $SCRATCH should work unless you have more than ~500 GiB of output data.
I don’t know the exact reason why the GPU is not being used. However, it might be good to add a command like ‘nvidia-smi’ to your job script to check whether you can see a GPU. Also, you might add ‘export’ in front of the CUDA_VISIBLE_DEVICES assignment, or put the assignment directly in front of the command you run (CUDA_VISIBLE_DEVICES=0 python ...). You might also check whether ‘-gpu_rank 1’ refers to the same GPU as the one chosen with the environment variable (CUDA...).
Have a nice weekend
Thomas
--
Research Infrastructure Services, USIT
University of Oslo, Norway

> On 27. Sep 2019, at 23:39, gongbo.tang@lingfil.uu.se via RT &lt;hpc-drift@usit.uio.no&gt; wrote:
>
> 2019-09-27 23:39:16: Request 3588329 was acted upon.
> Transaction: Ticket created by gongbo.tang@lingfil.uu.se
> Queue: hpc-drift
> Subject: Re: [NLPL Task Force (A)] Jobs' output
> Owner: Nobody
> Requestors: gongbo.tang@lingfil.uu.se
> Status: new
> Ticket &lt;URL: <a href="https://rt.uio.no/Ticket/Display.html?id=3588329">https://rt.uio.no/Ticket/Display.html?id=3588329</a> &gt;
>
> Hi Stephan,
>
> Thanks for your quick help.
>
> Yes, I mean the model files generated by OpenNMT. Actually, the job has finished and the slurm log file has all the log information.
>
> I just tried to run the job on the login node (CPU), and OpenNMT could generate models successfully.
I think the problem happens when using GPUs. The log file should contain a line like this:
>
> "[2019-09-27 21:24:57,888 INFO] Saving checkpoint /usit/abel/u1/gtang/model_word_step_200.pt"
>
> However, there is no such line in the slurm file. I guess that OpenNMT does not save models at all (perhaps it has no access?).
>
> Best,
> Gongbo
>
> ________________________________
> From: Stephan Oepen &lt;oe@ifi.uio.no&gt;
> Sent: Friday, September 27, 2019 10:21:29 PM
> To: Gongbo Tang
> Cc: hpc@usit.uio.no; infrastructure
> Subject: Re: [NLPL Task Force (A)] Jobs' output
>
> hi gongbo,
>
> i cannot say that i have used OpenNMT much myself, but more generally:
> unless i run something that is very i/o-intensive, i do not take the
> trouble of copying input and output data to and from the
> $SCRATCH filesystem, i.e. i doubt you need to worry about chkfile and
> friends. i would just work out of your home directory, i.e. read and
> write data there.
>
> the SLURM log file you sent does not look as if the job has actually
> completed? i assume by 'output files' you mean files generated during
> the OpenNMT run, i.e. the actual model file? i might guess that the
> model is only serialized to disk upon completion of the training, so
> could it be the case that your job actually had not gotten to that
> point?
>
> a general piece of advice: to debug, it might help to reduce the
> problem to a tiny training file, possibly even something that can
> complete in a matter of a few minutes on a cpu node. that should
> allow you to find out where the output file(s) end up, and once you
> have a working set-up, you can submit larger jobs (to the gpu nodes).
>
> best wishes, oe
>
>> On Fri, Sep 27, 2019 at 10:09 PM Gongbo Tang &lt;gongbo.tang@lingfil.uu.se&gt; wrote:
>>
>> Hi,
>>
>> I ran into a problem: I cannot find any output files/models after running a job. Or perhaps the job did not generate any models while running.
>>
>> I am using OpenNMT 0.2.1, maintained by NLPL.
I did not find any "Saving checkpoint ..." messages in the log file, although they should be there. I attached the slurm file and the job script.
>>
>> I tried to use the "chkfile" or "cleanup" command to save the outputs, following the guide here (<a href="https://www.uio.no/english/services/it/research/hpc/abel/help/user-guide/job-scripts.html#Work_Directory">https://www.uio.no/english/services/it/research/hpc/abel/help/user-guide/job-scripts.html#Work_Directory</a>), but I was told that "chkfile" and "cleanup" were not found.
>>
>> I also tried setting the output directory to my home directory (~, /usit/abel/u1/gtang). I still got nothing.
>>
>> Could you please tell me how I can get the job's outputs? Thanks a lot!
>>
>> Best,
>> Gongbo
>>
>> E-mailing Uppsala University means that we will process your personal data. For more information on how this is performed, please read here: <a href="http://www.uu.se/en/about-uu/data-protection-policy">http://www.uu.se/en/about-uu/data-protection-policy</a>
</div> </body> </html>