<html><body><div style="font-family: arial, helvetica, sans-serif; font-size: 10pt; color: #000000"><div><style><!-- .EmailQuote { margin-left: 1pt; padding-left: 4pt; border-left: #800000 2px solid; } --></style><style>
<!--
@font-face
{font-family:"Cambria Math"}
@font-face
{font-family:Calibri}
p.x_MsoNormal, li.x_MsoNormal, div.x_MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif}
a:link, span.x_MsoHyperlink
{color:blue;
text-decoration:underline}
a:visited, span.x_MsoHyperlinkFollowed
{color:#954F72;
text-decoration:underline}
.x_MsoChpDefault
{}
@page WordSection1
{margin:70.85pt 56.7pt 70.85pt 56.7pt}
div.x_WordSection1
{}
-->
</style></div><div>Hi again,<br></div><div><br data-mce-bogus="1"></div><div>Sorry, I accidentally hit send too early.<br data-mce-bogus="1"></div><div><br data-mce-bogus="1"></div><div>So my suspicion is that some environment setting is now slightly different from what it used to be, and that this affects srun. Does removing srun from the script resolve the issue, or is it only a workaround?<br data-mce-bogus="1"></div><div><br data-mce-bogus="1"></div><div>Martin<br data-mce-bogus="1"></div><div data-marker="__SIG_PRE__"><br data-mce-bogus="1"></div><div><br></div><hr id="zwchr" data-marker="__DIVIDER__"><div data-marker="__HEADERS__"><blockquote style="border-left:2px solid #1010FF;margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;"><b>From: </b>"Martin Matthiesen" <martin.matthiesen@csc.fi><br><b>To: </b>"Yves Scherrer" <yves.scherrer@helsinki.fi><br><b>Cc: </b>"Stephan Oepen" <oe@ifi.uio.no>, "infrastructure" <infrastructure@nlpl.eu><br><b>Sent: </b>Thursday, 20 December, 2018 11:02:00<br><b>Subject: </b>Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)<br></blockquote></div><div data-marker="__QUOTED_TEXT__"><blockquote style="border-left:2px solid #1010FF;margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;"><div style="font-family: arial, helvetica, sans-serif; font-size: 10pt; color: #000000"><div></div><div>Hello Yves and all,<br></div><br><div>Here's a summary of the basic differences between srun and sbatch:<br></div><br><div>https://stackoverflow.com/questions/43767866/slurm-srun-vs-sbatch-and-their-parameters<br></div><br><div>-- <br>Martin Matthiesen<br>CSC - Tieteen tietotekniikan keskus<br>CSC - IT Center for Science<br>PL 405, 02101 Espoo, Finland<br>+358 9 457 2376, martin.matthiesen@csc.fi<br>Public key : 
https://pgp.mit.edu/pks/lookup?op=get&search=0x74B12876FD890704<br>Fingerprint: AA25 6F56 5C9A 8B42 009F BA70 74B1 2876 FD89 0704</div><br><hr id="zwchr"><div><blockquote style="border-left:2px solid #1010FF;margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;"><b>From: </b>"Yves Scherrer" <yves.scherrer@helsinki.fi><br><b>To: </b>"Stephan Oepen" <oe@ifi.uio.no><br><b>Cc: </b>"Martin Matthiesen" <martin.matthiesen@csc.fi>, "infrastructure" <infrastructure@nlpl.eu><br><b>Sent: </b>Wednesday, 19 December, 2018 17:47:49<br><b>Subject: </b>RE: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)<br></blockquote></div><div><blockquote style="border-left:2px solid #1010FF;margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;">
<div lang="EN-US">
<div class="x_WordSection1">
<p class="x_MsoNormal">It looks like the “srun” (present in my script, absent in Stephan’s) was the culprit. I still have to say that I haven’t completely grasped its use – back in Theano times, it was compulsory (at least for me, but Jörg was able to run the
same jobs without it), now it seems that it must be avoided…</p>
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal">Yves</p>
<p class="x_MsoNormal"> </p>
</div>
<hr style="display:inline-block; width:98%">
<div id="x_divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" face="Calibri, sans-serif" color="#000000"><b>From:</b> Stephan Oepen <oe@ifi.uio.no><br>
<b>Sent:</b> Wednesday, December 19, 2018 3:48:05 PM<br>
<b>To:</b> Scherrer, Yves<br>
<b>Cc:</b> Martin Matthiesen; infrastructure<br>
<b>Subject:</b> Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)</font>
<div> </div>
</div>
</div>
<font size="2"><span style="font-size:11pt;">
<div class="PlainText">this is weird: as if i did not get the error, yves? i had stripped<br>
down your script to just the training; see<br>
<br>
/homeappl/home/oe/onmt.sh<br>
<br>
to confirm, i turned on that job once more earlier today (‘sbatch<br>
onmt.sh’), and it appears to be training happily for now. standard<br>
output and error from that job should be visible to you in my home<br>
directory.<br>
<br>
for ultimate comparability, could you also run that job<br>
<br>
sbatch ~oe/onmt.sh<br>
<br>
oe<br>
<br>
On Tue, Dec 18, 2018 at 3:15 PM Scherrer, Yves<br>
<yves.scherrer@helsinki.fi> wrote:<br>
><br>
> Hi,<br>
><br>
><br>
><br>
> My error occurs right away, I don’t even get these INFO messages… This is the full content of the training.*.err file:<br>
><br>
><br>
><br>
> Loading application python-3.5.3 environment with needed modules<br>
> THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=34 error=35 : CUDA driver version is insufficient for CUDA runtime version<br>
> Traceback (most recent call last):<br>
> File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 40, in <module><br>
> main(opt)<br>
> File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 27, in main<br>
> single_main(opt)<br>
> File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 73, in main<br>
> opt = training_opt_postprocessing(opt)<br>
> File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 60, in training_opt_postprocessing<br>
> torch.cuda.set_device(opt.device_id)<br>
> File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torch/cuda/__init__.py", line 264, in set_device<br>
> torch._C._cuda_setDevice(device)<br>
> RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at /pytorch/torch/csrc/cuda/Module.cpp:34<br>
> Exception ignored in: <function WeakValueDictionary.__init__.<locals>.remove at 0x7ff93231b400><br>
> Traceback (most recent call last):<br>
> File "/wrk/project_nlpl/software/opennmt-py/0.2.1/lib/python3.5/weakref.py", line 117, in remove<br>
> TypeError: 'NoneType' object is not callable<br>
> srun: error: g110: task 0: Exited with exit code 1<br>
> srun: Terminating job step 33310480.0<br>
><br>
><br>
><br>
><br>
><br>
> ________________________________<br>
> From: Stephan Oepen <oe@ifi.uio.no><br>
> Sent: Tuesday, December 18, 2018 2:56:49 PM<br>
> To: Martin Matthiesen<br>
> Cc: Scherrer, Yves; infrastructure<br>
> Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)<br>
><br>
> thanks for adjusting those permissions, yves!<br>
><br>
> roughly how long into the job would you expect the error to occur?<br>
><br>
> i have been running for around six minutes so far, and training<br>
> appears to get going:<br>
><br>
> [2018-12-18 14:47:43,683 INFO] encoder: 14116000<br>
> [2018-12-18 14:47:43,683 INFO] decoder: 25862084<br>
> [2018-12-18 14:47:43,683 INFO] * number of parameters: 39978084<br>
> /proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torch/nn/_reduction.py:49:<br>
> UserWarning: size_average and reduce args will be deprecated, please<br>
> use reduction='sum' instead.<br>
> warnings.warn(warning.format(ret))<br>
> [2018-12-18 14:47:43,685 INFO] Start training...<br>
> [2018-12-18 14:47:43,707 INFO] Loading train dataset from<br>
> data.train.1.pt, number of examples: 1030<br>
> /proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torch/nn/functional.py:1320:<br>
> UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead.<br>
> warnings.warn("nn.functional.tanh is deprecated. Use torch.tanh instead.")<br>
> [2018-12-18 14:49:15,649 INFO] Loading train dataset from<br>
> data.train.10.pt, number of examples: 1162<br>
> [2018-12-18 14:50:55,474 INFO] Loading train dataset from<br>
> data.train.100.pt, number of examples: 1199<br>
> [2018-12-18 14:52:13,191 INFO] Step 50/100000; acc: 5.83; ppl:<br>
> 5884.51; xent: 8.68; lr: 1.00000; 272/262 tok/s; 269 sec<br>
> [2018-12-18 14:52:38,496 INFO] Loading train dataset from<br>
> data.train.1000.pt, number of examples: 1216<br>
><br>
> but earlier you had sent a traceback involving a function called<br>
> training_opt_postprocessing() ... so maybe the error only occurs<br>
> towards the end of training? which would seem pretty weird, seeing as<br>
> i suppose PyTorch has been used extensively up to that point already?<br>
><br>
> oe<br>
><br>
><br>
><br>
> On Tue, Dec 18, 2018 at 10:11 AM Martin Matthiesen<br>
> <martin.matthiesen@csc.fi> wrote:<br>
> ><br>
> > Hi,<br>
> ><br>
> > I did try for an hour and a bit yesterday to pinpoint the problem, but could not make head or tail of it. Did I understand correctly that Stephan, you got this working on Taito?<br>
> ><br>
> > Martin<br>
> ><br>
> > P.S.: Should we keep infrastructure out of this or is this interesting to Jörg and Björn?<br>
> ><br>
> > ________________________________<br>
> ><br>
> > From: "Yves Scherrer" <yves.scherrer@helsinki.fi><br>
> > To: "Stephan Oepen" <oe@ifi.uio.no><br>
> > Cc: "Martin Matthiesen" <martin.matthiesen@csc.fi>, "infrastructure" <infrastructure@nlpl.eu><br>
> > Sent: Tuesday, 18 December, 2018 10:35:25<br>
> > Subject: RE: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)<br>
> ><br>
> > Hi,<br>
> ><br>
> ><br>
> ><br>
> > > could you make the complete data directory group- or world-readable,<br>
> > > so i can try running the ‘train.py’ script without creating my own<br>
> > > copy of the data?<br>
> ><br>
> ><br>
> ><br>
> > That should work now.<br>
> ><br>
> ><br>
> ><br>
> > > thinking (possibly over-)optimistically, maybe the problem has<br>
> > > magically disappeared already?<br>
> ><br>
> ><br>
> ><br>
> > Unfortunately, it hasn’t. Or was I supposed to reinstall the OpenNMT-py module locally?<br>
> ><br>
> ><br>
> ><br>
> > Yves<br>
> ><br>
> ><br>
> ><br>
> ><br>
</div>
</span></font></blockquote></div></div><br></blockquote></div></div></body></html>