<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
<meta name="Generator" content="Microsoft Exchange Server">
<!-- converted from text --><style><!-- .EmailQuote { margin-left: 1pt; padding-left: 4pt; border-left: #800000 2px solid; } --></style>
</head>
<body>
<style>
<!--
@font-face
{font-family:"Cambria Math"}
@font-face
{font-family:Calibri}
p.x_MsoNormal, li.x_MsoNormal, div.x_MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif}
a:link, span.x_MsoHyperlink
{color:blue;
text-decoration:underline}
a:visited, span.x_MsoHyperlinkFollowed
{color:#954F72;
text-decoration:underline}
.x_MsoChpDefault
{}
@page WordSection1
{margin:70.85pt 56.7pt 70.85pt 56.7pt}
div.x_WordSection1
{}
-->
</style>
<div lang="EN-US" link="blue" vlink="#954F72">
<div class="x_WordSection1">
<p class="x_MsoNormal">Hi again,</p>
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal">OpenNMT-py picks up the new version of PyTorch, but the CUDA error that I reported a few weeks back is still there:</p>
<p class="x_MsoNormal"> </p>
<pre>
THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=32 error=35 : CUDA driver version is insufficient for CUDA runtime version
Traceback (most recent call last):
  File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 40, in &lt;module&gt;
    main(opt)
  File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 27, in main
    single_main(opt)
  File "/wrk/project_nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 73, in main
    opt = training_opt_postprocessing(opt)
  File "/wrk/project_nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 60, in training_opt_postprocessing
    torch.cuda.set_device(opt.device_id)
  File "/proj/nlpl/software/pytorch/201811/lib/python3.5/site-packages/torch/cuda/__init__.py", line 262, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at torch/csrc/cuda/Module.cpp:32
</pre>
<p class="x_MsoNormal">srun: error: g110: task 0: Exited with exit code 1</p>
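For what it's worth, error 35 is cudaErrorInsufficientDriver: the kernel driver on the node advertises an older CUDA version than the one the runtime was built against. The comparison behind it is just integer arithmetic on CUDA's encoded version numbers (major*1000 + minor*10); a toy illustration with made-up version numbers, not the actual driver check:

```python
# Toy illustration of the check behind CUDA error 35
# ("driver version is insufficient for runtime version").
# CUDA encodes versions as major*1000 + minor*10, e.g. 9.0 -> 9000.

def encode(major, minor):
    """Encode a CUDA version the way the toolkit does."""
    return major * 1000 + minor * 10

def driver_supports_runtime(driver_version, runtime_version):
    """The driver must report a CUDA version at least as new as the runtime."""
    return driver_version >= runtime_version

# Hypothetical numbers: a driver stuck at CUDA 8.0 cannot serve a 9.0 runtime.
print(driver_supports_runtime(encode(8, 0), encode(9, 0)))  # False
print(driver_supports_runtime(encode(9, 1), encode(9, 0)))  # True
```

On the node itself, recent versions of nvidia-smi report the highest CUDA version the installed driver supports, which can be compared against the toolkit version the module was built with (cuda/9.0 in this thread).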
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal">So in the end it looks like this error might be unrelated to the PyTorch issue we’ve seen lately. Please let me know if I can help with debugging this.</p>
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal">Yves</p>
<p class="x_MsoNormal"> </p>
</div>
<hr tabindex="-1" style="display:inline-block; width:98%">
<div id="x_divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" color="#000000" style="font-size:11pt"><b>From:</b> Stephan Oepen <oe@ifi.uio.no><br>
<b>Sent:</b> Monday, December 10, 2018 11:35:05 PM<br>
<b>To:</b> Scherrer, Yves<br>
<b>Cc:</b> infrastructure<br>
<b>Subject:</b> Re: PyTorch on Taito gpu nodes</font>
<div> </div>
</div>
</div>
<font size="2"><span style="font-size:11pt;">
<div class="PlainText">since ‘torchtext’ is in our current nlpl-opennmt-py installation, i<br>
added its dependency ‘requests’ to OpenNMT-py just now. ultimately,<br>
however, it sounds as if both should perhaps be part of the base PyTorch<br>
...<br>
<br>
so please give it another shot! oe<br>
<br>
<br>
<br>
On Mon, Dec 10, 2018 at 10:26 PM Scherrer, Yves<br>
<yves.scherrer@helsinki.fi> wrote:<br>
><br>
> Hi Stephan,<br>
><br>
> It almost works… I’m getting the following error now:<br>
><br>
> File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torchtext/utils.py", line 2, in &lt;module&gt;<br>
> import requests<br>
> ImportError: No module named 'requests'<br>
><br>
> The situation is the following:<br>
> - The opennmt-py/0.2.1 environment contains ‘torchtext’, but not its dependency ‘requests’.<br>
> - The (current) pytorch/201811 environment contains neither ‘torchtext’ nor its dependency ‘requests’.<br>
> - The (old) pytorch/0.4.1 environment does not contain ‘torchtext’, but contains ‘requests’, which explains why it worked before.<br>
><br>
> I don’t know if it makes more sense to add the requests module to the pytorch environment or to the opennmt-py one...<br>
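In case it helps the triage, the per-environment presence checks above can be automated with nothing but the standard library; the module names below are the ones from this thread, and the output of course depends on which environment is loaded:

```python
# Probe whether a package is importable in the current environment,
# without actually importing it (so a broken install cannot crash us).
import importlib.util

def has_module(name):
    """True if `name` is findable on sys.path."""
    return importlib.util.find_spec(name) is not None

# Module names taken from the thread; results depend on the environment.
for name in ("torchtext", "requests"):
    print(name, has_module(name))
```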
><br>
> Best,<br>
> Yves<br>
><br>
> > On 10 Dec 2018, at 22:45, Stephan Oepen <oe@ifi.uio.no> wrote:<br>
> ><br>
> > hi yves,<br>
> ><br>
> > no, i have not heard more on this issue, but i would expect<br>
> > nlpl-opennmt-py to work already now: it loads the default version of<br>
> > nlpl-pytorch, which currently (still) is the one i installed recently,<br>
> > i.e. the functional one. could you give that a try please?<br>
> ><br>
> > best, oe<br>
> ><br>
> > On Mon, Dec 10, 2018 at 9:22 PM Scherrer, Yves<br>
> > <yves.scherrer@helsinki.fi> wrote:<br>
> >><br>
> >> Hi Stephan, all,<br>
> >><br>
> >> Has there been any recent activity on this matter? Did you get any news from CSC about some internal changes they made? I am asking because I might need a working OpenNMT-py install soon for some of the December NLPL milestones…<br>
> >><br>
> >> Best,<br>
> >> Yves<br>
> >><br>
> >>> On 28 Nov 2018, at 19:45, Stephan Oepen <oe@ifi.uio.no> wrote:<br>
> >>><br>
> >>> hi yves and eivind,<br>
> >>><br>
> >>> you both discovered independently today that the default NLPL<br>
> >>> installation of PyTorch on Taito appears to have lost its support for<br>
> >>> gpu nodes. the software has not changed since october, so i suspect<br>
> >>> that some system-wide upgrade of the nvidia drivers or cuda libraries<br>
> >>> may be the cause of these problems. i have been unable to fully track<br>
> >>> down what happened, but ...<br>
> >>><br>
> >>> it (somewhat surprisingly) appears to be the case that re-doing my<br>
> >>> original PyTorch installation recipe (using the exact same explicit<br>
> >>> dependencies as before, i.e. python-env/3.5.3 and cuda/9.0) results in<br>
> >>> a functional PyTorch installation again. for tonight, i am keeping<br>
> >>> the original (gpu-dysfunctional) version for further debugging. but<br>
> >>> please try the following:<br>
> >>><br>
> >>> $ module purge; module load nlpl-pytorch/201811<br>
> >>> $ srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t 15 \<br>
> >>> python3 /proj/nlpl/software/pytorch/0.4.1/test.py<br>
> >>> True<br>
> >>><br>
> >>> martin, do you think it is worth checking with the CSC cuda<br>
> >>> wizards? they might be in a much better position to guess which<br>
> >>> system-wide external parameters have changed recently. to experience<br>
> >>> the problem, replace the module version ‘201811’ with ‘0.4.1’; the<br>
> >>> above test script should output False, i.e. cuda is not available.<br>
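As an aside, and purely as an assumption about its contents: a minimal test.py of this kind would just print torch.cuda.is_available(). A hypothetical guarded sketch, not the actual script:

```python
# Hypothetical sketch of a minimal GPU smoke test like the test.py above.
# Assumption: the real script simply reports torch.cuda.is_available().
def cuda_available():
    try:
        import torch
    except ImportError:
        return False  # no PyTorch on the path at all
    return torch.cuda.is_available()

if __name__ == "__main__":
    print(cuda_available())
```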
> >>><br>
> >>> when debugging earlier today, yves observed that our installation of<br>
> >>> OpenNMT-py (which is built on top of PyTorch) complains:<br>
> >>><br>
> >>> File "/proj/nlpl/software/pytorch/0.4.1/lib/python3.5/site-packages/torch/cuda/__init__.py",<br>
> >>> line 262, in set_device<br>
> >>><br>
> >>> torch._C._cuda_setDevice(device)<br>
> >>><br>
> >>> RuntimeError: cuda runtime error (35) : CUDA driver version is<br>
> >>> insufficient for CUDA runtime version at torch/csrc/cuda/Module.cpp:32<br>
> >>><br>
> >>> i am all but certain that this is the same root problem, only when<br>
> >>> trying to initialize from OpenNMT-py we somehow run into a full-blown<br>
> >>> exception, whereas my PyTorch test script merely reports cuda as not<br>
> >>> being available.<br>
> >>><br>
> >>> i had originally promised myself and everyone who would listen that<br>
> >>> module installations must remain unchanged, once announced publicly.<br>
> >>> fixing what appears to be a fatal (if mysterious) flaw in our original<br>
> >>> PyTorch 0.4.1 module, however, may present an exception to that<br>
> >>> policy. if we fail to work out what exactly caused the recent loss of<br>
> >>> gpu support in that module in the next few days, i think i will just<br>
> >>> wipe out the 0.4.1 installation and replace it with a fresh<br>
> >>> re-installation (which must be expected to be functionally<br>
> >>> equivalent).<br>
> >>><br>
> >>> is everyone okay with that approach, in principle, infrastructure task force?<br>
> >>><br>
> >>> cheers, oe<br>
> >><br>
><br>
</div>
</span></font>
</body>
</html>