Hi,

Thanks for looking into this, Stephan. I confirm that the 201811
version of PyTorch is running correctly on GPU.

I have no objection to replacing the current 0.4.1 installation with a
working one.

Best,
Yves

________________________________
From: Stephan Oepen <oe@ifi.uio.no>
Sent: Wednesday, November 28, 2018 7:45:03 PM
To: Scherrer, Yves; Eivind Alexander Bergem
Cc: infrastructure
Subject: PyTorch on Taito gpu nodes

<div class="PlainText">hi yves and eivind,<br>
<br>
you both discovered independently today that the default NLPL<br>
installation of PyTorch on Taito appears to have lost its support for<br>
gpu nodes. the software has not changed since october, so i suspect<br>
that some system-wide upgrade of the nvidia drivers or cuda libraries<br>
may be the cause of these problems. i have been unable to fully track<br>
down what happened, but ...<br>

it (somewhat surprisingly) appears that re-doing my original PyTorch
installation recipe (using the exact same explicit dependencies as
before, i.e. python-env/3.5.3 and cuda/9.0) results in a functional
PyTorch installation again. for tonight, i am keeping the original
(gpu-dysfunctional) version around for further debugging. but please
try the following:

  $ module purge; module load nlpl-pytorch/201811
  $ srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t 15 \
      python3 /proj/nlpl/software/pytorch/0.4.1/test.py
  True
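
(judging from its True/False output, test.py presumably amounts to
something like the following; the actual contents of
/proj/nlpl/software/pytorch/0.4.1/test.py are assumed here, not
copied:)

  import torch
  # report whether pytorch can see a usable cuda driver and device
  print(torch.cuda.is_available())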

martin, do you think it is worth checking with the CSC cuda wizards?
they might be in a much better position to guess which system-wide
external parameters have changed recently. to reproduce the problem,
replace the module version ‘201811’ with ‘0.4.1’; the above test
script should then output False, i.e. cuda is not available.
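
spelled out, the failing case is the same invocation with only the
module version changed:

  $ module purge; module load nlpl-pytorch/0.4.1
  $ srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t 15 \
      python3 /proj/nlpl/software/pytorch/0.4.1/test.py
  False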

when debugging earlier today, yves observed that our installation of
OpenNMT-py (which is built on top of PyTorch) complains:

  File "/proj/nlpl/software/pytorch/0.4.1/lib/python3.5/site-packages/torch/cuda/__init__.py", line 262, in set_device
    torch._C._cuda_setDevice(device)
  RuntimeError: cuda runtime error (35) : CUDA driver version is
  insufficient for CUDA runtime version at torch/csrc/cuda/Module.cpp:32

i am all but certain that this is the same root problem: when
initializing cuda from OpenNMT-py we somehow run into a full-blown
exception, whereas my PyTorch test script merely reports cuda as not
being available.
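
(that difference is plausible: torch.cuda.is_available() is designed
to return False rather than raise when the driver cannot be used,
while torch.cuda.set_device() forces cuda initialization and
propagates the failure. a minimal illustration, assuming the broken
0.4.1 module is loaded:)

  import torch
  # probes for a usable driver; swallows the failure and returns False
  print(torch.cuda.is_available())   # prints: False
  # forces cuda initialization; the driver mismatch surfaces as an exception
  torch.cuda.set_device(0)           # RuntimeError: cuda runtime error (35)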

i had originally promised myself, and everyone who would listen, that
module installations must remain unchanged once announced publicly.
fixing what appears to be a fatal (if mysterious) flaw in our original
PyTorch 0.4.1 module, however, may warrant an exception to that
policy. if we fail to work out in the next few days what exactly
caused the recent loss of gpu support in that module, i think i will
just wipe the 0.4.1 installation and replace it with a fresh
re-installation (which should be functionally equivalent).

infrastructure task force: is everyone okay with that approach, in principle?

cheers, oe