<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
</head>
<body>
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:#954F72;
text-decoration:underline;}
p.xmsonormal, li.xmsonormal, div.xmsonormal
{mso-style-name:x_msonormal;
margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
.MsoChpDefault
{mso-style-type:export-only;}
@page WordSection1
{size:8.5in 11.0in;
margin:70.85pt 56.7pt 70.85pt 56.7pt;}
div.WordSection1
{page:WordSection1;}
--></style>
<div class="WordSection1">
<p class="MsoNormal">Hi Martin,</p>
<p class="MsoNormal">&nbsp;</p>
<p class="MsoNormal">Yes, the problem still occurs. Please have a look at the train.sh SLURM script in /wrk/yvessche/onmt_test3 – this script worked fine when Stephan first installed OpenNMT, but has been failing for the last couple of weeks.</p>
<p class="MsoNormal">&nbsp;</p>
<p class="MsoNormal">Best,</p>
<p class="MsoNormal">Yves</p>
<p class="MsoNormal">&nbsp;</p>
</div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Martin Matthiesen &lt;martin.matthiesen@csc.fi&gt;<br>
<b>Sent:</b> Monday, December 17, 2018 1:36:32 PM<br>
<b>To:</b> Scherrer, Yves<br>
<b>Cc:</b> Stephan Oepen; infrastructure<br>
<b>Subject:</b> Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)</font>
<div> </div>
</div>
<div>
<div style="font-family: arial, helvetica, sans-serif; font-size: 10pt; color: #000000">
<div><style>
<!--
@font-face
{font-family:"Cambria Math"}
@font-face
{font-family:Calibri}
p.x_MsoNormal, li.x_MsoNormal, div.x_MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif}
a:link, span.x_MsoHyperlink
{color:blue;
text-decoration:underline}
a:visited, span.x_MsoHyperlinkFollowed
{color:#954F72;
text-decoration:underline}
.x_MsoChpDefault
{}
@page WordSection1
{margin:70.85pt 56.7pt 70.85pt 56.7pt}
div.x_WordSection1
{}
-->
</style><style><!-- .EmailQuote { margin-left: 1pt; padding-left: 4pt; border-left: #800000 2px solid; } --></style></div>
<div>Hi Yves,<br>
</div>
<div><br data-mce-bogus="1">
</div>
<div>I am sorry for the long silence. I wanted to ask: are your problems still current? If so, could you send me a hint on how to reproduce them?<br data-mce-bogus="1">
</div>
<div><br data-mce-bogus="1">
</div>
<div>Regards,<br data-mce-bogus="1">
</div>
<div>Martin<br data-mce-bogus="1">
</div>
<div><br>
</div>
<div data-marker="__SIG_PRE__">-- <br>
Martin Matthiesen<br>
CSC - Tieteen tietotekniikan keskus<br>
CSC - IT Center for Science<br>
PL 405, 02101 Espoo, Finland<br>
+358 9 457 2376, martin.matthiesen@csc.fi<br>
Public key : https://pgp.mit.edu/pks/lookup?op=get&amp;search=0x74B12876FD890704<br>
Fingerprint: AA25 6F56 5C9A 8B42 009F BA70 74B1 2876 FD89 0704</div>
<div><br>
</div>
<hr id="zwchr" data-marker="__DIVIDER__">
<div data-marker="__HEADERS__">
<blockquote style="border-left:2px solid #1010FF;margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;">
<b>From: </b>"Yves Scherrer" &lt;yves.scherrer@helsinki.fi&gt;<br>
<b>To: </b>"Stephan Oepen" &lt;oe@ifi.uio.no&gt;<br>
<b>Cc: </b>"Martin Matthiesen" &lt;martin.matthiesen@csc.fi&gt;, "infrastructure" &lt;infrastructure@nlpl.eu&gt;<br>
<b>Sent: </b>Wednesday, 28 November, 2018 16:55:27<br>
<b>Subject: </b>RE: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)<br>
</blockquote>
</div>
<div data-marker="__QUOTED_TEXT__">
<blockquote style="border-left:2px solid #1010FF;margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;">
<div lang="EN-US">
<div class="x_WordSection1">
<p class="x_MsoNormal">Curiously enough, when I run the OpenNMT training script that worked fine a month ago, I get this error now:</p>
<p class="x_MsoNormal"> </p>
<pre>THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=32 error=35 : CUDA driver version is insufficient for CUDA runtime version
Traceback (most recent call last):
  File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 40, in &lt;module&gt;
    main(opt)
  File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 27, in main
    single_main(opt)
  File "/wrk/project_nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 73, in main
    opt = training_opt_postprocessing(opt)
  File "/wrk/project_nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 60, in training_opt_postprocessing
    torch.cuda.set_device(opt.device_id)
  File "/proj/nlpl/software/pytorch/0.4.1/lib/python3.5/site-packages/torch/cuda/__init__.py", line 262, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at torch/csrc/cuda/Module.cpp:32</pre>
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal">I have no idea if this is related to the PyTorch issue, but could it be that some CUDA code got updated on Taito in the meantime?</p>
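<p class="x_MsoNormal">For what it's worth, error 35 just means the node's driver supports an older CUDA version than the one the PyTorch build was compiled against; the check boils down to a version comparison. A minimal sketch of that logic (the version numbers below are illustrative, not Taito's actual ones):</p>

```python
def driver_supports_runtime(driver_cuda: str, runtime_cuda: str) -> bool:
    """A CUDA runtime build works only if the installed driver supports
    at least that CUDA version; otherwise the runtime raises error 35."""
    parse = lambda v: tuple(int(part) for part in v.split("."))
    return parse(driver_cuda) >= parse(runtime_cuda)

# Illustrative: a driver capped at CUDA 8.0 cannot serve a CUDA 9.0 build.
print(driver_supports_runtime("8.0", "9.0"))  # → False
print(driver_supports_runtime("9.1", "9.0"))  # → True
```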
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal">Yves</p>
<p class="x_MsoNormal"> </p>
</div>
<hr style="display:inline-block; width:98%">
<div id="x_divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" face="Calibri, sans-serif" color="#000000"><b>From:</b> Stephan Oepen &lt;oe@ifi.uio.no&gt;<br>
<b>Sent:</b> Wednesday, November 28, 2018 4:43:47 PM<br>
<b>To:</b> Scherrer, Yves<br>
<b>Cc:</b> Martin Matthiesen; infrastructure<br>
<b>Subject:</b> Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)</font>
<div> </div>
</div>
</div>
<font size="2"><span style="font-size:11pt;">
<div class="PlainText">well, then it should not be too hard to get the PyTorch installation<br>
on Taito to work on the gpu nodes :-).<br>
<br>
i will have a look now ...<br>
<br>
oe<br>
<br>
On Wed, Nov 28, 2018 at 3:38 PM Scherrer, Yves<br>
&lt;yves.scherrer@helsinki.fi&gt; wrote:<br>
><br>
> Hi,<br>
><br>
><br>
><br>
> I did my OpenNMT-py experiments on both Abel and Taito.<br>
><br>
> On Taito, I got training speeds of about 13000 tokens/s, on Abel it was about 4000 tokens/s.<br>
><br>
> A colleague who used an independent OpenNMT-py module on Taito-GPU during the summer obtained about 9000 tokens/s with a different dataset.<br>
><br>
> I also just started a CPU-only training run on Taito, which got around 1000 tokens/s.<br>
><br>
> This leads me to believe that my experiments – at least those on Taito – did use the GPU…<br>
><br>
><br>
><br>
> Best,<br>
><br>
> Yves<br>
><br>
><br>
><br>
> ________________________________<br>
> From: Stephan Oepen &lt;oe@ifi.uio.no&gt;<br>
> Sent: Wednesday, November 28, 2018 4:08:46 PM<br>
> To: Scherrer, Yves<br>
> Cc: Martin Matthiesen; infrastructure<br>
> Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)<br>
><br>
> as for the OpenNMT-py experiments, did you do those on Abel or Taito,<br>
> or both? using gpus on Taito? in other words, do you believe that<br>
> OpenNMT-py (in contrast to PyTorch) works on Taito gpu nodes?<br>
><br>
> oe<br>
><br>
> On Wed, Nov 28, 2018 at 2:47 PM Scherrer, Yves<br>
> &lt;yves.scherrer@helsinki.fi&gt; wrote:<br>
> ><br>
> > Hi,<br>
> ><br>
> ><br>
> ><br>
> > I’m following up on this one with a related issue. I am testing PyTorch independently of OpenNMT-py, but cannot get it to run on (Taito-)GPU.<br>
> ><br>
> ><br>
> ><br>
> > Specifically, even though I am logged in to Taito-GPU, I cannot get the test script described on the Wiki page to return True:<br>
> ><br>
> ><br>
> ><br>
> > [GPU-Env lstmtagger]$ srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t 15 --pty python3 /proj/nlpl/software/pytorch/0.4.1/test.py<br>
> ><br>
> > srun: job 32089470 queued and waiting for resources<br>
> ><br>
> > srun: job 32089470 has been allocated resources<br>
> ><br>
> > False<br>
> ><br>
> ><br>
> ><br>
> > I also get ‘False’ when running the following script through sbatch:<br>
> ><br>
> ><br>
> ><br>
> > #SBATCH -J cudatest<br>
> ><br>
> > #SBATCH -o cudatest.%j.out<br>
> ><br>
> > #SBATCH -e cudatest.%j.err<br>
> ><br>
> > #SBATCH -t 0:05:00<br>
> ><br>
> > #SBATCH -p gputest<br>
> ><br>
> > #SBATCH -N 1<br>
> ><br>
> > #SBATCH --gres=gpu:k80:1<br>
> ><br>
> > #SBATCH --mem=1g<br>
> ><br>
> > module use -a /proj/nlpl/software/modulefiles/<br>
> ><br>
> > module load nlpl-pytorch<br>
> ><br>
> > srun python3 /proj/nlpl/software/pytorch/0.4.1/test.py<br>
> ><br>
> ><br>
> ><br>
> > Has there been any change lately? Or am I missing something obvious?<br>
> ><br>
> ><br>
> ><br>
> > Best,<br>
> ><br>
> > Yves<br>
> ><br>
> ><br>
> ><br>
> ><br>
> ><br>
> > ________________________________<br>
> > From: Stephan Oepen &lt;oe@ifi.uio.no&gt;<br>
> > Sent: Wednesday, September 26, 2018 11:10:12 PM<br>
> > To: Scherrer, Yves<br>
> > Cc: Martin Matthiesen; infrastructure<br>
> > Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)<br>
> ><br>
> > hi again,<br>
> ><br>
> > > i actually had a go at my own glibc and PyTorch installations on Taito, but<br>
> > > so far gpu support is evasive.<br>
> ><br>
> > actually, with a little more tinkering, i now believe i might have a<br>
> > working installation of PyTorch 0.4.1 and OpenNMT-py 0.2.1 on Taito<br>
> > too, seemingly functional on both cpu and gpu nodes:<br>
> ><br>
> > [oe@taito-login4 ~]$ module purge<br>
> > [oe@taito-login4 ~]$ module load nlpl-opennmt-py<br>
> > Loading application python-3.5.3 environment with needed modules<br>
> > [oe@taito-login4 ~]$ module list<br>
> ><br>
> > Currently Loaded Modules:<br>
> > 1) gcc/5.4.0 2) intelmpi/5.1.3 3) mkl/11.3.2 4) python/3.5.3<br>
> > 5) python-env/3.5.3 6) nlpl-pytorch/0.4.1 7) nlpl-opennmt-py/0.2.1<br>
> ><br>
> > [oe@taito-login4 ~]$ type -all python<br>
> > python is /proj/nlpl/software/opennmt-py/0.2.1/bin/python<br>
> > python is /proj/nlpl/software/pytorch/0.4.1/bin/python<br>
> > python is /appl/opt/python/3.5.3-gnu540/bin/python<br>
> > python is /usr/bin/python<br>
> > [oe@taito-login4 ~]$ python -c "import torch; import onmt;<br>
> > print(torch.cuda.is_available());"<br>
> > False<br>
> ><br>
> > [oe@taito-login4 ~]$ srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t<br>
> > 15 --pty \<br>
> > python -c "import torch; import onmt; print(torch.cuda.is_available());"<br>
> > True<br>
> ><br>
> > —yves (or joerg), i would have a hard time testing things in much more<br>
> > depth. any chance you would have some time to try and replicate the<br>
> > validation steps you are currently running on Abel on Taito too?<br>
> ><br>
> > with a sense of accomplishment :-), oe<br>
</div>
</span></font><br>
</blockquote>
</div>
</div>
</div>
</body>
</html>