<html><body><div style="font-family: arial, helvetica, sans-serif; font-size: 10pt; color: #000000"><div><style>
<!--
@font-face
{font-family:"Cambria Math"}
@font-face
{font-family:Calibri}
p.x_MsoNormal, li.x_MsoNormal, div.x_MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif}
a:link, span.x_MsoHyperlink
{color:blue;
text-decoration:underline}
a:visited, span.x_MsoHyperlinkFollowed
{color:#954F72;
text-decoration:underline}
.x_MsoChpDefault
{}
@page WordSection1
{margin:70.85pt 56.7pt 70.85pt 56.7pt}
div.x_WordSection1
{}
-->
</style><style><!-- .EmailQuote { margin-left: 1pt; padding-left: 4pt; border-left: #800000 2px solid; } --></style></div><div>Hi Yves,<br></div><div><br data-mce-bogus="1"></div><div>I am sorry for the long silence. I wanted to ask: are the problems you reported still current? If so, could you send me a hint on how to reproduce them?<br data-mce-bogus="1"></div><div><br data-mce-bogus="1"></div><div>Regards,<br data-mce-bogus="1"></div><div>Martin<br data-mce-bogus="1"></div><div><br></div><div data-marker="__SIG_PRE__">-- <br>Martin Matthiesen<br>CSC - Tieteen tietotekniikan keskus<br>CSC - IT Center for Science<br>PL 405, 02101 Espoo, Finland<br>+358 9 457 2376, martin.matthiesen@csc.fi<br>Public key : https://pgp.mit.edu/pks/lookup?op=get&search=0x74B12876FD890704<br>Fingerprint: AA25 6F56 5C9A 8B42 009F BA70 74B1 2876 FD89 0704</div><div><br></div><hr id="zwchr" data-marker="__DIVIDER__"><div data-marker="__HEADERS__"><blockquote style="border-left:2px solid #1010FF;margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;"><b>From: </b>"Yves Scherrer" <yves.scherrer@helsinki.fi><br><b>To: </b>"Stephan Oepen" <oe@ifi.uio.no><br><b>Cc: </b>"Martin Matthiesen" <martin.matthiesen@csc.fi>, "infrastructure" <infrastructure@nlpl.eu><br><b>Sent: </b>Wednesday, 28 November, 2018 16:55:27<br><b>Subject: </b>RE: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)<br></blockquote></div><div data-marker="__QUOTED_TEXT__"><blockquote style="border-left:2px solid #1010FF;margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;">
<div lang="EN-US">
<div class="x_WordSection1">
<p class="x_MsoNormal">Curiously enough, when I run the OpenNMT training script that worked fine a month ago, I get this error now:</p>
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal">THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=32 error=35 : CUDA driver version is insufficient for CUDA runtime version</p>
<p class="x_MsoNormal">Traceback (most recent call last):</p>
<p class="x_MsoNormal"> File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 40, in <module></p>
<p class="x_MsoNormal"> main(opt)</p>
<p class="x_MsoNormal"> File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 27, in main</p>
<p class="x_MsoNormal"> single_main(opt)</p>
<p class="x_MsoNormal"> File "/wrk/project_nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 73, in main</p>
<p class="x_MsoNormal"> opt = training_opt_postprocessing(opt)</p>
<p class="x_MsoNormal"> File "/wrk/project_nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 60, in training_opt_postprocessing</p>
<p class="x_MsoNormal"> torch.cuda.set_device(opt.device_id)</p>
<p class="x_MsoNormal"> File "/proj/nlpl/software/pytorch/0.4.1/lib/python3.5/site-packages/torch/cuda/__init__.py", line 262, in set_device</p>
<p class="x_MsoNormal"> torch._C._cuda_setDevice(device)</p>
<p class="x_MsoNormal">RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at torch/csrc/cuda/Module.cpp:32</p>
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal">I have no idea if this is related to the PyTorch issue, but could it be that some CUDA code got updated on Taito in the meantime?</p>
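For reference, "cuda runtime error (35)" is raised exactly when the NVIDIA kernel driver on the node is older than the minimum required by the CUDA runtime that the PyTorch build links against, so a driver change on the GPU nodes would indeed produce this. A minimal sketch of that check; the minimum-driver numbers are taken from NVIDIA's CUDA compatibility table, and which CUDA version the installed PyTorch 0.4.1 wheel actually uses is an assumption here:

```python
# Sketch: does a given NVIDIA driver satisfy a given CUDA runtime?
# Error 35 ("driver version is insufficient") occurs exactly when it does not.
# Minimum driver versions below are from NVIDIA's CUDA compatibility table.
MIN_DRIVER = {
    "8.0": (375, 26),
    "9.0": (384, 81),
    "9.1": (387, 26),
    "9.2": (396, 26),
}

def driver_sufficient(cuda_runtime, driver):
    """True if a driver version string (e.g. "384.81") can serve the
    given CUDA runtime (e.g. "9.0")."""
    parts = tuple(int(x) for x in driver.split(".")[:2])
    return parts >= MIN_DRIVER[cuda_runtime]

# For example, a node still running a 375.x driver cannot serve a CUDA 9.0 build:
print(driver_sufficient("9.0", "375.66"))  # False -> error 35
print(driver_sufficient("9.0", "390.30"))  # True
```

On the cluster itself, comparing the driver version printed by `nvidia-smi` on a GPU node against this table would confirm or rule out a driver change.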
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal">Yves</p>
<p class="x_MsoNormal"> </p>
</div>
<hr style="display:inline-block; width:98%">
<div id="x_divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" face="Calibri, sans-serif" color="#000000"><b>From:</b> Stephan Oepen <oe@ifi.uio.no><br>
<b>Sent:</b> Wednesday, November 28, 2018 4:43:47 PM<br>
<b>To:</b> Scherrer, Yves<br>
<b>Cc:</b> Martin Matthiesen; infrastructure<br>
<b>Subject:</b> Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)</font>
<div> </div>
</div>
</div>
<font size="2"><span style="font-size:11pt;">
<div class="PlainText">well, then it should not be too hard to get the PyTorch installation<br>
on Taito to work on the gpu nodes :-).<br>
<br>
i will have a look now ...<br>
<br>
oe<br>
<br>
On Wed, Nov 28, 2018 at 3:38 PM Scherrer, Yves<br>
<yves.scherrer@helsinki.fi> wrote:<br>
><br>
> Hi,<br>
><br>
><br>
><br>
> I did my OpenNMT-py experiments on both Abel and Taito.<br>
><br>
> On Taito, I got training speeds of about 13000 tokens/s, on Abel it was about 4000 tokens/s.<br>
><br>
> A colleague who used an independent OpenNMT-py module on Taito-GPU during the summer obtained about 9000 tokens/s with a different dataset.<br>
><br>
> I also just started a CPU-only training run on Taito, which got around 1000 tokens/s.<br>
><br>
> This leads me to believe that my experiments – at least those on Taito – did use the GPU…<br>
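Making the comparison above explicit (all figures are the ones quoted in this thread; a thirteen-fold gap over the CPU-only run is indeed hard to explain without the GPU):

```python
# Tokens/s figures reported in this thread, relative to the CPU-only baseline.
speeds = {
    "Taito GPU": 13000,
    "Abel": 4000,
    "Taito GPU (colleague, other data)": 9000,
    "Taito CPU-only": 1000,
}
baseline = speeds["Taito CPU-only"]
for name, rate in sorted(speeds.items()):
    print("%s: %.0fx the CPU-only rate" % (name, rate / baseline))
```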
><br>
><br>
><br>
> Best,<br>
><br>
> Yves<br>
><br>
><br>
><br>
> ________________________________<br>
> From: Stephan Oepen <oe@ifi.uio.no><br>
> Sent: Wednesday, November 28, 2018 4:08:46 PM<br>
> To: Scherrer, Yves<br>
> Cc: Martin Matthiesen; infrastructure<br>
> Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)<br>
><br>
> as for the OpenNMT-py experiments, did you do those on Abel or Taito,<br>
> or both? using gpus on Taito? in other words, do you believe that<br>
> OpenNMT-py (in contrast to PyTorch) works on Taito gpu nodes?<br>
><br>
> oe<br>
><br>
> On Wed, Nov 28, 2018 at 2:47 PM Scherrer, Yves<br>
> <yves.scherrer@helsinki.fi> wrote:<br>
> ><br>
> > Hi,<br>
> ><br>
> ><br>
> ><br>
> > I’m following up on this one with a related issue. I am testing PyTorch independently of OpenNMT-py, but cannot get it to run on (Taito-)GPU.<br>
> ><br>
> ><br>
> ><br>
> > Specifically, although I am logged in to Taito-GPU, I cannot get the test script described on the Wiki page to return True:<br>
> ><br>
> ><br>
> ><br>
> > [GPU-Env lstmtagger]$ srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t 15 --pty python3 /proj/nlpl/software/pytorch/0.4.1/test.py<br>
> ><br>
> > srun: job 32089470 queued and waiting for resources<br>
> ><br>
> > srun: job 32089470 has been allocated resources<br>
> ><br>
> > False<br>
> ><br>
> ><br>
> ><br>
> > I also get ‘False’ when running the following script through sbatch:<br>
> ><br>
> ><br>
> ><br>
> > #!/bin/bash<br>
> ><br>
> > #SBATCH -J cudatest<br>
> ><br>
> > #SBATCH -o cudatest.%j.out<br>
> ><br>
> > #SBATCH -e cudatest.%j.err<br>
> ><br>
> > #SBATCH -t 0:05:00<br>
> ><br>
> > #SBATCH -p gputest<br>
> ><br>
> > #SBATCH -N 1<br>
> ><br>
> > #SBATCH --gres=gpu:k80:1<br>
> ><br>
> > #SBATCH --mem=1g<br>
> ><br>
> > module use -a /proj/nlpl/software/modulefiles/<br>
> ><br>
> > module load nlpl-pytorch<br>
> ><br>
> > srun python3 /proj/nlpl/software/pytorch/0.4.1/test.py<br>
> ><br>
> ><br>
> ><br>
> > Has there been any change lately? Or am I missing something obvious?<br>
> ><br>
> ><br>
> ><br>
> > Best,<br>
> ><br>
> > Yves<br>
> ><br>
> ><br>
> ><br>
> ><br>
> ><br>
> > ________________________________<br>
> > From: Stephan Oepen <oe@ifi.uio.no><br>
> > Sent: Wednesday, September 26, 2018 11:10:12 PM<br>
> > To: Scherrer, Yves<br>
> > Cc: Martin Matthiesen; infrastructure<br>
> > Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)<br>
> ><br>
> > hi again,<br>
> ><br>
> > > i actually had a go at my own glibc and PyTorch installations on Taito, but<br>
> > > so far gpu support is elusive.<br>
> ><br>
> > actually, with a little more tinkering, i now believe i might have a<br>
> > working installation of PyTorch 0.4.1 and OpenNMT-py 0.2.1 on Taito<br>
> > too, seemingly functional on both cpu and gpu nodes:<br>
> ><br>
> > [oe@taito-login4 ~]$ module purge<br>
> > [oe@taito-login4 ~]$ module load nlpl-opennmt-py<br>
> > Loading application python-3.5.3 environment with needed modules<br>
> > [oe@taito-login4 ~]$ module list<br>
> ><br>
> > Currently Loaded Modules:<br>
> > 1) gcc/5.4.0 2) intelmpi/5.1.3 3) mkl/11.3.2 4) python/3.5.3<br>
> > 5) python-env/3.5.3 6) nlpl-pytorch/0.4.1 7) nlpl-opennmt-py/0.2.1<br>
> ><br>
> > [oe@taito-login4 ~]$ type -all python<br>
> > python is /proj/nlpl/software/opennmt-py/0.2.1/bin/python<br>
> > python is /proj/nlpl/software/pytorch/0.4.1/bin/python<br>
> > python is /appl/opt/python/3.5.3-gnu540/bin/python<br>
> > python is /usr/bin/python<br>
> > [oe@taito-login4 ~]$ python -c "import torch; import onmt;<br>
> > print(torch.cuda.is_available());"<br>
> > False<br>
> ><br>
> > [oe@taito-login4 ~]$ srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t<br>
> > 15 --pty \<br>
> > python -c "import torch; import onmt; print(torch.cuda.is_available());"<br>
> > True<br>
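The False-on-login-node / True-under-srun contrast in the transcript above is the expected behaviour: the login nodes carry no GPUs, and inside an allocation Slurm exposes the granted cards to the job through `CUDA_VISIBLE_DEVICES`. A small sketch of that mechanism; the helper and its use are illustrative, not part of the installation:

```python
import os

def slurm_visible_gpus():
    """GPU ids exposed to this process via CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset (CUDA then sees every device
    physically present on the node, which on a login node is none at all),
    and a list of ids when Slurm has set it for an allocation."""
    value = os.environ.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return [v.strip() for v in value.split(",") if v.strip()]

# What an allocation with --gres=gpu:k80:1 typically yields:
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
print(slurm_visible_gpus())  # ['0']
```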
> ><br>
> > —yves (or joerg), i would have a hard time testing things in much more<br>
> > depth. any chance you would have some time to try and replicate the<br>
> > validation steps you are currently running on Abel on Taito too?<br>
> ><br>
> > with a sense of accomplishment :-), oe<br>
</div>
</span></font><br></blockquote></div></div></body></html>