<html> <head> <meta http-equiv="Content-Type" content="text/html; charset=Windows-1252"> <meta name="Generator" content="Microsoft Exchange Server"> <style></style> </head> <body> <style>  </style> <div lang="EN-US" link="blue" vlink="#954F72"> <div class="x_WordSection1">
Hi,

My error occurs right away, I don’t even get these INFO messages…

This is the full content of the training.*.err file:

Loading application python-3.5.3 environment with needed modules
THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=34 error=35 : CUDA driver version is insufficient for CUDA runtime version
Traceback (most recent call last):
  File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 40, in <module>
    main(opt)
  File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 27, in main
    single_main(opt)
  File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 73, in main
    opt = training_opt_postprocessing(opt)
  File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 60, in training_opt_postprocessing
    torch.cuda.set_device(opt.device_id)
  File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torch/cuda/__init__.py", line 264, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at /pytorch/torch/csrc/cuda/Module.cpp:34
Exception ignored in: <function WeakValueDictionary.__init__.<locals>.remove at 0x7ff93231b400>
Traceback (most recent call last):
  File "/wrk/project_nlpl/software/opennmt-py/0.2.1/lib/python3.5/weakref.py", line 117, in remove
TypeError: 'NoneType' object is not callable
srun: error: g110: task 0: Exited with exit code 1
srun: Terminating job step 33310480.0
</div> <hr tabindex="-1" style="display:inline-block; width:98%"> <div id="x_divRplyFwdMsg" dir="ltr">From: Stephan Oepen <oe@ifi.uio.no>
Sent: Tuesday, December 18, 2018 2:56:49 PM
To: Martin Matthiesen
Cc: Scherrer, Yves; infrastructure
Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel) <div> </div> </div> </div> <div class="PlainText">thanks for adjusting those permissions, yves!

roughly how long into the job would you expect the error to occur? i have been running for around six minutes so far, and training appears to get going:

[2018-12-18 14:47:43,683 INFO] encoder: 14116000
[2018-12-18 14:47:43,683 INFO] decoder: 25862084
[2018-12-18 14:47:43,683 INFO] * number of parameters: 39978084
/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torch/nn/_reduction.py:49: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
[2018-12-18 14:47:43,685 INFO] Start training...
[2018-12-18 14:47:43,707 INFO] Loading train dataset from data.train.1.pt, number of examples: 1030
/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torch/nn/functional.py:1320: UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead.
  warnings.warn("nn.functional.tanh is deprecated. Use torch.tanh instead.")
[2018-12-18 14:49:15,649 INFO] Loading train dataset from data.train.10.pt, number of examples: 1162
[2018-12-18 14:50:55,474 INFO] Loading train dataset from data.train.100.pt, number of examples: 1199
[2018-12-18 14:52:13,191 INFO] Step 50/100000; acc: 5.83; ppl: 5884.51; xent: 8.68; lr: 1.00000; 272/262 tok/s; 269 sec
[2018-12-18 14:52:38,496 INFO] Loading train dataset from data.train.1000.pt, number of examples: 1216

but earlier you had sent a traceback involving a function called training_opt_postprocessing() ... so maybe the error only occurs towards the end of training? which would seem pretty weird, seeing as i suppose PyTorch has been used extensively up to that point already?

oe

On Tue, Dec 18, 2018 at 10:11 AM Martin Matthiesen <martin.matthiesen@csc.fi> wrote:
>
> Hi,
>
> I did try for an hour and a bit yesterday to pinpoint the problem, but could not make head or tail of it. Stephan, did I understand correctly that you got this working on Taito?
>
> Martin
>
> P.S.: Should we keep infrastructure out of this or is this interesting to Jörg and Björn?
>
> --
> Martin Matthiesen
> CSC - Tieteen tietotekniikan keskus
> CSC - IT Center for Science
> PL 405, 02101 Espoo, Finland
> +358 9 457 2376, martin.matthiesen@csc.fi
> Public key : <a href="https://pgp.mit.edu/pks/lookup?op=get&search=0x74B12876FD890704"> https://pgp.mit.edu/pks/lookup?op=get&search=0x74B12876FD890704</a>
> Fingerprint: AA25 6F56 5C9A 8B42 009F BA70 74B1 2876 FD89 0704
>
> ________________________________
>
> From: "Yves Scherrer" <yves.scherrer@helsinki.fi>
> To: "Stephan Oepen" <oe@ifi.uio.no>
> Cc: "Martin Matthiesen" <martin.matthiesen@csc.fi>, "infrastructure" <infrastructure@nlpl.eu>
> Sent: Tuesday, 18 December, 2018 10:35:25
> Subject: RE: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
>
> Hi,
>
> > could you make the complete data directory group- or world-readable,
> > so i can try running the ‘train.py’ script without creating my own
> > copy of the data?
>
> That should work now.
>
> > thinking (possibly over-)optimistically, maybe the problem has
> > magically disappeared already?
>
> Unfortunately, it hasn’t. Or was I supposed to reinstall the OpenNMT-py module locally?
>
> Yves
</div> </body> </html>