[NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)

Wed Dec 19 13:51:39 UTC 2018

Thanks for looking into this – it really seems like something is going wrong on my side. I’ll have a look at your script and report back.

Yves

________________________________
From: Stephan Oepen <oe at ifi.uio.no>
Sent: Wednesday, December 19, 2018 3:48:05 PM
To: Scherrer, Yves
Cc: Martin Matthiesen; infrastructure
Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)

this is weird: it seems i do not get the error, yves?  i had stripped
down your script to just the training; see

/homeappl/home/oe/onmt.sh

to confirm, i turned on that job once more earlier today (‘sbatch
onmt.sh’), and it appears to be training happily for now.  standard
output and error from that job should be visible to you in my home
directory.
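
A minimal sketch of how a job's standard error could be checked for the failure quoted further down in this thread. It relies only on the message format visible in that traceback ("cuda runtime error (35) : ... at file:line"); the function name find_cuda_errors is hypothetical and not part of OpenNMT-py or PyTorch.

```python
import re

# Matches the PyTorch message format seen in the quoted traceback, e.g.
# "RuntimeError: cuda runtime error (35) : <text> at <file>:<line>"
CUDA_ERROR = re.compile(
    r"cuda runtime error \((?P<code>\d+)\) : (?P<text>.+?)"
    r" at (?P<file>\S+):(?P<line>\d+)"
)


def find_cuda_errors(log_text):
    """Return a list of (code, text) pairs, one per CUDA runtime error found.

    Hypothetical helper for illustration; it only inspects text and does
    not touch the GPU or import torch.
    """
    return [
        (int(match.group("code")), match.group("text"))
        for match in CUDA_ERROR.finditer(log_text)
    ]


# Usage: the error line is copied from the traceback in this thread.
sample = (
    "RuntimeError: cuda runtime error (35) : CUDA driver version is "
    "insufficient for CUDA runtime version at "
    "/pytorch/torch/csrc/cuda/Module.cpp:34\n"
)
for code, text in find_cuda_errors(sample):
    print(code, text)
```

An empty result for one user's error file and a non-empty one for another's would suggest the two jobs ran in different environments rather than that the script itself differs.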

for ultimate comparability, could you also run that same job?

sbatch ~oe/onmt.sh

oe
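
Since the same script trains for one user and fails for another, one way to narrow things down is to record what each job actually sees at run time and compare the two records. The following is only a sketch under stated assumptions: environment_report and diff_reports are hypothetical helper names, the choice of environment variables is a guess at what might differ, and the torch fields are filled in only if torch can be imported in the job's environment.

```python
import os
import socket
import sys


def environment_report():
    """Collect settings that may differ between two users' batch jobs."""
    report = {
        "host": socket.gethostname(),
        "python": sys.executable,
        "python_version": sys.version.split()[0],
        # Variables that influence which CUDA libraries and devices are used
        # (an assumed, not exhaustive, selection).
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH"),
    }
    try:
        import torch  # present only once the relevant module is loaded
    except ImportError:
        report["torch"] = None
    else:
        report["torch"] = torch.__version__
        # CUDA runtime version this torch build was compiled against.
        report["torch_cuda_runtime"] = torch.version.cuda
        # Reports whether CUDA is usable as a boolean.
        report["cuda_available"] = torch.cuda.is_available()
    return report


def diff_reports(mine, theirs):
    """Return {key: (mine, theirs)} for every key whose values differ."""
    keys = sorted(set(mine) | set(theirs))
    return {
        key: (mine.get(key), theirs.get(key))
        for key in keys
        if mine.get(key) != theirs.get(key)
    }


if __name__ == "__main__":
    for key, value in sorted(environment_report().items()):
        print("{}: {}".format(key, value))
```

Running this at the top of the batch script under both accounts and passing the two dictionaries to diff_reports would show, for instance, whether the jobs landed on different nodes or picked up different library paths.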

On Tue, Dec 18, 2018 at 3:15 PM Scherrer, Yves
<yves.scherrer at helsinki.fi> wrote:
>
> Hi,
>
>
>
> My error occurs right away, I don’t even get these INFO messages… This is the full content of the training.*.err file:
>
>
>
> Loading application python-3.5.3 environment with needed modules
> THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=34 error=35 : CUDA driver version is insufficient for CUDA runtime version
> Traceback (most recent call last):
>   File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 40, in <module>
>     main(opt)
>   File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 27, in main
>     single_main(opt)
>   File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 73, in main
>     opt = training_opt_postprocessing(opt)
>   File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 60, in training_opt_postprocessing
>     torch.cuda.set_device(opt.device_id)
>   File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torch/cuda/__init__.py", line 264, in set_device
>     torch._C._cuda_setDevice(device)
> RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at /pytorch/torch/csrc/cuda/Module.cpp:34
> Exception ignored in: <function WeakValueDictionary.__init__.<locals>.remove at 0x7ff93231b400>
> Traceback (most recent call last):
>   File "/wrk/project_nlpl/software/opennmt-py/0.2.1/lib/python3.5/weakref.py", line 117, in remove
> TypeError: 'NoneType' object is not callable
> srun: error: g110: task 0: Exited with exit code 1
> srun: Terminating job step 33310480.0
>
>
>
>
>
> ________________________________
> From: Stephan Oepen <oe at ifi.uio.no>
> Sent: Tuesday, December 18, 2018 2:56:49 PM
> To: Martin Matthiesen
> Cc: Scherrer, Yves; infrastructure
> Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
>
> thanks for adjusting those permissions, yves!
>
> roughly how long into the job would you expect the error to occur?
>
> i have been running for around six minutes so far, and training
> appears to get going:
>
> [2018-12-18 14:47:43,683 INFO] encoder: 14116000
> [2018-12-18 14:47:43,683 INFO] decoder: 25862084
> [2018-12-18 14:47:43,683 INFO] * number of parameters: 39978084
> /proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torch/nn/_reduction.py:49:
> UserWarning: size_average and reduce args will be deprecated, please
> use reduction='sum' instead.
>   warnings.warn(warning.format(ret))
> [2018-12-18 14:47:43,685 INFO] Start training...
> [2018-12-18 14:47:43,707 INFO] Loading train dataset from
> data.train.1.pt, number of examples: 1030
> /proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torch/nn/functional.py:1320:
> UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead.
>   warnings.warn("nn.functional.tanh is deprecated. Use torch.tanh instead.")
> [2018-12-18 14:49:15,649 INFO] Loading train dataset from
> data.train.10.pt, number of examples: 1162
> [2018-12-18 14:50:55,474 INFO] Loading train dataset from
> data.train.100.pt, number of examples: 1199
> [2018-12-18 14:52:13,191 INFO] Step 50/100000; acc:   5.83; ppl:
> 5884.51; xent: 8.68; lr: 1.00000; 272/262 tok/s;    269 sec
> [2018-12-18 14:52:38,496 INFO] Loading train dataset from
> data.train.1000.pt, number of examples: 1216
>
> but earlier you had sent a traceback involving a function called
> training_opt_postprocessing() ... so maybe the error only occurs
> towards the end of training?  which would seem pretty weird, seeing as
> i suppose PyTorch has been used extensively up to that point already?
>
> oe
>
>
>
> On Tue, Dec 18, 2018 at 10:11 AM Martin Matthiesen
> <martin.matthiesen at csc.fi> wrote:
> >
> > Hi,
> >
> > I did try for an hour and a bit yesterday to pinpoint the problem, but could not make head or tail of it. Did I understand correctly that Stephan, you got this working on Taito?
> >
> > Martin
> >
> > P.S.: Should we keep infrastructure out of this or is this interesting to Jörg and Björn?
> >
> > --
> > Martin Matthiesen
> > CSC - Tieteen tietotekniikan keskus
> > CSC - IT Center for Science
> > PL 405, 02101 Espoo, Finland
> > +358 9 457 2376, martin.matthiesen at csc.fi
> > Public key : https://pgp.mit.edu/pks/lookup?op=get&search=0x74B12876FD890704
> > Fingerprint: AA25 6F56 5C9A 8B42 009F  BA70 74B1 2876 FD89 0704
> >
> > ________________________________
> >
> > From: "Yves Scherrer" <yves.scherrer at helsinki.fi>
> > To: "Stephan Oepen" <oe at ifi.uio.no>
> > Cc: "Martin Matthiesen" <martin.matthiesen at csc.fi>, "infrastructure" <infrastructure at nlpl.eu>
> > Sent: Tuesday, 18 December, 2018 10:35:25
> > Subject: RE: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
> >
> > Hi,
> >
> >
> >
> > > could you make the complete data directory group- or world-readable,
> > > so i can try running the ‘train.py’ script without creating my own
> > > copy of the data?
> >
> >
> >
> > That should work now.
> >
> >
> >
> > > thinking (possibly over-)optimistically, maybe the problem has
> > > magically disappeared already?
> >
> >
> >
> > Unfortunately, it hasn’t. Or was I supposed to reinstall the OpenNMT-py module locally?
> >
> >
> >
> > Yves
> >
> >
> >
> >