[NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
Martin Matthiesen
martin.matthiesen at csc.fi
Thu Dec 20 09:02:00 UTC 2018
Hello Yves and all,
Here's a summary of the basic differences between srun and sbatch:
https://stackoverflow.com/questions/43767866/slurm-srun-vs-sbatch-and-their-parameters
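In short, `sbatch` queues a whole script as a batch job, while `srun` launches a job step (or an interactive run) directly. A minimal sketch of the two patterns — the script name, resource values, and trainer arguments below are placeholders, not the actual Abel or onmt.sh setup:

```shell
#!/bin/bash
# onmt-example.sh -- submitted with `sbatch onmt-example.sh`.
# sbatch queues this whole script; the commands below then run
# on the node(s) Slurm allocates to it.
#SBATCH --job-name=onmt-train
#SBATCH --time=04:00:00
#SBATCH --mem=8G
#SBATCH --gres=gpu:1

# Running the trainer directly, as in Stephan's stripped-down script:
python train.py -data data -save_model model

# Prefixing the command with srun would instead launch it as a
# separate job step inside the allocation:
#   srun python train.py -data data -save_model model
# srun can also be used on its own, without any script, for an
# interactive run that blocks until resources are granted:
#   srun --gres=gpu:1 --time=01:00:00 python train.py ...
```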
--
Martin Matthiesen
CSC - Tieteen tietotekniikan keskus
CSC - IT Center for Science
PL 405, 02101 Espoo, Finland
+358 9 457 2376, martin.matthiesen at csc.fi
Public key : https://pgp.mit.edu/pks/lookup?op=get&search=0x74B12876FD890704
Fingerprint: AA25 6F56 5C9A 8B42 009F BA70 74B1 2876 FD89 0704
> From: "Yves Scherrer" <yves.scherrer at helsinki.fi>
> To: "Stephan Oepen" <oe at ifi.uio.no>
> Cc: "Martin Matthiesen" <martin.matthiesen at csc.fi>, "infrastructure"
> <infrastructure at nlpl.eu>
> Sent: Wednesday, 19 December, 2018 17:47:49
> Subject: RE: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
> It looks like “srun” (present in my script, absent in Stephan’s) was the
> culprit. I have to say that I still haven’t completely grasped its use – back
> in Theano times, it was compulsory (at least for me, though Jörg was able to run
> the same jobs without it), and now it seems that it must be avoided…
> Yves
> From: Stephan Oepen <oe at ifi.uio.no>
> Sent: Wednesday, December 19, 2018 3:48:05 PM
> To: Scherrer, Yves
> Cc: Martin Matthiesen; infrastructure
> Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
> this is weird: it seems i did not get the error, yves. i had stripped
> down your script to just the training; see
> /homeappl/home/oe/onmt.sh
> to confirm, i submitted that job once more earlier today (‘sbatch
> onmt.sh’), and it appears to be training happily for now. standard
> output and error from that job should be visible to you in my home
> directory.
> for ultimate comparability, could you also run that job:
> sbatch ~oe/onmt.sh
> oe
> On Tue, Dec 18, 2018 at 3:15 PM Scherrer, Yves
> <yves.scherrer at helsinki.fi> wrote:
> > Hi,
> > My error occurs right away, I don’t even get these INFO messages… This is the
> > full content of the training.*.err file:
> > Loading application python-3.5.3 environment with needed modules
> > THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=34 error=35 :
> > CUDA driver version is insufficient for CUDA runtime version
> > Traceback (most recent call last):
> >   File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 40, in <module>
> >     main(opt)
> >   File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 27, in main
> >     single_main(opt)
> >   File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 73, in main
> >     opt = training_opt_postprocessing(opt)
> >   File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 60, in training_opt_postprocessing
> >     torch.cuda.set_device(opt.device_id)
> >   File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torch/cuda/__init__.py", line 264, in set_device
> >     torch._C._cuda_setDevice(device)
> > RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for
> > CUDA runtime version at /pytorch/torch/csrc/cuda/Module.cpp:34
> > Exception ignored in: <function WeakValueDictionary.__init__.<locals>.remove at 0x7ff93231b400>
> > Traceback (most recent call last):
> >   File "/wrk/project_nlpl/software/opennmt-py/0.2.1/lib/python3.5/weakref.py", line 117, in remove
> > TypeError: 'NoneType' object is not callable
> > srun: error: g110: task 0: Exited with exit code 1
> > srun: Terminating job step 33310480.0
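CUDA error 35 means the NVIDIA kernel driver installed on the node is older than what the CUDA runtime bundled with PyTorch requires. A small sketch of that compatibility check — the minimum-driver numbers come from NVIDIA’s published Linux driver/runtime table, while the example version strings are hypothetical, not readings from Abel’s g110 node:

```python
# Each CUDA runtime release requires a minimum NVIDIA driver version
# (values from NVIDIA's Linux driver/runtime compatibility table).
MIN_DRIVER = {
    "8.0": (375, 26),
    "9.0": (384, 81),
    "9.2": (396, 26),
}

def driver_sufficient(runtime: str, driver: str) -> bool:
    """Return True if the installed driver can serve the given CUDA runtime."""
    installed = tuple(int(part) for part in driver.split("."))
    return installed >= MIN_DRIVER[runtime]

# A driver from the 367.x series is too old for a CUDA 9.0 runtime --
# exactly the "error 35" situation in the traceback above.
print(driver_sufficient("9.0", "367.48"))  # False
print(driver_sufficient("9.0", "390.30"))  # True
```

Since `srun` placed the task on node g110, one plausible explanation is simply that the step landed on a node with an older driver than the node the plain `sbatch` run used.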
> > ________________________________
> > From: Stephan Oepen <oe at ifi.uio.no>
> > Sent: Tuesday, December 18, 2018 2:56:49 PM
> > To: Martin Matthiesen
> > Cc: Scherrer, Yves; infrastructure
> > Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
> > thanks for adjusting those permissions, yves!
> > roughly how long into the job would you expect the error to occur?
> > i have been running for around six minutes so far, and training
> > appears to get going:
> > [2018-12-18 14:47:43,683 INFO] encoder: 14116000
> > [2018-12-18 14:47:43,683 INFO] decoder: 25862084
> > [2018-12-18 14:47:43,683 INFO] * number of parameters: 39978084
> > /proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torch/nn/_reduction.py:49:
> > UserWarning: size_average and reduce args will be deprecated, please
> > use reduction='sum' instead.
> > warnings.warn(warning.format(ret))
> > [2018-12-18 14:47:43,685 INFO] Start training...
> > [2018-12-18 14:47:43,707 INFO] Loading train dataset from
> > data.train.1.pt, number of examples: 1030
> > /proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torch/nn/functional.py:1320:
> > UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead.
> > warnings.warn("nn.functional.tanh is deprecated. Use torch.tanh instead.")
> > [2018-12-18 14:49:15,649 INFO] Loading train dataset from
> > data.train.10.pt, number of examples: 1162
> > [2018-12-18 14:50:55,474 INFO] Loading train dataset from
> > data.train.100.pt, number of examples: 1199
> > [2018-12-18 14:52:13,191 INFO] Step 50/100000; acc: 5.83; ppl:
> > 5884.51; xent: 8.68; lr: 1.00000; 272/262 tok/s; 269 sec
> > [2018-12-18 14:52:38,496 INFO] Loading train dataset from
> > data.train.1000.pt, number of examples: 1216
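As an aside, the `ppl` and `xent` columns in that log line are directly related: perplexity is the exponential of the per-token cross-entropy. A quick check against the Step 50 values:

```python
import math

# From the training log: Step 50/100000; acc: 5.83; ppl: 5884.51; xent: 8.68
xent = 8.68           # per-token cross-entropy, in nats
ppl = math.exp(xent)  # perplexity = exp(cross-entropy)
print(round(ppl, 2))  # matches the logged 5884.51 up to rounding of xent
```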
> > but earlier you had sent a traceback involving a function called
> > training_opt_postprocessing() ... so maybe the error only occurs
> > towards the end of training? which would seem pretty weird, seeing as
> > i suppose PyTorch has been used extensively up to that point already?
> > oe
> > On Tue, Dec 18, 2018 at 10:11 AM Martin Matthiesen
> > <martin.matthiesen at csc.fi> wrote:
> > > Hi,
> > > I did try for an hour and a bit yesterday to pinpoint the problem, but could
> > > not make head or tail of it. Did I understand correctly that you, Stephan,
> > > got this working on Taito?
> > > Martin
> > > P.S.: Should we keep infrastructure out of this, or is this interesting to
> > > Jörg and Björn?
> > > --
> > > Martin Matthiesen
> > > CSC - Tieteen tietotekniikan keskus
> > > CSC - IT Center for Science
> > > PL 405, 02101 Espoo, Finland
> > > +358 9 457 2376, martin.matthiesen at csc.fi
> > > Public key : https://pgp.mit.edu/pks/lookup?op=get&search=0x74B12876FD890704
> > > Fingerprint: AA25 6F56 5C9A 8B42 009F BA70 74B1 2876 FD89 0704
> > > ________________________________
> > > From: "Yves Scherrer" <yves.scherrer at helsinki.fi>
> > > To: "Stephan Oepen" <oe at ifi.uio.no>
> > > Cc: "Martin Matthiesen" <martin.matthiesen at csc.fi>, "infrastructure" <infrastructure at nlpl.eu>
> > > Sent: Tuesday, 18 December, 2018 10:35:25
> > > Subject: RE: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
> > > Hi,
> > > > could you make the complete data directory group- or world-readable,
> > > > so i can try running the ‘train.py’ script without creating my own
> > > > copy of the data?
> > > That should work now.
> > > > thinking (possibly over-)optimistically, maybe the problem has
> > > > magically disappeared already?
> > > Unfortunately, it hasn’t. Or was I supposed to reinstall the OpenNMT-py module locally?
> > > Yves