[NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)

Martin Matthiesen martin.matthiesen at csc.fi
Thu Dec 20 09:05:11 UTC 2018


Hi again, 

Sorry, I accidentally hit send too early. 

So my suspicion is that some environment setting is now slightly different from what it used to be, and that this affects srun. Does removing srun from the script actually resolve the issue, or is it only a workaround?
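
One way to check that suspicion is to dump the environment twice inside the same batch job, once directly and once through srun, and compare the two dumps. The following is only a minimal sketch: the function names and the dump file names are hypothetical, and it assumes one NAME=value pair per line, so multi-line values such as exported shell functions would be misparsed.

```python
import sys


def parse_env(dump: str) -> dict:
    """Parse `env` output (one NAME=value per line) into a dict."""
    env = {}
    for line in dump.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:
            env[name] = value
    return env


def diff_env(direct: dict, via_srun: dict) -> dict:
    """Return {name: (direct_value, srun_value)} for each variable that differs.

    A value of None means the variable is missing on that side.
    """
    names = sorted(set(direct) | set(via_srun))
    return {
        name: (direct.get(name), via_srun.get(name))
        for name in names
        if direct.get(name) != via_srun.get(name)
    }


# Optional command-line use: python envdiff.py env.direct env.srun
# (script and file names are placeholders, not files from this thread)
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as f_direct, open(sys.argv[2]) as f_srun:
        changes = diff_env(parse_env(f_direct.read()), parse_env(f_srun.read()))
    for name, (direct_value, srun_value) in changes.items():
        print(f"{name}: direct={direct_value!r} srun={srun_value!r}")
```

The two dumps could be produced in the batch script with `env > env.direct` and `srun env > env.srun`. Variables that srun itself sets for each job step will show up as differences, so the interesting entries are the ones related to modules, library paths and GPUs.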

Martin 

> From: "Martin Matthiesen" <martin.matthiesen at csc.fi>
> To: "Yves Scherrer" <yves.scherrer at helsinki.fi>
> Cc: "Stephan Oepen" <oe at ifi.uio.no>, "infrastructure" <infrastructure at nlpl.eu>
> Sent: Thursday, 20 December, 2018 11:02:00
> Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)

> Hello Yves and all,

> Here's a summary of the basic differences between srun and sbatch:

> https://stackoverflow.com/questions/43767866/slurm-srun-vs-sbatch-and-their-parameters

> --
> Martin Matthiesen
> CSC - Tieteen tietotekniikan keskus
> CSC - IT Center for Science
> PL 405, 02101 Espoo, Finland
> +358 9 457 2376, martin.matthiesen at csc.fi
> Public key : https://pgp.mit.edu/pks/lookup?op=get&search=0x74B12876FD890704
> Fingerprint: AA25 6F56 5C9A 8B42 009F BA70 74B1 2876 FD89 0704

>> From: "Yves Scherrer" <yves.scherrer at helsinki.fi>
>> To: "Stephan Oepen" <oe at ifi.uio.no>
>> Cc: "Martin Matthiesen" <martin.matthiesen at csc.fi>, "infrastructure"
>> <infrastructure at nlpl.eu>
>> Sent: Wednesday, 19 December, 2018 17:47:49
>> Subject: RE: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)

>> It looks like the “srun” (present in my script, absent in Stephan’s) was the
>> culprit. I still have to say that I haven’t completely grasped its use – back
>> in Theano times, it was compulsory (at least for me, but Jörg was able to run
>> the same jobs without it), now it seems that it must be avoided…

>> Yves

>> From: Stephan Oepen <oe at ifi.uio.no>
>> Sent: Wednesday, December 19, 2018 3:48:05 PM
>> To: Scherrer, Yves
>> Cc: Martin Matthiesen; infrastructure
>> Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)

>> this is weird: as if i did not get the error, yves? i had stripped
>> down your script to just the training; see

>> /homeappl/home/oe/onmt.sh

>> to confirm, i turned on that job once more earlier today (‘sbatch
>> onmt.sh’), and it appears to be training happily for now. standard
>> output and error from that job should be visible to you in my home
>> directory.

>> for ultimate comparability, could you also run that job

>> sbatch ~oe/onmt.sh

>> oe

>> On Tue, Dec 18, 2018 at 3:15 PM Scherrer, Yves
>> <yves.scherrer at helsinki.fi> wrote:

>> > Hi,



>> > My error occurs right away, I don’t even get these INFO messages… This is the
>> > full content of the training.*.err file:

>> > Loading application python-3.5.3 environment with needed modules
>> > THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=34 error=35 : CUDA driver version is insufficient for CUDA runtime version
>> > Traceback (most recent call last):
>> >   File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 40, in <module>
>> >     main(opt)
>> >   File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 27, in main
>> >     single_main(opt)
>> >   File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 73, in main
>> >     opt = training_opt_postprocessing(opt)
>> >   File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 60, in training_opt_postprocessing
>> >     torch.cuda.set_device(opt.device_id)
>> >   File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torch/cuda/__init__.py", line 264, in set_device
>> >     torch._C._cuda_setDevice(device)
>> > RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at /pytorch/torch/csrc/cuda/Module.cpp:34
>> > Exception ignored in: <function WeakValueDictionary.__init__.<locals>.remove at 0x7ff93231b400>
>> > Traceback (most recent call last):
>> >   File "/wrk/project_nlpl/software/opennmt-py/0.2.1/lib/python3.5/weakref.py", line 117, in remove
>> > TypeError: 'NoneType' object is not callable
>> > srun: error: g110: task 0: Exited with exit code 1
>> > srun: Terminating job step 33310480.0
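
The traceback above fails at torch.cuda.set_device(), which suggests the process could not reach a usable CUDA driver at all. To see what a process is actually given, a small report like the following could be run once directly in the batch script and once under srun. This is a sketch under assumptions: the function name gpu_report and the choice of fields are illustrative only, and the torch fields stay None if torch cannot be imported.

```python
import os
import shutil


def gpu_report() -> dict:
    """Collect a few facts about the GPU setup visible to this process."""
    report = {
        # Which GPUs (if any) the scheduler exposed to this process.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        # Library search path, which can decide which CUDA libraries get loaded.
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH"),
        # Path to nvidia-smi, or None if it is not on PATH.
        "nvidia_smi": shutil.which("nvidia-smi"),
        "torch_cuda_runtime": None,
        "torch_cuda_available": None,
    }
    try:
        import torch
    except ImportError:
        return report
    # CUDA runtime version this PyTorch build was compiled against.
    report["torch_cuda_runtime"] = torch.version.cuda
    # Whether PyTorch considers CUDA usable in this process.
    report["torch_cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If the two runs differ, for example in CUDA_VISIBLE_DEVICES or in whether CUDA is reported as available, that would point at the job step environment rather than at the OpenNMT-py installation.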





>> > ________________________________
>> > From: Stephan Oepen <oe at ifi.uio.no>
>> > Sent: Tuesday, December 18, 2018 2:56:49 PM
>> > To: Martin Matthiesen
>> > Cc: Scherrer, Yves; infrastructure
>> > Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)

>> > thanks for adjusting those permissions, yves!

>> > roughly how long into the job would you expect the error to occur?

>> > i have been running for around six minutes so far, and training
>> > appears to get going:

>> > [2018-12-18 14:47:43,683 INFO] encoder: 14116000
>> > [2018-12-18 14:47:43,683 INFO] decoder: 25862084
>> > [2018-12-18 14:47:43,683 INFO] * number of parameters: 39978084
>> > /proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torch/nn/_reduction.py:49:
>> > UserWarning: size_average and reduce args will be deprecated, please
>> > use reduction='sum' instead.
>> > warnings.warn(warning.format(ret))
>> > [2018-12-18 14:47:43,685 INFO] Start training...
>> > [2018-12-18 14:47:43,707 INFO] Loading train dataset from
>> > data.train.1.pt, number of examples: 1030
>> > /proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torch/nn/functional.py:1320:
>> > UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead.
>> > warnings.warn("nn.functional.tanh is deprecated. Use torch.tanh instead.")
>> > [2018-12-18 14:49:15,649 INFO] Loading train dataset from
>> > data.train.10.pt, number of examples: 1162
>> > [2018-12-18 14:50:55,474 INFO] Loading train dataset from
>> > data.train.100.pt, number of examples: 1199
>> > [2018-12-18 14:52:13,191 INFO] Step 50/100000; acc: 5.83; ppl:
>> > 5884.51; xent: 8.68; lr: 1.00000; 272/262 tok/s; 269 sec
>> > [2018-12-18 14:52:38,496 INFO] Loading train dataset from
>> > data.train.1000.pt, number of examples: 1216

>> > but earlier you had sent a traceback involving a function called
>> > training_opt_postprocessing() ... so maybe the error only occurs
>> > towards the end of training? which would seem pretty weird, seeing as
>> > i suppose PyTorch has been used extensively up to that point already?

>> > oe



>> > On Tue, Dec 18, 2018 at 10:11 AM Martin Matthiesen
>> > <martin.matthiesen at csc.fi> wrote:

>> > > Hi,

>> > > I did try for an hour and a bit yesterday to pinpoint the problem, but could not
>> > > make head or tail of it. Did I understand correctly that Stephan, you got this
>> > > working on Taito?

>> > > Martin

>> > > P.S.: Should we keep infrastructure out of this or is this interesting to Jörg
>> > > and Björn?

>> > > --
>> > > Martin Matthiesen
>> > > CSC - Tieteen tietotekniikan keskus
>> > > CSC - IT Center for Science
>> > > PL 405, 02101 Espoo, Finland
>> > > +358 9 457 2376, martin.matthiesen at csc.fi
>> > > Public key : https://pgp.mit.edu/pks/lookup?op=get&search=0x74B12876FD890704
>> > > Fingerprint: AA25 6F56 5C9A 8B42 009F BA70 74B1 2876 FD89 0704

>> > > ________________________________

>> > > From: "Yves Scherrer" <yves.scherrer at helsinki.fi>
>> > > To: "Stephan Oepen" <oe at ifi.uio.no>
>> > > Cc: "Martin Matthiesen" <martin.matthiesen at csc.fi>, "infrastructure"
>> > > <infrastructure at nlpl.eu>
>> > > Sent: Tuesday, 18 December, 2018 10:35:25
>> > > Subject: RE: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)

>> > > Hi,



>> > > > could you make the complete data directory group- or world-readable,
>> > > > so i can try running the ‘train.py’ script without creating my own
>> > > > copy of the data?



>> > > That should work now.



>> > > > thinking (possibly over-)optimistically, maybe the problem has
>> > > > magically disappeared already?



>> > > Unfortunately, it hasn’t. Or was I supposed to reinstall the OpenNMT-py module
>> > > locally?



>> > > Yves



