[NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)

Fri Feb 1 17:19:18 UTC 2019

Hi Yves, 

Sorry for not replying earlier, and thanks for the update. Your plan sounds good. Maybe you have a local library somewhere in your home directory that gets loaded first?
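
One way to check for such a library is sketched below. This is only a rough sketch under assumptions that are not from this thread: that the library would be picked up via LD_LIBRARY_PATH, and that CUDA-related libraries have names starting with "libcu". The function name find_home_libs is made up for illustration. It lists matching shared libraries found in search-path directories that lie inside the home directory, in search order.

```python
import os
from pathlib import Path


def find_home_libs(ld_library_path, home, prefixes=("libcu",)):
    """Return shared libraries under `home` that are on the given search path.

    `prefixes` must be a tuple of file-name prefixes (hypothetical default:
    anything starting with "libcu"). Hits are returned in search order, so
    an early hit is a candidate for shadowing a system-wide library.
    """
    home = Path(home).resolve()
    hits = []
    for entry in ld_library_path.split(":"):
        if not entry:
            # Empty entries are skipped in this sketch.
            continue
        directory = Path(entry).resolve()
        # Only directories at or below the home directory are of interest.
        if directory != home and home not in directory.parents:
            continue
        if not directory.is_dir():
            continue
        for lib in sorted(directory.iterdir()):
            if lib.name.startswith(prefixes) and ".so" in lib.name:
                hits.append(lib)
    return hits


if __name__ == "__main__":
    search_path = os.environ.get("LD_LIBRARY_PATH", "")
    for lib in find_home_libs(search_path, Path.home()):
        print(lib)
```

Run inside the batch job itself (not on the login node), since the environment there may differ. An empty result does not rule out other causes, e.g. a setting in a hidden configuration file.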

Martin 

-- 
Martin Matthiesen 
CSC - Tieteen tietotekniikan keskus 
CSC - IT Center for Science 
PL 405, 02101 Espoo, Finland 
+358 9 457 2376, martin.matthiesen at csc.fi 
Public key : https://pgp.mit.edu/pks/lookup?op=get&search=0x74B12876FD890704 
Fingerprint: AA25 6F56 5C9A 8B42 009F BA70 74B1 2876 FD89 0704 

> From: "Yves Scherrer" <yves.scherrer at helsinki.fi>
> To: "Martin Matthiesen" <martin.matthiesen at csc.fi>
> Cc: "Stephan Oepen" <oe at ifi.uio.no>, "infrastructure" <infrastructure at nlpl.eu>
> Sent: Wednesday, 30 January, 2019 09:41:17
> Subject: RE: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)

> Hi,

> Just a quick update on this issue – I’m unfortunately running into it time and
> again… It all seems very inconsistent – the same scripts that worked in
> December don’t work anymore with another dataset now.

> The OpenNMT pipeline consists of three steps, data preprocessing, model training
> and translating. Sometimes the CUDA version error appears at the beginning of
> training, sometimes at the beginning of translating. It looks so far that
> putting everything into the same script (and thus forcing the three steps to be
> executed on the same GPU node) relieves the issue somewhat, but this kind of
> defeats the purpose of pretrained models…

> I’ll have my scripts tested by a colleague here, as it could be that my CSC
> account is somehow corrupted (I remember that Jörg could run some scripts fine
> while I got errors with them). I’ve also tried to clean my home directory from
> hidden configuration settings, but I might give that another shot when Taito is
> back running…

> @Stephan: Thanks for the update on winter school activities – I will prepare a
> quick walk-through of the MT activities.

> Best,

> Yves

> From: Scherrer, Yves
> Sent: Friday, December 21, 2018 12:49:39 PM
> To: Martin Matthiesen
> Cc: Stephan Oepen; infrastructure
> Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
> Hi,

> In my tests, removing srun indeed resolves the issue. I will update my scripts
> accordingly, and from my point of view we can “close” this discussion, although
> the underlying reasons for this changing behavior of srun are still a bit
> unclear…

> Thanks for your help anyway!
> Yves

>> On 20 Dec 2018, at 10:05, Martin Matthiesen <martin.matthiesen at csc.fi> wrote:

>> Hi again,

>> Sorry, I accidentally hit send too early.

>> So my suspicion is that some environment setting is now slightly different from
>> what it used to be, and that this affects srun. Does removing srun from the
>> script resolve the issue, or is it only a workaround?

>> Martin

>>> From: "Martin Matthiesen" <martin.matthiesen at csc.fi>
>>> To: "Yves Scherrer" <yves.scherrer at helsinki.fi>
>>> Cc: "Stephan Oepen" <oe at ifi.uio.no>, "infrastructure" <infrastructure at nlpl.eu>
>>> Sent: Thursday, 20 December, 2018 11:02:00
>>> Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)

>>> Hello Yves and all,

>>> Here's a summary of the basic differences between srun and sbatch:

>>> https://stackoverflow.com/questions/43767866/slurm-srun-vs-sbatch-and-their-parameters


>>>> From: "Yves Scherrer" <yves.scherrer at helsinki.fi>
>>>> To: "Stephan Oepen" <oe at ifi.uio.no>
>>>> Cc: "Martin Matthiesen" <martin.matthiesen at csc.fi>, "infrastructure" <infrastructure at nlpl.eu>
>>>> Sent: Wednesday, 19 December, 2018 17:47:49
>>>> Subject: RE: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)

>>>> It looks like the “srun” (present in my script, absent in Stephan’s) was the
>>>> culprit. I still have to say that I haven’t completely grasped its use – back
>>>> in Theano times, it was compulsory (at least for me, but Jörg was able to run
>>>> the same jobs without it), now it seems that it must be avoided…

>>>> Yves

>>>> From: Stephan Oepen <oe at ifi.uio.no>
>>>> Sent: Wednesday, December 19, 2018 3:48:05 PM
>>>> To: Scherrer, Yves
>>>> Cc: Martin Matthiesen; infrastructure
>>>> Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
>>>> this is weird: as if i did not get the error, yves? i had stripped
>>>> down your script to just the training; see

>>>> /homeappl/home/oe/onmt.sh

>>>> to confirm, i turned on that job once more earlier today (‘sbatch
>>>> onmt.sh’), and it appears to be training happily for now. standard
>>>> output and error from that job should be visible to you in my home
>>>> directory.

>>>> for ultimate comparability, could you also run that job

>>>> sbatch ~oe/onmt.sh

>>>> oe

>>>> On Tue, Dec 18, 2018 at 3:15 PM Scherrer, Yves
>>>> <yves.scherrer at helsinki.fi> wrote:

>>>> > Hi,

>>>> > My error occurs right away, I don’t even get these INFO messages… This is the
>>>> > full content of the training.*.err file:

>>>> > Loading application python-3.5.3 environment with needed modules
>>>> > THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=34 error=35 : CUDA driver version is insufficient for CUDA runtime version
>>>> > Traceback (most recent call last):
>>>> >   File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 40, in <module>
>>>> >     main(opt)
>>>> >   File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 27, in main
>>>> >     single_main(opt)
>>>> >   File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 73, in main
>>>> >     opt = training_opt_postprocessing(opt)
>>>> >   File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 60, in training_opt_postprocessing
>>>> >     torch.cuda.set_device(opt.device_id)
>>>> >   File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torch/cuda/__init__.py", line 264, in set_device
>>>> >     torch._C._cuda_setDevice(device)
>>>> > RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at /pytorch/torch/csrc/cuda/Module.cpp:34
>>>> > Exception ignored in: <function WeakValueDictionary.__init__.<locals>.remove at 0x7ff93231b400>
>>>> > Traceback (most recent call last):
>>>> >   File "/wrk/project_nlpl/software/opennmt-py/0.2.1/lib/python3.5/weakref.py", line 117, in remove
>>>> > TypeError: 'NoneType' object is not callable
>>>> > srun: error: g110: task 0: Exited with exit code 1
>>>> > srun: Terminating job step 33310480.0

>>>> > ________________________________
>>>> > From: Stephan Oepen <oe at ifi.uio.no>
>>>> > Sent: Tuesday, December 18, 2018 2:56:49 PM
>>>> > To: Martin Matthiesen
>>>> > Cc: Scherrer, Yves; infrastructure
>>>> > Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)

>>>> > thanks for adjusting those permissions, yves!

>>>> > roughly how long into the job would you expect the error to occur?

>>>> > i have been running for around six minutes so far, and training
>>>> > appears to get going:

>>>> > [2018-12-18 14:47:43,683 INFO] encoder: 14116000
>>>> > [2018-12-18 14:47:43,683 INFO] decoder: 25862084
>>>> > [2018-12-18 14:47:43,683 INFO] * number of parameters: 39978084
>>>> > /proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torch/nn/_reduction.py:49:
>>>> > UserWarning: size_average and reduce args will be deprecated, please
>>>> > use reduction='sum' instead.
>>>> > warnings.warn(warning.format(ret))
>>>> > [2018-12-18 14:47:43,685 INFO] Start training...
>>>> > [2018-12-18 14:47:43,707 INFO] Loading train dataset from
>>>> > data.train.1.pt, number of examples: 1030
>>>> > /proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torch/nn/functional.py:1320:
>>>> > UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead.
>>>> > warnings.warn("nn.functional.tanh is deprecated. Use torch.tanh instead.")
>>>> > [2018-12-18 14:49:15,649 INFO] Loading train dataset from
>>>> > data.train.10.pt, number of examples: 1162
>>>> > [2018-12-18 14:50:55,474 INFO] Loading train dataset from
>>>> > data.train.100.pt, number of examples: 1199
>>>> > [2018-12-18 14:52:13,191 INFO] Step 50/100000; acc: 5.83; ppl:
>>>> > 5884.51; xent: 8.68; lr: 1.00000; 272/262 tok/s; 269 sec
>>>> > [2018-12-18 14:52:38,496 INFO] Loading train dataset from
>>>> > data.train.1000.pt, number of examples: 1216

>>>> > but earlier you had sent a traceback involving a function called
>>>> > training_opt_postprocessing() ... so maybe the error only occurs
>>>> > towards the end of training? which would seem pretty weird, seeing as
>>>> > i suppose PyTorch has been used extensively up to that point already?

>>>> > oe

>>>> > On Tue, Dec 18, 2018 at 10:11 AM Martin Matthiesen
>>>> > <martin.matthiesen at csc.fi> wrote:

>>>> > > Hi,

>>>> > > I did try for an hour and a bit yesterday to pinpoint the problem, but could not
>>>> > > make head or tail of it. Did I understand correctly that Stephan, you got this
>>>> > > working on Taito?

>>>> > > Martin

>>>> > > P.S.: Should we keep infrastructure out of this or is this interesting to Jörg
>>>> > > and Björn?


>>>> > > ________________________________

>>>> > > From: "Yves Scherrer" <yves.scherrer at helsinki.fi>
>>>> > > To: "Stephan Oepen" <oe at ifi.uio.no>
>>>> > > Cc: "Martin Matthiesen" <martin.matthiesen at csc.fi>, "infrastructure" <infrastructure at nlpl.eu>
>>>> > > Sent: Tuesday, 18 December, 2018 10:35:25
>>>> > > Subject: RE: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)

>>>> > > Hi,

>>>> > > > could you make the complete data directory group- or world-readable,
>>>> > > > so i can try running the ‘train.py’ script without creating my own
>>>> > > > copy of the data?

>>>> > > That should work now.

>>>> > > > thinking (possibly over-)optimistically, maybe the problem has
>>>> > > > magically disappeared already?

>>>> > > Unfortunately, it hasn’t. Or was I supposed to reinstall the OpenNMT-py module
>>>> > > locally?

>>>> > > Yves
