[NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
Martin Matthiesen
martin.matthiesen at csc.fi
Fri Feb 1 17:19:18 UTC 2019
Hi Yves,
Sorry for not replying earlier, and thanks for the update. Your plan sounds good. Maybe you have a local library somewhere in your home dir that gets loaded first?
Martin
--
Martin Matthiesen
CSC - Tieteen tietotekniikan keskus
CSC - IT Center for Science
PL 405, 02101 Espoo, Finland
+358 9 457 2376, martin.matthiesen at csc.fi
Public key : https://pgp.mit.edu/pks/lookup?op=get&search=0x74B12876FD890704
Fingerprint: AA25 6F56 5C9A 8B42 009F BA70 74B1 2876 FD89 0704
> From: "Yves Scherrer" <yves.scherrer at helsinki.fi>
> To: "Martin Matthiesen" <martin.matthiesen at csc.fi>
> Cc: "Stephan Oepen" <oe at ifi.uio.no>, "infrastructure" <infrastructure at nlpl.eu>
> Sent: Wednesday, 30 January, 2019 09:41:17
> Subject: RE: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
> Hi,
> Just a quick update on this issue – I’m unfortunately running into it time and
> again… It all seems very inconsistent – the same scripts that worked in
> December don’t work anymore with another dataset now.
> The OpenNMT pipeline consists of three steps, data preprocessing, model training
> and translating. Sometimes the CUDA version error appears at the beginning of
> training, sometimes at the beginning of translating. So far it looks as though
> putting everything into the same script (and thus forcing the three steps to be
> executed on the same GPU node) alleviates the issue somewhat, but this kind of
> defeats the purpose of pretrained models…
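As a rough sketch, the workaround Yves describes (running all three steps in one job so they share a GPU node) could look like the following batch script. The module name, data paths, and options are illustrative placeholders, not the actual NLPL setup, and option names vary across OpenNMT-py versions:

```shell
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00

# Hypothetical module name; the real NLPL module may differ.
module load nlpl-opennmt-py/0.2.1

# Step 1: preprocessing, kept in the same job so that all three
# steps run on the same node (and thus the same driver/runtime).
python preprocess.py -train_src data/train.src -train_tgt data/train.tgt \
  -valid_src data/valid.src -valid_tgt data/valid.tgt -save_data data/demo

# Step 2: training on the GPU allocated to this job.
python train.py -data data/demo -save_model demo-model -gpuid 0

# Step 3: translating with the freshly trained model.
python translate.py -model demo-model.pt -src data/test.src \
  -output pred.txt -gpu 0
```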
> I’ll have my scripts tested by a colleague here, as it could be that my CSC
> account is somehow corrupted (I remember that Jörg could run some scripts fine
> while I got errors with them). I’ve also tried to clean my home directory of
> hidden configuration settings, but I might give that another shot when Taito is
> back running…
> @Stephan: Thanks for the update on winter school activities – I will prepare a
> quick walk-through of the MT activities.
> Best,
> Yves
> From: Scherrer, Yves
> Sent: Friday, December 21, 2018 12:49:39 PM
> To: Martin Matthiesen
> Cc: Stephan Oepen; infrastructure
> Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
> Hi,
> In my tests, removing srun indeed resolves the issue. I will update my scripts
> accordingly, and from my point of view we can “close” this discussion, although
> the underlying reasons for this changing behavior of srun are still a bit
> unclear…
> Thanks for your help anyway!
> Yves
>> On 20 Dec 2018, at 10:05, Martin Matthiesen <martin.matthiesen at csc.fi> wrote:
>> Hi again,
>> Sorry, I accidentally hit send too early.
>> So my suspicion is that some environment setting is set slightly differently now
>> than it used to be, and that this affects srun. Does removing srun from the
>> script resolve the issue, or is it only a workaround?
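For concreteness, the change under discussion amounts to roughly the following; the script content here is a placeholder, not the actual pipeline:

```shell
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1

# Variant that triggered the CUDA version error: launching the
# command as a separate job step through srun inside the batch script.
#srun python train.py -data data/demo -save_model demo-model -gpuid 0

# Workaround: run the command directly in the batch allocation,
# without starting a new job step via srun.
python train.py -data data/demo -save_model demo-model -gpuid 0
```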
>> Martin
>>> From: "Martin Matthiesen" <martin.matthiesen at csc.fi>
>>> To: "Yves Scherrer" <yves.scherrer at helsinki.fi>
>>> Cc: "Stephan Oepen" <oe at ifi.uio.no>, "infrastructure" <infrastructure at nlpl.eu>
>>> Sent: Thursday, 20 December, 2018 11:02:00
>>> Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
>>> Hello Yves and all,
>>> Here's a summary of the basic differences between srun and sbatch:
>>> [
>>> https://stackoverflow.com/questions/43767866/slurm-srun-vs-sbatch-and-their-parameters
>>> |
>>> https://stackoverflow.com/questions/43767866/slurm-srun-vs-sbatch-and-their-parameters
>>> ]
>>> --
>>> Martin Matthiesen
>>> CSC - Tieteen tietotekniikan keskus
>>> CSC - IT Center for Science
>>> PL 405, 02101 Espoo, Finland
>>> +358 9 457 2376, [ mailto:martin.matthiesen at csc.fi | martin.matthiesen at csc.fi ]
>>> Public key : [ https://pgp.mit.edu/pks/lookup?op=get&search=0x74B12876FD890704 |
>>> https://pgp.mit.edu/pks/lookup?op=get&search=0x74B12876FD890704 ]
>>> Fingerprint: AA25 6F56 5C9A 8B42 009F BA70 74B1 2876 FD89 0704
>>>> From: "Yves Scherrer" <yves.scherrer at helsinki.fi>
>>>> To: "Stephan Oepen" <oe at ifi.uio.no>
>>>> Cc: "Martin Matthiesen" <martin.matthiesen at csc.fi>, "infrastructure" <infrastructure at nlpl.eu>
>>>> Sent: Wednesday, 19 December, 2018 17:47:49
>>>> Subject: RE: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
>>>> It looks like the “srun” (present in my script, absent in Stephan’s) was the
>>>> culprit. I still have to say that I haven’t completely grasped its use – back
>>>> in Theano times, it was compulsory (at least for me, but Jörg was able to run
>>>> the same jobs without it), now it seems that it must be avoided…
>>>> Yves
>>>> From: Stephan Oepen <oe at ifi.uio.no>
>>>> Sent: Wednesday, December 19, 2018 3:48:05 PM
>>>> To: Scherrer, Yves
>>>> Cc: Martin Matthiesen; infrastructure
>>>> Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
>>>> this is weird: as in, i did not get the error, yves? i had stripped
>>>> down your script to just the training; see
>>>> /homeappl/home/oe/onmt.sh
>>>> to confirm, i submitted that job once more earlier today (‘sbatch
>>>> onmt.sh’), and it appears to be training happily for now. standard
>>>> output and error from that job should be visible to you in my home
>>>> directory.
>>>> for ultimate comparability, could you also run that job
>>>> sbatch ~oe/onmt.sh
>>>> oe
>>>> On Tue, Dec 18, 2018 at 3:15 PM Scherrer, Yves
>>>> <yves.scherrer at helsinki.fi> wrote:
>>>> > Hi,
>>>> > My error occurs right away, I don’t even get these INFO messages… This is the
>>>> > full content of the training.*.err file:
>>>> > Loading application python-3.5.3 environment with needed modules
>>>> > THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=34 error=35 :
>>>> > CUDA driver version is insufficient for CUDA runtime version
>>>> > Traceback (most recent call last):
>>>> >   File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 40, in <module>
>>>> >     main(opt)
>>>> >   File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 27, in main
>>>> >     single_main(opt)
>>>> >   File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 73, in main
>>>> >     opt = training_opt_postprocessing(opt)
>>>> >   File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 60, in training_opt_postprocessing
>>>> >     torch.cuda.set_device(opt.device_id)
>>>> >   File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torch/cuda/__init__.py", line 264, in set_device
>>>> >     torch._C._cuda_setDevice(device)
>>>> > RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at /pytorch/torch/csrc/cuda/Module.cpp:34
>>>> > Exception ignored in: <function WeakValueDictionary.__init__.<locals>.remove at 0x7ff93231b400>
>>>> > Traceback (most recent call last):
>>>> >   File "/wrk/project_nlpl/software/opennmt-py/0.2.1/lib/python3.5/weakref.py", line 117, in remove
>>>> > TypeError: 'NoneType' object is not callable
>>>> > srun: error: g110: task 0: Exited with exit code 1
>>>> > srun: Terminating job step 33310480.0
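Error 35 in the traceback above means the GPU node's NVIDIA kernel driver is older than what the CUDA runtime bundled with that PyTorch build requires. One way to record both versions for comparison is a small job-script fragment along these lines (partition name is a placeholder; output is node-specific):

```shell
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1

# Log the node's installed NVIDIA driver version.
nvidia-smi

# Log the CUDA runtime version this PyTorch build was compiled against.
python -c 'import torch; print("runtime CUDA:", torch.version.cuda)'
```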
>>>> > ________________________________
>>>> > From: Stephan Oepen <oe at ifi.uio.no>
>>>> > Sent: Tuesday, December 18, 2018 2:56:49 PM
>>>> > To: Martin Matthiesen
>>>> > Cc: Scherrer, Yves; infrastructure
>>>> > Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
>>>> > thanks for adjusting those permissions, yves!
>>>> > roughly how long into the job would you expect the error to occur?
>>>> > i have been running for around six minutes so far, and training
>>>> > appears to get going:
>>>> > 2018-12-18 14:47:43,683 INFO] encoder: 14116000
>>>> > [2018-12-18 14:47:43,683 INFO] decoder: 25862084
>>>> > [2018-12-18 14:47:43,683 INFO] * number of parameters: 39978084
>>>> > /proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torch/nn/_reduction.py:49:
>>>> > UserWarning: size_average and reduce args will be deprecated, please
>>>> > use reduction='sum' instead.
>>>> > warnings.warn(warning.format(ret))
>>>> > [2018-12-18 14:47:43,685 INFO] Start training...
>>>> > [2018-12-18 14:47:43,707 INFO] Loading train dataset from
>>>> > [ http://data.train.1.pt/ | data.train.1.pt ] , number of examples: 1030
>>>> > /proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torch/nn/functional.py:1320:
>>>> > UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead.
>>>> > warnings.warn("nn.functional.tanh is deprecated. Use torch.tanh instead.")
>>>> > [2018-12-18 14:49:15,649 INFO] Loading train dataset from
>>>> > [ http://data.train.10.pt/ | data.train.10.pt ] , number of examples: 1162
>>>> > [2018-12-18 14:50:55,474 INFO] Loading train dataset from
>>>> > [ http://data.train.100.pt/ | data.train.100.pt ] , number of examples: 1199
>>>> > [2018-12-18 14:52:13,191 INFO] Step 50/100000; acc: 5.83; ppl:
>>>> > 5884.51; xent: 8.68; lr: 1.00000; 272/262 tok/s; 269 sec
>>>> > [2018-12-18 14:52:38,496 INFO] Loading train dataset from
>>>> > [ http://data.train.1000.pt/ | data.train.1000.pt ] , number of examples: 1216
>>>> > but earlier you had sent a traceback involving a function called
>>>> > training_opt_postprocessing() ... so maybe the error only occurs
>>>> > towards the end of training? which would seem pretty weird, seeing as
>>>> > i suppose PyTorch has been used extensively up to that point already?
>>>> > oe
>>>> > On Tue, Dec 18, 2018 at 10:11 AM Martin Matthiesen
>>>> > <martin.matthiesen at csc.fi> wrote:
>>>> > > Hi,
>>>> > > I did try for an hour and a bit yesterday to pinpoint the problem, but could
>>>> > > not make head or tail of it. Did I understand correctly that you, Stephan,
>>>> > > got this working on Taito?
>>>> > > Martin
>>>> > > P.S.: Should we keep infrastructure out of this, or is this interesting to
>>>> > > Jörg and Björn?
>>>> > > --
>>>> > > Martin Matthiesen
>>>> > > CSC - Tieteen tietotekniikan keskus
>>>> > > CSC - IT Center for Science
>>>> > > PL 405, 02101 Espoo, Finland
>>>> > > +358 9 457 2376, [ mailto:martin.matthiesen at csc.fi | martin.matthiesen at csc.fi ]
>>>>> > Public key : [ https://pgp.mit.edu/pks/lookup?op=get&search=0x74B12876FD890704 |
>>>> > > https://pgp.mit.edu/pks/lookup?op=get&search=0x74B12876FD890704 ]
>>>> > > Fingerprint: AA25 6F56 5C9A 8B42 009F BA70 74B1 2876 FD89 0704
>>>> > > ________________________________
>>>> > > From: "Yves Scherrer" <yves.scherrer at helsinki.fi>
>>>> > > To: "Stephan Oepen" <oe at ifi.uio.no>
>>>> > > Cc: "Martin Matthiesen" <martin.matthiesen at csc.fi>, "infrastructure" <infrastructure at nlpl.eu>
>>>> > > Sent: Tuesday, 18 December, 2018 10:35:25
>>>> > > Subject: RE: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
>>>> > > Hi,
>>>> > > > could you make the complete data directory group- or world-readable,
>>>> > > > so i can try running the ‘train.py’ script without creating my own
>>>> > > > copy of the data?
>>>> > > That should work now.
>>>> > > > thinking (possibly over-)optimistically, maybe the problem has
>>>> > > > magically disappeared already?
>>>> > > Unfortunately, it hasn’t. Or was I supposed to reinstall the OpenNMT-py
>>>> > > module locally?
>>>> > > Yves
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.nlpl.eu/archives/infrastructure/attachments/20190201/bd3c6298/attachment.htm>