[NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
Scherrer, Yves
yves.scherrer at helsinki.fi
Fri Dec 21 10:49:40 UTC 2018
Hi,
In my tests, removing srun indeed resolves the issue. I will update my scripts accordingly, and from my point of view we can “close” this discussion, although the underlying reasons for this change in srun’s behavior are still a bit unclear…
Thanks for your help anyway!
Yves
On 20 Dec 2018, at 10:05, Martin Matthiesen <martin.matthiesen at csc.fi> wrote:
Hi again,
Sorry, I accidentally hit send too early.
So my suspicion is that some environment setting is now set slightly differently than it used to be, and that this affects srun. Does removing srun from the script resolve the issue, or is it only a workaround?
Martin
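The failure in the quoted traceback, `cuda runtime error (35)`, means the node’s kernel driver supports a lower CUDA version than the runtime PyTorch was built against. A minimal sketch of that version check in shell; the version numbers are made-up examples, and the two commands in the comments are the usual (here unverified on Abel) way to obtain the real values on a GPU node:

```shell
#!/bin/sh
# On a GPU node, the two sides of the comparison would come from
# (commands assumed, not verified on Abel):
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader   # driver side
#   python -c "import torch; print(torch.version.cuda)"           # runtime side

# ver_ge A B: succeeds if version A >= version B (relies on GNU `sort -V`).
ver_ge() {
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Made-up example values: a driver that only supports CUDA 9.0 cannot serve
# a CUDA 9.2 runtime, which is exactly the "driver version is insufficient
# for CUDA runtime version" situation behind error 35.
if ver_ge "9.0" "9.2"; then
    echo "driver can serve this runtime"
else
    echo "driver too old for this runtime"
fi
```

As written, with the made-up values, this prints "driver too old for this runtime".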
________________________________
From: "Martin Matthiesen" <martin.matthiesen at csc.fi>
To: "Yves Scherrer" <yves.scherrer at helsinki.fi>
Cc: "Stephan Oepen" <oe at ifi.uio.no>, "infrastructure" <infrastructure at nlpl.eu>
Sent: Thursday, 20 December, 2018 11:02:00
Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
Hello Yves and all,
Here's a summary of the basic differences between srun and sbatch:
https://stackoverflow.com/questions/43767866/slurm-srun-vs-sbatch-and-their-parameters
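In short, the difference can be sketched like this (a minimal job script; partition, time limit and file names are placeholders, not Abel’s actual setup):

```shell
#!/bin/bash
#SBATCH --job-name=onmt-train
#SBATCH --partition=gpu          # placeholder, not Abel's real partition name
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00

# Submitted with `sbatch onmt.sh`, the body of this script runs once on the
# allocated node, and the process inherits the GPU environment of the
# allocation directly.
python train.py ...

# Prefixing the command with `srun` instead launches it as a separate job
# step, which may receive a slightly different environment; that is the
# suspected cause of the CUDA driver mismatch in this thread:
# srun python train.py ...
```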
--
Martin Matthiesen
CSC - Tieteen tietotekniikan keskus
CSC - IT Center for Science
PL 405, 02101 Espoo, Finland
+358 9 457 2376, martin.matthiesen at csc.fi
Public key : https://pgp.mit.edu/pks/lookup?op=get&search=0x74B12876FD890704
Fingerprint: AA25 6F56 5C9A 8B42 009F BA70 74B1 2876 FD89 0704
________________________________
From: "Yves Scherrer" <yves.scherrer at helsinki.fi>
To: "Stephan Oepen" <oe at ifi.uio.no>
Cc: "Martin Matthiesen" <martin.matthiesen at csc.fi>, "infrastructure" <infrastructure at nlpl.eu>
Sent: Wednesday, 19 December, 2018 17:47:49
Subject: RE: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
It looks like the “srun” (present in my script, absent in Stephan’s) was the culprit. I have to admit that I still haven’t completely grasped its use: back in Theano times, it was compulsory (at least for me, though Jörg was able to run the same jobs without it); now it seems that it must be avoided…
Yves
________________________________
From: Stephan Oepen <oe at ifi.uio.no<mailto:oe at ifi.uio.no>>
Sent: Wednesday, December 19, 2018 3:48:05 PM
To: Scherrer, Yves
Cc: Martin Matthiesen; infrastructure
Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
this is weird: somehow i did not get the error, yves? i had stripped
down your script to just the training; see
/homeappl/home/oe/onmt.sh
to confirm, i launched that job once more earlier today (‘sbatch
onmt.sh’), and it appears to be training happily for now. standard
output and error from that job should be visible to you in my home
directory.
for ultimate comparability, could you also run that job?
sbatch ~oe/onmt.sh
oe
On Tue, Dec 18, 2018 at 3:15 PM Scherrer, Yves
<yves.scherrer at helsinki.fi> wrote:
>
> Hi,
>
> My error occurs right away, I don’t even get these INFO messages… This is the full content of the training.*.err file:
>
> Loading application python-3.5.3 environment with needed modules
> THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=34 error=35 : CUDA driver version is insufficient for CUDA runtime version
> Traceback (most recent call last):
>   File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 40, in <module>
>     main(opt)
>   File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 27, in main
>     single_main(opt)
>   File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 73, in main
>     opt = training_opt_postprocessing(opt)
>   File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 60, in training_opt_postprocessing
>     torch.cuda.set_device(opt.device_id)
>   File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torch/cuda/__init__.py", line 264, in set_device
>     torch._C._cuda_setDevice(device)
> RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at /pytorch/torch/csrc/cuda/Module.cpp:34
> Exception ignored in: <function WeakValueDictionary.__init__.<locals>.remove at 0x7ff93231b400>
> Traceback (most recent call last):
>   File "/wrk/project_nlpl/software/opennmt-py/0.2.1/lib/python3.5/weakref.py", line 117, in remove
> TypeError: 'NoneType' object is not callable
> srun: error: g110: task 0: Exited with exit code 1
> srun: Terminating job step 33310480.0
>
> ________________________________
> From: Stephan Oepen <oe at ifi.uio.no>
> Sent: Tuesday, December 18, 2018 2:56:49 PM
> To: Martin Matthiesen
> Cc: Scherrer, Yves; infrastructure
> Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
>
> thanks for adjusting those permissions, yves!
>
> roughly how long into the job would you expect the error to occur?
>
> i have been running for around six minutes so far, and training
> appears to get going:
>
> [2018-12-18 14:47:43,683 INFO] encoder: 14116000
> [2018-12-18 14:47:43,683 INFO] decoder: 25862084
> [2018-12-18 14:47:43,683 INFO] * number of parameters: 39978084
> /proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torch/nn/_reduction.py:49: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
>   warnings.warn(warning.format(ret))
> [2018-12-18 14:47:43,685 INFO] Start training...
> [2018-12-18 14:47:43,707 INFO] Loading train dataset from data.train.1.pt, number of examples: 1030
> /proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torch/nn/functional.py:1320: UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead.
>   warnings.warn("nn.functional.tanh is deprecated. Use torch.tanh instead.")
> [2018-12-18 14:49:15,649 INFO] Loading train dataset from data.train.10.pt, number of examples: 1162
> [2018-12-18 14:50:55,474 INFO] Loading train dataset from data.train.100.pt, number of examples: 1199
> [2018-12-18 14:52:13,191 INFO] Step 50/100000; acc: 5.83; ppl: 5884.51; xent: 8.68; lr: 1.00000; 272/262 tok/s; 269 sec
> [2018-12-18 14:52:38,496 INFO] Loading train dataset from data.train.1000.pt, number of examples: 1216
>
> but earlier you had sent a traceback involving a function called
> training_opt_postprocessing() ... so maybe the error only occurs
> towards the end of training? which would seem pretty weird, seeing as
> i suppose PyTorch has been used extensively up to that point already?
>
> oe
>
>
>
> On Tue, Dec 18, 2018 at 10:11 AM Martin Matthiesen
> <martin.matthiesen at csc.fi> wrote:
> >
> > Hi,
> >
> > I did try for an hour and a bit yesterday to pinpoint the problem, but could not make head or tail of it. Did I understand correctly that Stephan, you got this working on Taito?
> >
> > Martin
> >
> > P.S.: Should we keep infrastructure out of this or is this interesting to Jörg and Björn?
> >
> >
> > ________________________________
> >
> > From: "Yves Scherrer" <yves.scherrer at helsinki.fi>
> > To: "Stephan Oepen" <oe at ifi.uio.no>
> > Cc: "Martin Matthiesen" <martin.matthiesen at csc.fi>, "infrastructure" <infrastructure at nlpl.eu>
> > Sent: Tuesday, 18 December, 2018 10:35:25
> > Subject: RE: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
> >
> > Hi,
> >
> > > could you make the complete data directory group- or world-readable,
> > > so i can try running the ‘train.py’ script without creating my own
> > > copy of the data?
> >
> > That should work now.
> >
> > > thinking (possibly over-)optimistically, maybe the problem has
> > > magically disappeared already?
> >
> > Unfortunately, it hasn’t. Or was I supposed to reinstall the OpenNMT-py module locally?
> >
> > Yves