[NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)

Tue Dec 18 14:14:49 UTC 2018

Hi,

My error occurs right away; I don’t even get these INFO messages. This is the full content of the training.*.err file:

Loading application python-3.5.3 environment with needed modules
THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=34 error=35 : CUDA driver version is insufficient for CUDA runtime version
Traceback (most recent call last):
  File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 40, in <module>
    main(opt)
  File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 27, in main
    single_main(opt)
  File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 73, in main
    opt = training_opt_postprocessing(opt)
  File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 60, in training_opt_postprocessing
    torch.cuda.set_device(opt.device_id)
  File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torch/cuda/__init__.py", line 264, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at /pytorch/torch/csrc/cuda/Module.cpp:34
Exception ignored in: <function WeakValueDictionary.__init__.<locals>.remove at 0x7ff93231b400>
Traceback (most recent call last):
  File "/wrk/project_nlpl/software/opennmt-py/0.2.1/lib/python3.5/weakref.py", line 117, in remove
TypeError: 'NoneType' object is not callable
srun: error: g110: task 0: Exited with exit code 1
srun: Terminating job step 33310480.0
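
To narrow down a "driver version is insufficient" failure like the one above, it may help to compare the CUDA runtime that the installed PyTorch was built against with the driver that the compute node actually provides. The following is only a minimal sketch, not part of the NLPL installation: the function name cuda_diagnostics is made up for illustration, and it assumes that nvidia-smi is on the PATH of the node where the job runs. It would have to be run inside the same batch job environment (same modules, same node type) as the failing training run to be meaningful.

```python
import shutil
import subprocess


def cuda_diagnostics():
    """Collect facts relevant to a CUDA driver/runtime mismatch.

    Hypothetical helper for illustration; nothing here is specific
    to OpenNMT-py or to the NLPL module setup.
    """
    info = {
        "torch_installed": False,
        "torch_cuda_runtime": None,  # CUDA version PyTorch was built with
        "cuda_available": False,     # whether PyTorch can use a GPU here
        "driver_version": None,      # NVIDIA driver reported by nvidia-smi
    }

    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None:
        info["torch_installed"] = True
        info["torch_cuda_runtime"] = torch.version.cuda
        info["cuda_available"] = bool(torch.cuda.is_available())

    # Assumption: nvidia-smi is installed on the node; skip quietly if not.
    if shutil.which("nvidia-smi"):
        try:
            result = subprocess.run(
                ["nvidia-smi", "--query-gpu=driver_version",
                 "--format=csv,noheader"],
                stdout=subprocess.PIPE,
                stderr=subprocess.DEVNULL,
                universal_newlines=True,
                timeout=10,
            )
            lines = result.stdout.strip().splitlines()
            if result.returncode == 0 and lines:
                info["driver_version"] = lines[0].strip()
        except (OSError, subprocess.TimeoutExpired):
            pass

    return info


if __name__ == "__main__":
    for key, value in sorted(cuda_diagnostics().items()):
        print("{}: {}".format(key, value))
```

If cuda_available comes back False while a driver version is reported, the driver on that node is probably older than what the reported runtime needs; which driver a given runtime requires has to be looked up in the NVIDIA release notes.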

________________________________
From: Stephan Oepen <oe at ifi.uio.no>
Sent: Tuesday, December 18, 2018 2:56:49 PM
To: Martin Matthiesen
Cc: Scherrer, Yves; infrastructure
Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)

thanks for adjusting those permissions, yves!

roughly how long into the job would you expect the error to occur?

i have been running for around six minutes so far, and training
appears to get going:

[2018-12-18 14:47:43,683 INFO] encoder: 14116000
[2018-12-18 14:47:43,683 INFO] decoder: 25862084
[2018-12-18 14:47:43,683 INFO] * number of parameters: 39978084
/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torch/nn/_reduction.py:49:
UserWarning: size_average and reduce args will be deprecated, please
use reduction='sum' instead.
  warnings.warn(warning.format(ret))
[2018-12-18 14:47:43,685 INFO] Start training...
[2018-12-18 14:47:43,707 INFO] Loading train dataset from
data.train.1.pt, number of examples: 1030
/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torch/nn/functional.py:1320:
UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead.
  warnings.warn("nn.functional.tanh is deprecated. Use torch.tanh instead.")
[2018-12-18 14:49:15,649 INFO] Loading train dataset from
data.train.10.pt, number of examples: 1162
[2018-12-18 14:50:55,474 INFO] Loading train dataset from
data.train.100.pt, number of examples: 1199
[2018-12-18 14:52:13,191 INFO] Step 50/100000; acc:   5.83; ppl:
5884.51; xent: 8.68; lr: 1.00000; 272/262 tok/s;    269 sec
[2018-12-18 14:52:38,496 INFO] Loading train dataset from
data.train.1000.pt, number of examples: 1216

but earlier you had sent a traceback involving a function called
training_opt_postprocessing() ... so maybe the error only occurs
towards the end of training?  which would seem pretty weird, seeing as
i suppose PyTorch has been used extensively up to that point already?

oe

On Tue, Dec 18, 2018 at 10:11 AM Martin Matthiesen
<martin.matthiesen at csc.fi> wrote:
>
> Hi,
>
> I did try for an hour and a bit yesterday to pinpoint the problem, but could not make head or tail of it. Did I understand correctly that Stephan, you got this working on Taito?
>
> Martin
>
> P.S.: Should we keep infrastructure out of this or is this interesting to Jörg and Björn?
>
> --
> Martin Matthiesen
> CSC - Tieteen tietotekniikan keskus
> CSC - IT Center for Science
> PL 405, 02101 Espoo, Finland
> +358 9 457 2376, martin.matthiesen at csc.fi
> Public key : https://pgp.mit.edu/pks/lookup?op=get&search=0x74B12876FD890704
> Fingerprint: AA25 6F56 5C9A 8B42 009F  BA70 74B1 2876 FD89 0704
>
> ________________________________
>
> From: "Yves Scherrer" <yves.scherrer at helsinki.fi>
> To: "Stephan Oepen" <oe at ifi.uio.no>
> Cc: "Martin Matthiesen" <martin.matthiesen at csc.fi>, "infrastructure" <infrastructure at nlpl.eu>
> Sent: Tuesday, 18 December, 2018 10:35:25
> Subject: RE: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
>
> Hi,
>
>
>
> > could you make the complete data directory group- or world-readable,
> > so i can try running the ‘train.py’ script without creating my own
> > copy of the data?
>
>
>
> That should work now.
>
>
>
> > thinking (possibly over-)optimistically, maybe the problem has
> > magically disappeared already?
>
>
>
> Unfortunately, it hasn’t. Or was I supposed to reinstall the OpenNMT-py module locally?
>
>
>
> Yves
>
>
>
>