[NLPL Task Force (A)] PyTorch on Taito gpu nodes

Scherrer, Yves yves.scherrer at helsinki.fi
Tue Dec 11 08:04:51 UTC 2018


Hi again,



OpenNMT-py picks up the new version of PyTorch, but the CUDA error that I signalled a few weeks back is still there:



THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=32 error=35 : CUDA driver version is insufficient for CUDA runtime version
Traceback (most recent call last):
  File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 40, in <module>
    main(opt)
  File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 27, in main
    single_main(opt)
  File "/wrk/project_nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 73, in main
    opt = training_opt_postprocessing(opt)
  File "/wrk/project_nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 60, in training_opt_postprocessing
    torch.cuda.set_device(opt.device_id)
  File "/proj/nlpl/software/pytorch/201811/lib/python3.5/site-packages/torch/cuda/__init__.py", line 262, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at torch/csrc/cuda/Module.cpp:32
srun: error: g110: task 0: Exited with exit code 1



So in the end it looks like this error might be unrelated to the PyTorch issue we’ve seen lately... Please let me know if I can help you with debugging this.
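
For debugging, a minimal check along these lines might help to isolate the problem on the gpu node (the srun options are copied from Stephan’s earlier test below; the one-liner itself is just a sketch):

$ module purge; module load nlpl-pytorch/201811
$ srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t 15 python3 -c \
    "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"

torch.version.cuda reports the CUDA runtime version the build was compiled against; if that is newer than what the node’s driver supports (nvidia-smi prints the driver version), torch.cuda.is_available() returns False and explicit cuda calls fail with error 35.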



Yves



________________________________
From: Stephan Oepen <oe at ifi.uio.no>
Sent: Monday, December 10, 2018 11:35:05 PM
To: Scherrer, Yves
Cc: infrastructure
Subject: Re: PyTorch on Taito gpu nodes

since 'torchtext' is in our current nlpl-opennmt-py installation, i
added its dependency 'requests' to OpenNMT-py just now.  ultimately,
however, it sounds as if both should perhaps be part of the base PyTorch
...
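
a quick sanity check that both imports now resolve could be something
like the following (the module string is my best guess at our naming
scheme, so adjust as needed):

$ module purge; module load nlpl-opennmt-py/0.2.1
$ python3 -c "import torchtext, requests; print('ok')"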

so please give it another shot!  oe



On Mon, Dec 10, 2018 at 10:26 PM Scherrer, Yves
<yves.scherrer at helsinki.fi> wrote:
>
> Hi Stephan,
>
> It almost works… I’m getting the following error now:
>
>   File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torchtext/utils.py", line 2, in <module>
>     import requests
> ImportError: No module named 'requests'
>
> The situation is the following:
> - The opennmt-py/0.2.1 environment contains 'torchtext', but not its dependency 'requests'.
> - The (current) pytorch/201811 environment contains neither 'torchtext' nor its dependency 'requests'.
> - The (old) pytorch/0.4.1 environment does not contain 'torchtext', but contains 'requests', which explains why it worked before.
>
> I don’t know if it makes more sense to add the requests module to the pytorch environment or to the opennmt-py one...
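>
> (A quick way to double-check what each environment provides, assuming pip is available in it, would be to load the corresponding module and run something like:
>
>   $ python3 -m pip list 2>/dev/null | grep -i -E 'torchtext|requests'
> )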
>
> Best,
> Yves
>
> > On 10 Dec 2018, at 22:45, Stephan Oepen <oe at ifi.uio.no> wrote:
> >
> > hi yves,
> >
> > no, i have not heard more on this issue, but i would expect
> > nlpl-opennmt-py to work already now: it loads the default version of
> > nlpl-pytorch, which currently (still) is the one i installed recently,
> > i.e. the functional one.  could you give that a try please?
> >
> > best, oe
> >
> > On Mon, Dec 10, 2018 at 9:22 PM Scherrer, Yves
> > <yves.scherrer at helsinki.fi> wrote:
> >>
> >> Hi Stephan, all,
> >>
> >> Has there been any recent activity on this matter? Did you get any news from CSC about some internal changes they made? I am asking because I might need a working OpenNMT-py install soon for some of the December NLPL milestones…
> >>
> >> Best,
> >> Yves
> >>
> >>> On 28 Nov 2018, at 19:45, Stephan Oepen <oe at ifi.uio.no> wrote:
> >>>
> >>> hi yves and eivind,
> >>>
> >>> you both discovered independently today that the default NLPL
> >>> installation of PyTorch on Taito appears to have lost its support for
> >>> gpu nodes.  the software has not changed since october, so i suspect
> >>> that some system-wide upgrade of the nvidia drivers or cuda libraries
> >>> may be the cause of these problems.  i have been unable to fully track
> >>> down what happened, but ...
> >>>
> >>> it (somewhat surprisingly) appears to be the case that re-doing my
> >>> original PyTorch installation recipe (using the exact same explicit
> >>> dependencies as before, i.e. python-env/3.5.3 and cuda/9.0) results in
> >>> a functional PyTorch installation again.  for tonight, i am keeping
> >>> the original (gpu-dysfunctional) version for further debugging.  but
> >>> please try the following:
> >>>
> >>> $ module purge; module load nlpl-pytorch/201811
> >>> $ srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t 15 \
> >>>     python3 /proj/nlpl/software/pytorch/0.4.1/test.py
> >>> True
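> >>>
> >>> (for reference, test.py essentially just does
> >>>
> >>>   import torch
> >>>   print(torch.cuda.is_available())
> >>>
> >>> so True above means pytorch sees the gpu.)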
> >>>
> >>> martin, do you think it is worth checking with the CSC cuda
> >>> wizards? they might be in a much better position to guess which
> >>> system-wide external parameters have changed recently.  to experience
> >>> the problem, replace the module version '201811' with '0.4.1'; the
> >>> above test script should then output False, i.e. cuda is not available.
> >>>
> >>> when debugging earlier today, yves observed that our installation of
> >>> OpenNMT-py (which is built on top of PyTorch) complains:
> >>>
> >>> File "/proj/nlpl/software/pytorch/0.4.1/lib/python3.5/site-packages/torch/cuda/__init__.py",
> >>> line 262, in set_device
> >>>
> >>>   torch._C._cuda_setDevice(device)
> >>>
> >>> RuntimeError: cuda runtime error (35) : CUDA driver version is
> >>> insufficient for CUDA runtime version at torch/csrc/cuda/Module.cpp:32
> >>>
> >>> i am all but certain that this is the same root problem, only when
> >>> trying to initialize from OpenNMT-py we somehow run into a full-blown
> >>> exception, whereas my PyTorch test script merely reports cuda as not
> >>> being available.
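> >>>
> >>> (the two behaviours are consistent, i believe:
> >>>
> >>>   import torch
> >>>   torch.cuda.is_available()  # swallows the driver mismatch, returns False
> >>>   torch.cuda.set_device(0)   # initializes cuda directly, raises error 35
> >>>
> >>> and OpenNMT-py takes the second path via training_opt_postprocessing.)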
> >>>
> >>> i had originally promised myself and everyone who would listen that
> >>> module installations must remain unchanged, once announced publicly.
> >>> fixing what appears to be a fatal (if mysterious) flaw in our original
> >>> PyTorch 0.4.1 module, however, may present an exception to that
> >>> policy.  if we fail to work out what exactly caused the recent loss of
> >>> gpu support in that module in the next few days, i think i will just
> >>> wipe out the 0.4.1 installation and replace it with a fresh
> >>> re-installation (which must be expected to be functionally
> >>> equivalent).
> >>>
> >>> is everyone okay with that approach, in principle, infrastructure task force?
> >>>
> >>> cheers, oe
> >>
>