[NLPL Task Force (A)] PyTorch on Taito gpu nodes

Stephan Oepen oe at ifi.uio.no
Mon Dec 10 21:35:05 UTC 2018


since ‘torchtext’ is in our current nlpl-opennmt-py installation, i
added its dependency ‘requests’ to OpenNMT-py just now.  ultimately,
however, it sounds as if maybe both should be part of the base PyTorch
...
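to double-check which of the three modules actually resolve in whichever
environment is loaded, a quick stdlib-only probe can help (nothing here is
nlpl-specific; the module names are just the ones from this thread):

```python
import importlib.util

def where(name):
    """return the file a top-level module would load from, or 'MISSING'."""
    spec = importlib.util.find_spec(name)
    if spec is not None and spec.origin:
        return spec.origin
    return "MISSING"

# the dependency chain under discussion: OpenNMT-py -> torchtext -> requests
for mod in ("torch", "torchtext", "requests"):
    print("{0:10s} -> {1}".format(mod, where(mod)))
```

running this once under each of the three module environments shows exactly
which site-packages directory, if any, supplies each package.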

so please give it another shot!  oe



On Mon, Dec 10, 2018 at 10:26 PM Scherrer, Yves
<yves.scherrer at helsinki.fi> wrote:
>
> Hi Stephan,
>
> It almost works… I’m getting the following error now:
>
>   File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torchtext/utils.py", line 2, in <module>
>     import requests
> ImportError: No module named 'requests'
>
> The situation is the following:
> - The opennmt-py/0.2.1 environment contains ‘torchtext’, but not its dependency ‘requests’.
> - The (current) pytorch/201811 environment contains neither ‘torchtext’ nor its dependency ‘requests’.
> - The (old) pytorch/0.4.1 environment does not contain ‘torchtext’, but contains ‘requests’, which explains why it worked before.
>
> I don’t know if it makes more sense to add the requests module to the pytorch environment or to the opennmt-py one...
>
> Best,
> Yves
>
> > On 10 Dec 2018, at 22:45, Stephan Oepen <oe at ifi.uio.no> wrote:
> >
> > hi yves,
> >
> > no, i have not heard more on this issue, but i would expect
> > nlpl-opennmt-py to work already now: it loads the default version of
> > nlpl-pytorch, which currently (still) is the one i installed recently,
> > i.e. the functional one.  could you give that a try please?
> >
> > best, oe
> >
> > On Mon, Dec 10, 2018 at 9:22 PM Scherrer, Yves
> > <yves.scherrer at helsinki.fi> wrote:
> >>
> >> Hi Stephan, all,
> >>
> >> Has there been any recent activity on this matter? Did you get any news from CSC about some internal changes they made? I am asking because I might need a working OpenNMT-py install soon for some of the December NLPL milestones…
> >>
> >> Best,
> >> Yves
> >>
> >>> On 28 Nov 2018, at 19:45, Stephan Oepen <oe at ifi.uio.no> wrote:
> >>>
> >>> hi yves and eivind,
> >>>
> >>> you both discovered independently today that the default NLPL
> >>> installation of PyTorch on Taito appears to have lost its support for
> >>> gpu nodes.  the software has not changed since october, so i suspect
> >>> that some system-wide upgrade of the nvidia drivers or cuda libraries
> >>> may be the cause of these problems.  i have been unable to fully track
> >>> down what happened, but ...
> >>>
> >>> it (somewhat surprisingly) appears to be the case that re-doing my
> >>> original PyTorch installation recipe (using the exact same explicit
> >>> dependencies as before, i.e. python-env/3.5.3 and cuda/9.0) results in
> >>> a functional PyTorch installation again.  for tonight, i am keeping
> >>> the original (gpu-dysfunctional) version for further debugging.  but
> >>> please try the following:
> >>>
> >>> $ module purge; module load nlpl-pytorch/201811
> >>> $ srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t 15 \
> >>> python3 /proj/nlpl/software/pytorch/0.4.1/test.py
> >>> True
> >>>
> >>> martin, do you think it is worth checking with the CSC cuda
> >>> wizards—they might be in a much better position to guess which
> >>> system-wide external parameters have changed recently?  to experience
> >>> the problem, replace the module version ‘201811’ with ‘0.4.1’; the
> >>> above test script should output False, i.e. cuda is not available.
> >>>
> >>> when debugging earlier today, yves observed that our installation of
> >>> OpenNMT-py (which is built on top of PyTorch) complains:
> >>>
> >>>   File "/proj/nlpl/software/pytorch/0.4.1/lib/python3.5/site-packages/torch/cuda/__init__.py", line 262, in set_device
> >>>     torch._C._cuda_setDevice(device)
> >>> RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at torch/csrc/cuda/Module.cpp:32
> >>>
> >>> i am all but certain that this is the same root problem, only when
> >>> trying to initialize from OpenNMT-py we somehow run into a full-blown
> >>> exception, whereas my PyTorch test script merely reports cuda as not
> >>> being available.
> >>>
> >>> i had originally promised myself and everyone who would listen that
> >>> module installations must remain unchanged, once announced publicly.
> >>> fixing what appears to be a fatal (if mysterious) flaw in our original
> >>> PyTorch 0.4.1 module, however, may present an exception to that
> >>> policy.  if we fail to work out what exactly caused the recent loss of
> >>> gpu support in that module in the next few days, i think i will just
> >>> wipe out the 0.4.1 installation and replace it with a fresh
> >>> re-installation (which should be functionally equivalent).
> >>>
> >>> is everyone okay with that approach, in principle, infrastructure task force?
> >>>
> >>> cheers, oe
> >>
>
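p.s. for reference, the test.py invoked above almost certainly boils down to
printing torch.cuda.is_available(); a minimal, import-guarded sketch (a
hypothetical reconstruction, the actual script may do more):

```python
def cuda_available():
    """True iff torch is importable and sees a usable gpu.

    note that torch.cuda.is_available() swallows driver problems and
    just returns False, whereas torch.cuda.set_device() raises a
    RuntimeError, which would explain why OpenNMT-py hits a full-blown
    exception while the test script merely prints False.
    """
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())

if __name__ == "__main__":
    print(cuda_available())
```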



