[NLPL Task Force (A)] PyTorch on Taito gpu nodes
Stephan Oepen
oe at ifi.uio.no
Mon Dec 10 20:45:44 UTC 2018
hi yves,
no, i have not heard more on this issue, but i would expect
nlpl-opennmt-py to work already now: it loads the default version of
nlpl-pytorch, which currently (still) is the one i installed recently,
i.e. the functional one. could you give that a try please?
best, oe
On Mon, Dec 10, 2018 at 9:22 PM Scherrer, Yves
<yves.scherrer at helsinki.fi> wrote:
>
> Hi Stephan, all,
>
> Has there been any recent activity on this matter? Did you get any news from CSC about some internal changes they made? I am asking because I might need a working OpenNMT-py install soon for some of the December NLPL milestones…
>
> Best,
> Yves
>
> > On 28 Nov 2018, at 19:45, Stephan Oepen <oe at ifi.uio.no> wrote:
> >
> > hi yves and eivind,
> >
> > you both discovered independently today that the default NLPL
> > installation of PyTorch on Taito appears to have lost its support for
> > gpu nodes. the software has not changed since october, so i suspect
> > that some system-wide upgrade of the nvidia drivers or cuda libraries
> > may be the cause of these problems. i have been unable to fully track
> > down what happened, but ...
> >
> > somewhat surprisingly, re-doing my original PyTorch installation
> > recipe (using the exact same explicit dependencies as before, i.e.
> > python-env/3.5.3 and cuda/9.0) results in a functional PyTorch
> > installation again.  for tonight, i am keeping the original
> > (gpu-dysfunctional) version around for further debugging, but
> > please try the following:
> >
> > $ module purge; module load nlpl-pytorch/201811
> > $ srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t 15 \
> >     python3 /proj/nlpl/software/pytorch/0.4.1/test.py
> > True
> >
> > martin, do you think it is worth checking with the CSC cuda
> > wizards?  they might be in a much better position to guess which
> > system-wide external parameters have changed recently.  to experience
> > the problem, replace the module version ‘201811’ with ‘0.4.1’; the
> > above test script will then output False, i.e. report cuda as not
> > available.
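[the test.py script itself is not quoted in this thread; a minimal cuda self-test in its spirit might look like the following hypothetical sketch, written to degrade gracefully when torch is not installed at all:]

```python
# hypothetical sketch of a cuda self-test in the spirit of the test.py
# mentioned above (the actual script is not quoted in this thread);
# degrades gracefully when torch is not installed
import importlib.util

def cuda_status():
    """Return 'True'/'False' for gpu availability, or a note when torch is missing."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch
    # True only when both the nvidia driver and the cuda runtime are usable
    return str(torch.cuda.is_available())

if __name__ == "__main__":
    print(cuda_status())
```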
> >
> > when debugging earlier today, yves observed that our installation of
> > OpenNMT-py (which is built on top of PyTorch) complains:
> >
> >   File "/proj/nlpl/software/pytorch/0.4.1/lib/python3.5/site-packages/torch/cuda/__init__.py", line 262, in set_device
> >     torch._C._cuda_setDevice(device)
> > RuntimeError: cuda runtime error (35) : CUDA driver version is
> > insufficient for CUDA runtime version at torch/csrc/cuda/Module.cpp:32
> >
> > i am all but certain that this is the same root problem: when
> > initializing cuda from OpenNMT-py we run into a full-blown
> > exception, whereas my PyTorch test script merely reports cuda as
> > not being available.
> >
> > i had originally promised myself and everyone who would listen that
> > module installations must remain unchanged, once announced publicly.
> > fixing what appears to be a fatal (if mysterious) flaw in our original
> > PyTorch 0.4.1 module, however, may warrant an exception to that
> > policy.  if we fail to work out what exactly caused the recent loss of
> > gpu support in that module within the next few days, i will simply
> > wipe the 0.4.1 installation and replace it with a fresh
> > re-installation (which should be functionally equivalent).
> >
> > is everyone okay with that approach, in principle, infrastructure task force?
> >
> > cheers, oe
>