[NLPL Task Force (A)] PyTorch on Taito gpu nodes
Scherrer, Yves
yves.scherrer at helsinki.fi
Mon Dec 10 20:22:12 UTC 2018
Hi Stephan, all,
Has there been any recent activity on this matter? Did you get any news from CSC about some internal changes they made? I am asking because I might need a working OpenNMT-py install soon for some of the December NLPL milestones…
Best,
Yves
> On 28 Nov 2018, at 19:45, Stephan Oepen <oe at ifi.uio.no> wrote:
>
> hi yves and eivind,
>
> you both discovered independently today that the default NLPL
> installation of PyTorch on Taito appears to have lost its support for
> gpu nodes. the software has not changed since october, so i suspect
> that some system-wide upgrade of the nvidia drivers or cuda libraries
> may be the cause of these problems. i have been unable to fully track
> down what happened, but ...
>
> it (somewhat surprisingly) appears to be the case that re-doing my
> original PyTorch installation recipe (using the exact same explicit
> dependencies as before, i.e. python-env/3.5.3 and cuda/9.0) results in
> a functional PyTorch installation again. for tonight, i am keeping
> the original (gpu-dysfunctional) version for further debugging. but
> please try the following:
>
> $ module purge; module load nlpl-pytorch/201811
> $ srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t 15 \
> python3 /proj/nlpl/software/pytorch/0.4.1/test.py
> True
>
> martin, do you think it is worth checking with the CSC cuda
> wizards—they might be in a much better position to guess which
> system-wide external parameters have changed recently? to experience
> the problem, replace the module version ‘201811’ with ‘0.4.1’; the
> above test script should output False, i.e. cuda is not available.
>
> when debugging earlier today, yves observed that our installation of
> OpenNMT-py (which is built on top of PyTorch) complains:
>
>   File "/proj/nlpl/software/pytorch/0.4.1/lib/python3.5/site-packages/torch/cuda/__init__.py", line 262, in set_device
>     torch._C._cuda_setDevice(device)
> RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at torch/csrc/cuda/Module.cpp:32
>
> i am all but certain that this is the same root problem, only when
> trying to initialize from OpenNMT-py we somehow run into a full-blown
> exception, whereas my PyTorch test script merely reports cuda as not
> being available.
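
For illustration only: a minimal sketch of a check that reports both symptoms described above at once, namely the availability flag and any error raised when selecting a device. This is not the contents of the test.py mentioned in the message, which are not shown here; the function name cuda_status and its output format are hypothetical. It assumes only the standard PyTorch calls torch.cuda.is_available(), torch.cuda.set_device() and the torch.version.cuda attribute.

```python
def cuda_status(device=0):
    """Return a one-line summary of PyTorch's view of CUDA (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        return "torch not installed"

    # CUDA runtime version this PyTorch build was compiled against (None for CPU-only builds).
    runtime = torch.version.cuda
    # Quiet check: returns False instead of raising when the driver is unusable.
    available = torch.cuda.is_available()

    # Explicit device selection, which is where the quoted traceback originates.
    # The broad except is deliberate: this is a diagnostic, and the exception
    # type differs between driver problems and CPU-only builds.
    error = None
    try:
        torch.cuda.set_device(device)
    except Exception as err:
        error = str(err)

    return "cuda available={}, built for CUDA {}, set_device error={}".format(
        available, runtime, error)


if __name__ == "__main__":
    print(cuda_status())
```

Run on a gpu node under each of the two module versions, a check along these lines would be expected to show whether the quiet False and the runtime error (35) go together.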
>
> i had originally promised myself and everyone who would listen that
> module installations must remain unchanged, once announced publicly.
> fixing what appears to be a fatal (if mysterious) flaw in our original
> PyTorch 0.4.1 module, however, may present an exception to that
> policy. if we fail to work out what exactly caused the recent loss of
> gpu support in that module in the next few days, i think i will just
> wipe out the 0.4.1 installation and replace it with a fresh
> re-installation (which must be expected to be functionally
> equivalent).
>
> is everyone okay with that approach, in principle, infrastructure task force?
>
> cheers, oe