[NLPL Task Force (A)] PyTorch on Taito gpu nodes

Mon Dec 10 20:22:12 UTC 2018

Hi Stephan, all,

Has there been any recent activity on this matter? Did you get any news from CSC about some internal changes they made? I am asking because I might need a working OpenNMT-py install soon for some of the December NLPL milestones…

Best,
Yves

> On 28 Nov 2018, at 19:45, Stephan Oepen <oe at ifi.uio.no> wrote:
> 
> hi yves and eivind,
> 
> you both discovered independently today that the default NLPL
> installation of PyTorch on Taito appears to have lost its support for
> gpu nodes.  the software has not changed since october, so i suspect
> that some system-wide upgrade of the nvidia drivers or cuda libraries
> may be the cause of these problems.  i have been unable to fully track
> down what happened, but ...
> 
> it (somewhat surprisingly) appears to be the case that re-doing my
> original PyTorch installation recipe (using the exact same explicit
> dependencies as before, i.e. python-env/3.5.3 and cuda/9.0) results in
> a functional PyTorch installation again.  for tonight, i am keeping
> the original (gpu-dysfunctional) version for further debugging.  but
> please try the following:
> 
> $ module purge; module load nlpl-pytorch/201811
> $ srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t 15 \
>  python3 /proj/nlpl/software/pytorch/0.4.1/test.py
> True
> 
> martin, do you think it is worth checking with the CSC cuda
> wizards—they might be in a much better position to guess which
> system-wide external parameters have changed recently?  to experience
> the problem, replace the module version ‘201811’ with ‘0.4.1’; the
> above test script should output False, i.e. cuda is not available.
> 
> when debugging earlier today, yves observed that our installation of
> OpenNMT-py (which is built on top of PyTorch) complains:
> 
>   File "/proj/nlpl/software/pytorch/0.4.1/lib/python3.5/site-packages/torch/cuda/__init__.py", line 262, in set_device
>     torch._C._cuda_setDevice(device)
> RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at torch/csrc/cuda/Module.cpp:32
> 
> i am all but certain that this is the same root problem, only when
> trying to initialize from OpenNMT-py we somehow run into a full-blown
> exception, whereas my PyTorch test script merely reports cuda as not
> being available.
> 
> i had originally promised myself and everyone who would listen that
> module installations must remain unchanged, once announced publicly.
> fixing what appears to be a fatal (if mysterious) flaw in our original
> PyTorch 0.4.1 module, however, may present an exception to that
> policy.  if we fail to work out what exactly caused the recent loss of
> gpu support in that module in the next few days, i think i will just
> wipe out the 0.4.1 installation and replace it with a fresh
> re-installation (which must be expected to be functionally
> equivalent).
> 
> is everyone okay with that approach, in principle, infrastructure task force?
> 
> cheers, oe
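
The two symptoms described in the quoted message (the test script printing False, and OpenNMT-py raising a RuntimeError from set_device) can be told apart with a small check along the following lines. This is only a sketch: the contents of the actual test.py are not shown in this thread, the function name cuda_report is made up for illustration, and device index 0 is an assumption. It uses %-formatting so that it should also run under python-env/3.5.3.

```python
def cuda_report():
    """Return a one-line summary of whether PyTorch can use a GPU.

    Hypothetical helper, not the NLPL test.py; device index 0 is assumed.
    """
    try:
        import torch
    except ImportError:
        # PyTorch itself is missing from the current environment.
        return "torch not importable"
    try:
        if not torch.cuda.is_available():
            # The symptom reported by the plain PyTorch test script.
            return "cuda available: False"
        # The call that raised in the OpenNMT-py traceback.
        torch.cuda.set_device(0)
        return "cuda available: True (%s)" % torch.cuda.get_device_name(0)
    except RuntimeError as error:
        # For example a driver/runtime version mismatch (error 35).
        return "cuda runtime error: %s" % error


print(cuda_report())
```

Run through srun on a gpu node, once per module version, a script like this should give one summary line per installation, which may make it easier to compare the two.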