[NLPL Task Force (A)] PyTorch on Taito gpu nodes

Stephan Oepen oe at ifi.uio.no
Wed Nov 28 17:45:03 UTC 2018


hi yves and eivind,

you both discovered independently today that the default NLPL
installation of PyTorch on Taito appears to have lost its support for
gpu nodes.  the software has not changed since october, so i suspect
that some system-wide upgrade of the nvidia drivers or cuda libraries
may be the cause of these problems.  i have been unable to fully track
down what happened, but ...

it (somewhat surprisingly) appears to be the case that re-doing my
original PyTorch installation recipe (using the exact same explicit
dependencies as before, i.e. python-env/3.5.3 and cuda/9.0) results in
a functional PyTorch installation again.  for tonight, i am keeping
the original (gpu-dysfunctional) version for further debugging.  but
please try the following:

$ module purge; module load nlpl-pytorch/201811
$ srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t 15 \
  python3 /proj/nlpl/software/pytorch/0.4.1/test.py
True
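
for reference, the test script essentially just asks PyTorch whether it
can see a cuda device; a minimal sketch in that spirit (the actual
test.py under /proj/nlpl/software/pytorch/0.4.1/ may well do a little
more, e.g. run a small tensor operation on the gpu) would be:

  # hypothetical sketch of a cuda availability check; not the real test.py
  import torch
  print(torch.cuda.is_available())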

martin, do you think it is worth checking with the CSC cuda wizards?
they might be in a much better position to guess which system-wide
external parameters have changed recently.  to reproduce the problem,
replace the module version ‘201811’ with ‘0.4.1’; the above test
script should then output False, i.e. cuda is not available.

when debugging earlier today, yves observed that our installation of
OpenNMT-py (which is built on top of PyTorch) complains:

  File "/proj/nlpl/software/pytorch/0.4.1/lib/python3.5/site-packages/torch/cuda/__init__.py", line 262, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at torch/csrc/cuda/Module.cpp:32

i am all but certain that this is the same root problem; when
initializing cuda from OpenNMT-py we somehow run into a full-blown
exception, whereas my PyTorch test script merely reports cuda as not
being available.
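
if it helps the CSC cuda wizards, a quick way to see which cuda runtime
each of our modules was actually built against (and whether the driver
will serve it) is something like the following, run once under each
module and compared against the driver version that nvidia-smi reports
on the gpu node.  this is just a debugging sketch, not part of the
existing test script:

  # hedged debugging sketch: report the cuda runtime pytorch was built
  # against and whether the installed driver can actually serve it
  import torch
  print('torch version:         ', torch.__version__)
  print('built for cuda runtime:', torch.version.cuda)
  print('cuda available:        ', torch.cuda.is_available())
  if torch.cuda.is_available():
      print('gpu device:', torch.cuda.get_device_name(0))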

i had originally promised myself and everyone who would listen that
module installations must remain unchanged once announced publicly.
fixing what appears to be a fatal (if mysterious) flaw in our original
PyTorch 0.4.1 module, however, may warrant an exception to that
policy.  if we fail to work out within the next few days what exactly
caused the recent loss of gpu support in that module, i think i will
just wipe out the 0.4.1 installation and replace it with a fresh
re-installation (which should be functionally equivalent).

is everyone okay with that approach, in principle, infrastructure task force?

cheers, oe



