Hi,

Thanks for looking into this, Stephan. I confirm that the 201811
version of PyTorch is running correctly on GPU.

I have no objection to replacing the current 0.4.1 installation with a
working one.

Best,
Yves

________________________________
From: Stephan Oepen <oe@ifi.uio.no>
Sent: Wednesday, November 28, 2018 7:45:03 PM
To: Scherrer, Yves; Eivind Alexander Bergem
Cc: infrastructure
Subject: PyTorch on Taito gpu nodes

<div class="PlainText">hi yves and eivind,<br>
<br>
you both discovered independently today that the default NLPL<br>
installation of PyTorch on Taito appears to have lost its support for<br>
gpu nodes. the software has not changed since october, so i suspect<br>
that some system-wide upgrade of the nvidia drivers or cuda libraries<br>
may be the cause of these problems. i have been unable to fully track<br>
down what happened, but ...<br>

it (somewhat surprisingly) appears that re-doing my original PyTorch
installation recipe (using the exact same explicit dependencies as
before, i.e. python-env/3.5.3 and cuda/9.0) results in a functional
PyTorch installation again. for tonight, i am keeping the original
(gpu-dysfunctional) version around for further debugging. but please
try the following:

  $ module purge; module load nlpl-pytorch/201811
  $ srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t 15 \
      python3 /proj/nlpl/software/pytorch/0.4.1/test.py
  True
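
(judging from its True/False output, test.py presumably amounts to
something like the following; the actual contents of
/proj/nlpl/software/pytorch/0.4.1/test.py are assumed here, not
copied:)

  import torch
  # report whether pytorch can see a usable cuda driver and device
  print(torch.cuda.is_available())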

martin, do you think it is worth checking with the CSC cuda wizards?
they might be in a much better position to guess which system-wide
external parameters have changed recently. to reproduce the problem,
replace the module version ‘201811’ with ‘0.4.1’; the above test
script should then output False, i.e. cuda is not available.
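
spelled out, the failing case is the same invocation with only the
module version changed:

  $ module purge; module load nlpl-pytorch/0.4.1
  $ srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t 15 \
      python3 /proj/nlpl/software/pytorch/0.4.1/test.py
  False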

when debugging earlier today, yves observed that our installation of
OpenNMT-py (which is built on top of PyTorch) complains:

  File "/proj/nlpl/software/pytorch/0.4.1/lib/python3.5/site-packages/torch/cuda/__init__.py", line 262, in set_device
    torch._C._cuda_setDevice(device)
  RuntimeError: cuda runtime error (35) : CUDA driver version is
  insufficient for CUDA runtime version at torch/csrc/cuda/Module.cpp:32

i am all but certain that this is the same root problem: when
initializing cuda from OpenNMT-py we somehow run into a full-blown
exception, whereas my PyTorch test script merely reports cuda as not
being available.
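
(that difference is plausible: torch.cuda.is_available() is designed
to return False rather than raise when the driver cannot be used,
while torch.cuda.set_device() forces cuda initialization and
propagates the failure. a minimal illustration, assuming the broken
0.4.1 module is loaded:)

  import torch
  # probes for a usable driver; swallows the failure and returns False
  print(torch.cuda.is_available())   # prints: False
  # forces cuda initialization; the driver mismatch surfaces as an exception
  torch.cuda.set_device(0)           # RuntimeError: cuda runtime error (35)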

i had originally promised myself, and everyone who would listen, that
module installations must remain unchanged once announced publicly.
fixing what appears to be a fatal (if mysterious) flaw in our original
PyTorch 0.4.1 module, however, may warrant an exception to that
policy. if we fail to work out in the next few days what exactly
caused the recent loss of gpu support in that module, i think i will
just wipe the 0.4.1 installation and replace it with a fresh
re-installation (which should be functionally equivalent).

infrastructure task force: is everyone okay with that approach, in principle?

cheers, oe