[NLPL Task Force (A)] PyTorch on Taito gpu nodes
Scherrer, Yves
yves.scherrer at helsinki.fi
Mon Dec 10 21:26:03 UTC 2018
Hi Stephan,
It almost works… I’m getting the following error now:
File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torchtext/utils.py", line 2, in <module>
import requests
ImportError: No module named 'requests'
The situation is the following:
- The opennmt-py/0.2.1 environment contains 'torchtext', but not its dependency 'requests'.
- The (current) pytorch/201811 environment contains neither 'torchtext' nor its dependency 'requests'.
- The (old) pytorch/0.4.1 environment does not contain 'torchtext', but contains 'requests', which explains why it worked before.
I don’t know if it makes more sense to add the requests module to the pytorch environment or to the opennmt-py one...
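For reference, this is how the missing dependency shows up from a login shell (a minimal check; I am assuming the module name is nlpl-opennmt-py/0.2.1, matching the install path in the traceback):
$ module purge; module load nlpl-opennmt-py/0.2.1
$ python3 -c "import requests"    # ImportError: No module named 'requests'
$ python3 -c "import torchtext"   # same error, raised from torchtext/utils.py line 2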
Best,
Yves
> On 10 Dec 2018, at 22:45, Stephan Oepen <oe at ifi.uio.no> wrote:
>
> hi yves,
>
> no, i have not heard more on this issue, but i would expect
> nlpl-opennmt-py to work already now: it loads the default version of
> nlpl-pytorch, which currently (still) is the one i installed recently,
> i.e. the functional one. could you give that a try please?
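>
> something like the following should confirm which pytorch version gets
> pulled in (exact version strings may differ, of course):
>
> $ module purge; module load nlpl-opennmt-py
> $ module list    # nlpl-pytorch/201811 should appear among the loaded modules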
>
> best, oe
>
> On Mon, Dec 10, 2018 at 9:22 PM Scherrer, Yves
> <yves.scherrer at helsinki.fi> wrote:
>>
>> Hi Stephan, all,
>>
>> Has there been any recent activity on this matter? Did you get any news from CSC about some internal changes they made? I am asking because I might need a working OpenNMT-py install soon for some of the December NLPL milestones…
>>
>> Best,
>> Yves
>>
>>> On 28 Nov 2018, at 19:45, Stephan Oepen <oe at ifi.uio.no> wrote:
>>>
>>> hi yves and eivind,
>>>
>>> you both discovered independently today that the default NLPL
>>> installation of PyTorch on Taito appears to have lost its support for
>>> gpu nodes. the software has not changed since october, so i suspect
>>> that some system-wide upgrade of the nvidia drivers or cuda libraries
>>> may be the cause of these problems. i have been unable to fully track
>>> down what happened, but ...
>>>
>>> it (somewhat surprisingly) appears to be the case that re-doing my
>>> original PyTorch installation recipe (using the exact same explicit
>>> dependencies as before, i.e. python-env/3.5.3 and cuda/9.0) results in
>>> a functional PyTorch installation again. for tonight, i am keeping
>>> the original (gpu-dysfunctional) version for further debugging. but
>>> please try the following:
>>>
>>> $ module purge; module load nlpl-pytorch/201811
>>> $ srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t 15 \
>>> python3 /proj/nlpl/software/pytorch/0.4.1/test.py
>>> True
>>>
>>> martin, do you think it is worth checking with the CSC cuda wizards?
>>> they might be in a much better position to guess which system-wide
>>> external parameters have changed recently. to experience the problem,
>>> replace the module version '201811' with '0.4.1'; the above test
>>> script should then output False, i.e. cuda is not available.
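>>>
>>> for completeness, the failing case then looks like this (same test
>>> script, only the module version changed):
>>>
>>> $ module purge; module load nlpl-pytorch/0.4.1
>>> $ srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t 15 \
>>>     python3 /proj/nlpl/software/pytorch/0.4.1/test.py
>>> False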
>>>
>>> when debugging earlier today, yves observed that our installation of
>>> OpenNMT-py (which is built on top of PyTorch) complains:
>>>
>>> File "/proj/nlpl/software/pytorch/0.4.1/lib/python3.5/site-packages/torch/cuda/__init__.py",
>>> line 262, in set_device
>>>
>>> torch._C._cuda_setDevice(device)
>>>
>>> RuntimeError: cuda runtime error (35) : CUDA driver version is
>>> insufficient for CUDA runtime version at torch/csrc/cuda/Module.cpp:32
>>>
>>> i am all but certain that this is the same root problem; it is just
>>> that initializing cuda from OpenNMT-py somehow runs into a full-blown
>>> exception, whereas my PyTorch test script merely reports cuda as not
>>> being available.
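>>>
>>> presumably the difference boils down to something like the following
>>> (a sketch, not the actual scripts; the second call fails as in the
>>> traceback above):
>>>
>>> $ python3 -c "import torch; print(torch.cuda.is_available())"
>>> False
>>> $ python3 -c "import torch; torch.cuda.set_device(0)"
>>> RuntimeError: cuda runtime error (35) : CUDA driver version is
>>> insufficient for CUDA runtime version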
>>>
>>> i had originally promised myself and everyone who would listen that
>>> module installations must remain unchanged, once announced publicly.
>>> fixing what appears to be a fatal (if mysterious) flaw in our original
>>> PyTorch 0.4.1 module, however, may present an exception to that
>>> policy. if we fail to work out what exactly caused the recent loss of
>>> gpu support in that module in the next few days, i think i will just
>>> wipe out the 0.4.1 installation and replace it with a fresh
>>> re-installation (which must be expected to be functionally
>>> equivalent).
>>>
>>> is everyone okay with that approach, in principle, infrastructure task force?
>>>
>>> cheers, oe
>>