[NLPL Task Force (A)] PyTorch on Taito gpu nodes
Scherrer, Yves
yves.scherrer at helsinki.fi
Mon Dec 10 21:26:03 UTC 2018
Hi Stephan,
It almost works… I’m getting the following error now:
File "/proj/nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/torchtext/utils.py", line 2, in <module>
import requests
ImportError: No module named 'requests'
The situation is the following:
- The opennmt-py/0.2.1 environment contains 'torchtext', but not its dependency 'requests'.
- The (current) pytorch/201811 environment contains neither 'torchtext' nor its dependency 'requests'.
- The (old) pytorch/0.4.1 environment does not contain 'torchtext', but contains 'requests', which explains why it worked before.
I don’t know if it makes more sense to add the requests module to the pytorch environment or to the opennmt-py one...
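For reference, this is how the missing dependency shows up from a login shell (a minimal check; I am assuming the module name is nlpl-opennmt-py/0.2.1, matching the install path in the traceback):
$ module purge; module load nlpl-opennmt-py/0.2.1
$ python3 -c "import requests"    # ImportError: No module named 'requests'
$ python3 -c "import torchtext"   # same error, raised from torchtext/utils.py line 2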
Best,
Yves
> On 10 Dec 2018, at 22:45, Stephan Oepen <oe at ifi.uio.no> wrote:
>
> hi yves,
>
> no, i have not heard more on this issue, but i would expect
> nlpl-opennmt-py to work already now: it loads the default version of
> nlpl-pytorch, which currently (still) is the one i installed recently,
> i.e. the functional one. could you give that a try please?
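>
> something like the following should confirm which pytorch version gets
> pulled in (exact version strings may differ, of course):
>
> $ module purge; module load nlpl-opennmt-py
> $ module list    # nlpl-pytorch/201811 should appear among the loaded modules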
>
> best, oe
>
> On Mon, Dec 10, 2018 at 9:22 PM Scherrer, Yves
> <yves.scherrer at helsinki.fi> wrote:
>>
>> Hi Stephan, all,
>>
>> Has there been any recent activity on this matter? Did you get any news from CSC about some internal changes they made? I am asking because I might need a working OpenNMT-py install soon for some of the December NLPL milestones…
>>
>> Best,
>> Yves
>>
>>> On 28 Nov 2018, at 19:45, Stephan Oepen <oe at ifi.uio.no> wrote:
>>>
>>> hi yves and eivind,
>>>
>>> you both discovered independently today that the default NLPL
>>> installation of PyTorch on Taito appears to have lost its support for
>>> gpu nodes. the software has not changed since october, so i suspect
>>> that some system-wide upgrade of the nvidia drivers or cuda libraries
>>> may be the cause of these problems. i have been unable to fully track
>>> down what happened, but ...
>>>
>>> it (somewhat surprisingly) appears to be the case that re-doing my
>>> original PyTorch installation recipe (using the exact same explicit
>>> dependencies as before, i.e. python-env/3.5.3 and cuda/9.0) results in
>>> a functional PyTorch installation again. for tonight, i am keeping
>>> the original (gpu-dysfunctional) version for further debugging. but
>>> please try the following:
>>>
>>> $ module purge; module load nlpl-pytorch/201811
>>> $ srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t 15 \
>>> python3 /proj/nlpl/software/pytorch/0.4.1/test.py
>>> True
>>>
>>> martin, do you think it is worth checking with the CSC cuda wizards?
>>> they might be in a much better position to guess which system-wide
>>> external parameters have changed recently. to experience the problem,
>>> replace the module version '201811' with '0.4.1'; the above test
>>> script should then output False, i.e. cuda is not available.
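>>>
>>> for completeness, the failing case then looks like this (same test
>>> script, only the module version changed):
>>>
>>> $ module purge; module load nlpl-pytorch/0.4.1
>>> $ srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t 15 \
>>>     python3 /proj/nlpl/software/pytorch/0.4.1/test.py
>>> False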
>>>
>>> when debugging earlier today, yves observed that our installation of
>>> OpenNMT-py (which is built on top of PyTorch) complains:
>>>
>>> File "/proj/nlpl/software/pytorch/0.4.1/lib/python3.5/site-packages/torch/cuda/__init__.py",
>>> line 262, in set_device
>>>
>>> torch._C._cuda_setDevice(device)
>>>
>>> RuntimeError: cuda runtime error (35) : CUDA driver version is
>>> insufficient for CUDA runtime version at torch/csrc/cuda/Module.cpp:32
>>>
>>> i am all but certain that this is the same root problem; it is just
>>> that initializing cuda from OpenNMT-py somehow runs into a full-blown
>>> exception, whereas my PyTorch test script merely reports cuda as not
>>> being available.
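>>>
>>> presumably the difference boils down to something like the following
>>> (a sketch, not the actual scripts; the second call fails as in the
>>> traceback above):
>>>
>>> $ python3 -c "import torch; print(torch.cuda.is_available())"
>>> False
>>> $ python3 -c "import torch; torch.cuda.set_device(0)"
>>> RuntimeError: cuda runtime error (35) : CUDA driver version is
>>> insufficient for CUDA runtime version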
>>>
>>> i had originally promised myself and everyone who would listen that
>>> module installations must remain unchanged, once announced publicly.
>>> fixing what appears to be a fatal (if mysterious) flaw in our original
>>> PyTorch 0.4.1 module, however, may present an exception to that
>>> policy. if we fail to work out what exactly caused the recent loss of
>>> gpu support in that module in the next few days, i think i will just
>>> wipe out the 0.4.1 installation and replace it with a fresh
>>> re-installation (which must be expected to be functionally
>>> equivalent).
>>>
>>> is everyone okay with that approach, in principle, infrastructure task force?
>>>
>>> cheers, oe
>>