[NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)

Mon Dec 17 11:36:32 UTC 2018

Hi Yves, 

I am sorry for the long silence. I wanted to ask: are your problems still current? If so, could you send me a hint on how to reproduce them?

Regards, 
Martin 

-- 
Martin Matthiesen 
CSC - Tieteen tietotekniikan keskus 
CSC - IT Center for Science 
PL 405, 02101 Espoo, Finland 
+358 9 457 2376, martin.matthiesen at csc.fi 
Public key : https://pgp.mit.edu/pks/lookup?op=get&search=0x74B12876FD890704 
Fingerprint: AA25 6F56 5C9A 8B42 009F BA70 74B1 2876 FD89 0704 

> From: "Yves Scherrer" <yves.scherrer at helsinki.fi>
> To: "Stephan Oepen" <oe at ifi.uio.no>
> Cc: "Martin Matthiesen" <martin.matthiesen at csc.fi>, "infrastructure"
> <infrastructure at nlpl.eu>
> Sent: Wednesday, 28 November, 2018 16:55:27
> Subject: RE: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)

> Curiously enough, when I run the OpenNMT training script that worked fine a
> month ago, I get this error now:

> THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=32 error=35 : CUDA driver version is insufficient for CUDA runtime version

> Traceback (most recent call last):
>   File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 40, in <module>
>     main(opt)
>   File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 27, in main
>     single_main(opt)
>   File "/wrk/project_nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 73, in main
>     opt = training_opt_postprocessing(opt)
>   File "/wrk/project_nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 60, in training_opt_postprocessing
>     torch.cuda.set_device(opt.device_id)
>   File "/proj/nlpl/software/pytorch/0.4.1/lib/python3.5/site-packages/torch/cuda/__init__.py", line 262, in set_device
>     torch._C._cuda_setDevice(device)
> RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at torch/csrc/cuda/Module.cpp:32

> I have no idea if this is related to the PyTorch issue, but could it be that
> some CUDA code got updated on Taito in the meantime?
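
One way to narrow this down is to compare the CUDA runtime that PyTorch was built against with the NVIDIA driver installed on the node. The following is only a sketch under assumptions: it is a hypothetical helper, not part of the NLPL installation, it assumes `nvidia-smi` is on the PATH of the GPU node, and it reads the build-time CUDA version from `torch.version.cuda`. It avoids f-strings so that it should also run under the Python 3.5 used in this thread.

```python
import shutil
import subprocess


def driver_version():
    """Return the NVIDIA driver version string from nvidia-smi, or None."""
    if shutil.which("nvidia-smi") is None:
        return None  # not on a GPU node, or the tool is not on PATH
    try:
        proc = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version",
             "--format=csv,noheader"],
            stdout=subprocess.PIPE, stderr=subprocess.DEVNULL,
            universal_newlines=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return None
    lines = proc.stdout.strip().splitlines()
    if proc.returncode != 0 or not lines:
        return None
    return lines[0].strip()  # one line per GPU; the first is enough here


def cuda_runtime_version():
    """Return the CUDA version PyTorch was built with, or None."""
    try:
        import torch
    except ImportError:
        return None  # torch is not importable in this environment
    return torch.version.cuda  # None for CPU-only builds


def collect():
    """Gather both versions so they can be compared side by side."""
    return {"driver": driver_version(),
            "cuda_runtime": cuda_runtime_version()}


if __name__ == "__main__":
    info = collect()
    print("NVIDIA driver version: {}".format(info["driver"]))
    print("CUDA runtime (PyTorch build): {}".format(info["cuda_runtime"]))
```

Run through `srun` on a GPU node, the two values can then be checked against the minimum driver version that NVIDIA lists for each CUDA release. If the driver turns out to be older than the runtime requires, that would be consistent with error 35 above.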

> Yves

> From: Stephan Oepen <oe at ifi.uio.no>
> Sent: Wednesday, November 28, 2018 4:43:47 PM
> To: Scherrer, Yves
> Cc: Martin Matthiesen; infrastructure
> Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)

> well, then it should not be too hard to get the PyTorch installation
> on Taito to work on the gpu nodes :-).

> i will have a look now ...

> oe

> On Wed, Nov 28, 2018 at 3:38 PM Scherrer, Yves
> <yves.scherrer at helsinki.fi> wrote:

> > Hi,

> > I did my OpenNMT-py experiments on both Abel and Taito.

> > On Taito, I got training speeds of about 13000 tokens/s, on Abel it was about
> > 4000 tokens/s.

> > A colleague who used an independent OpenNMT-py module on Taito-GPU during the
> > summer obtained about 9000 tokens/s with a different dataset.

> > I also just started a CPU-only training run on Taito, which got around 1000
> > tokens/s.

> > This leads me to believe that my experiments – at least those on Taito – did use
> > the GPU…
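
The reasoning above can be made explicit by expressing each reported speed as a multiple of the CPU-only run. This is only a rough sketch: the figures are the approximate ones quoted in the message, the runs used different machines and in one case a different dataset, and the names in the code are illustrative.

```python
# Approximate training speeds quoted in the message, in tokens per second.
speeds = {
    "Taito (NLPL module)": 13000,
    "Taito (colleague, other dataset)": 9000,
    "Abel": 4000,
}
cpu_baseline = 1000  # CPU-only training run on Taito


def speedups(rates, baseline):
    """Return each rate as a multiple of the CPU-only baseline."""
    return {name: rate / baseline for name, rate in rates.items()}


ratios = speedups(speeds, cpu_baseline)
for name, ratio in sorted(ratios.items()):
    print("{}: {:.1f}x the CPU-only speed".format(name, ratio))
```

A run that is many times faster than the CPU-only baseline on the same cluster is hard to explain without the GPU being used, which is the point made above for Taito. The Abel figure is compared against a Taito CPU run, so it says less.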

> > Best,

> > Yves

> > ________________________________
> > From: Stephan Oepen <oe at ifi.uio.no>
> > Sent: Wednesday, November 28, 2018 4:08:46 PM
> > To: Scherrer, Yves
> > Cc: Martin Matthiesen; infrastructure
> > Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)

> > as for the OpenNMT-py experiments, did you do those on Abel or Taito,
> > or both? using gpus on Taito? in other words, do you believe that
> > OpenNMT-py (in contrast to PyTorch) works on Taito gpu nodes?

> > oe

> > On Wed, Nov 28, 2018 at 2:47 PM Scherrer, Yves
> > <yves.scherrer at helsinki.fi> wrote:

> > > Hi,

> > > I’m following up on this one with a related issue. I am testing PyTorch
> > > independently of OpenNMT-py, but cannot get it to run on (Taito-)GPU.

> > > Specifically, although I was logged in to Taito-GPU, I cannot get the test
> > > script described on the Wiki page to return True:

> > > [GPU-Env lstmtagger]$ srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t 15 --pty python3 /proj/nlpl/software/pytorch/0.4.1/test.py

> > > srun: job 32089470 queued and waiting for resources

> > > srun: job 32089470 has been allocated resources

> > > False

> > > I also get ‘False’ when running the following script through sbatch:

> > > #SBATCH -J cudatest
> > > #SBATCH -o cudatest.%j.out
> > > #SBATCH -e cudatest.%j.err
> > > #SBATCH -t 0:05:00
> > > #SBATCH -p gputest
> > > #SBATCH -N 1
> > > #SBATCH --gres=gpu:k80:1
> > > #SBATCH --mem=1g

> > > module use -a /proj/nlpl/software/modulefiles/
> > > module load nlpl-pytorch

> > > srun python3 /proj/nlpl/software/pytorch/0.4.1/test.py

> > > Has there been any change lately? Or am I missing something obvious?
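
When a check like this prints False, it can help to record where the job actually ran and which Python and PyTorch it picked up. The sketch below is a hypothetical diagnostic, not the `test.py` referred to above (its contents are not shown in this thread). It assumes that Slurm exports `SLURM_JOB_ID` and, when a GPU is granted through `--gres`, `CUDA_VISIBLE_DEVICES`; the latter depends on how the site has configured GPU resources.

```python
import os
import socket
import sys


def job_environment(environ=None):
    """Collect details that show where and how the job is running."""
    env = os.environ if environ is None else environ
    report = {
        "host": socket.gethostname(),
        "python": sys.executable,
        # Assumed to be set by Slurm when a GPU is granted via --gres
        "CUDA_VISIBLE_DEVICES": env.get("CUDA_VISIBLE_DEVICES"),
        "SLURM_JOB_ID": env.get("SLURM_JOB_ID"),
        "torch_file": None,
        "cuda_available": None,
    }
    try:
        import torch
    except ImportError:
        return report  # torch not importable: wrong python or module set
    report["torch_file"] = torch.__file__  # which installation was loaded
    report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in sorted(job_environment().items()):
        print("{}: {}".format(key, value))
```

Running the same script once on the login node and once through `srun` or `sbatch` should show whether the job landed on a GPU node and whether the PyTorch under /proj/nlpl/software/pytorch/0.4.1 is the one being imported.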

> > > Best,

> > > Yves

> > > ________________________________
> > > From: Stephan Oepen <oe at ifi.uio.no>
> > > Sent: Wednesday, September 26, 2018 11:10:12 PM
> > > To: Scherrer, Yves
> > > Cc: Martin Matthiesen; infrastructure
> > > Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)

> > > hi again,

> > > > i actually had a go at my own glibc and PyTorch installations on Taito, but
> > > > so far gpu support is elusive.

> > > actually, with a little more tinkering, i now believe i might have a
> > > working installation of PyTorch 0.4.1 and OpenNMT-py 0.2.1 on Taito
> > > too, seemingly functional on both cpu and gpu nodes:

> > > [oe at taito-login4 ~]$ module purge
> > > [oe at taito-login4 ~]$ module load nlpl-opennmt-py
> > > Loading application python-3.5.3 environment with needed modules
> > > [oe at taito-login4 ~]$ module list

> > > Currently Loaded Modules:
> > > 1) gcc/5.4.0 2) intelmpi/5.1.3 3) mkl/11.3.2 4) python/3.5.3
> > > 5) python-env/3.5.3 6) nlpl-pytorch/0.4.1 7) nlpl-opennmt-py/0.2.1

> > > [oe at taito-login4 ~]$ type -all python
> > > python is /proj/nlpl/software/opennmt-py/0.2.1/bin/python
> > > python is /proj/nlpl/software/pytorch/0.4.1/bin/python
> > > python is /appl/opt/python/3.5.3-gnu540/bin/python
> > > python is /usr/bin/python
> > > [oe at taito-login4 ~]$ python -c "import torch; import onmt; print(torch.cuda.is_available());"
> > > False

> > > [oe at taito-login4 ~]$ srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t 15 --pty \
> > > python -c "import torch; import onmt; print(torch.cuda.is_available());"
> > > True

> > > —yves (or joerg), i would have a hard time testing things in much more
> > > depth. any chance you would have some time to try and replicate the
> > > validation steps you are currently running on Abel on Taito too?

> > > with a sense of accomplishment :-), oe