[NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)

Scherrer, Yves yves.scherrer at helsinki.fi
Mon Dec 17 13:35:19 UTC 2018


Hi Martin,

Yes, the problem still occurs. Please have a look at the train.sh SLURM script in /wrk/yvessche/onmt_test3 – this script worked fine when Stephan first installed OpenNMT, but it has been failing for the past couple of weeks.
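
To give an idea of its shape: it is an entirely ordinary job script, roughly along these lines (a simplified sketch, not the exact file; placeholders in angle brackets):

#!/bin/bash
#SBATCH -J onmt_train
#SBATCH -p gpu
#SBATCH --gres=gpu:k80:1
#SBATCH --mem=8g
#SBATCH -t 24:00:00

module use -a /proj/nlpl/software/modulefiles/
module load nlpl-opennmt-py
srun python3 /proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py \
    -data <preprocessed data prefix> -save_model <model prefix>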

Best,
Yves

________________________________
From: Martin Matthiesen <martin.matthiesen at csc.fi>
Sent: Monday, December 17, 2018 1:36:32 PM
To: Scherrer, Yves
Cc: Stephan Oepen; infrastructure
Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)

Hi Yves,

I am sorry for the long silence. I wanted to ask: are your problems still current? If so, could you send me a hint on how to reproduce them?

Regards,
Martin

--
Martin Matthiesen
CSC - Tieteen tietotekniikan keskus
CSC - IT Center for Science
PL 405, 02101 Espoo, Finland
+358 9 457 2376, martin.matthiesen at csc.fi
Public key : https://pgp.mit.edu/pks/lookup?op=get&search=0x74B12876FD890704
Fingerprint: AA25 6F56 5C9A 8B42 009F  BA70 74B1 2876 FD89 0704

________________________________
From: "Yves Scherrer" <yves.scherrer at helsinki.fi>
To: "Stephan Oepen" <oe at ifi.uio.no>
Cc: "Martin Matthiesen" <martin.matthiesen at csc.fi>, "infrastructure" <infrastructure at nlpl.eu>
Sent: Wednesday, 28 November, 2018 16:55:27
Subject: RE: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)

Curiously enough, when I run the OpenNMT training script that worked fine a month ago, I get this error now:



THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=32 error=35 : CUDA driver version is insufficient for CUDA runtime version
Traceback (most recent call last):
  File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 40, in <module>
    main(opt)
  File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 27, in main
    single_main(opt)
  File "/wrk/project_nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 73, in main
    opt = training_opt_postprocessing(opt)
  File "/wrk/project_nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 60, in training_opt_postprocessing
    torch.cuda.set_device(opt.device_id)
  File "/proj/nlpl/software/pytorch/0.4.1/lib/python3.5/site-packages/torch/cuda/__init__.py", line 262, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at torch/csrc/cuda/Module.cpp:32



I have no idea whether this is related to the PyTorch issue, but could it be that the CUDA driver or libraries got updated on Taito in the meantime?
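
If it helps to narrow things down, this is the quick check I would run on a GPU node (just a sketch; nvidia-smi reports the installed driver version, and torch.version.cuda reports the runtime this PyTorch build was compiled against):

# driver version appears in the nvidia-smi header
nvidia-smi | head -n 3
python3 -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"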



Yves



________________________________
From: Stephan Oepen <oe at ifi.uio.no>
Sent: Wednesday, November 28, 2018 4:43:47 PM
To: Scherrer, Yves
Cc: Martin Matthiesen; infrastructure
Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)

well, then it should not be too hard to get the PyTorch installation
on Taito to work on the gpu nodes :-).

i will have a look now ...

oe

On Wed, Nov 28, 2018 at 3:38 PM Scherrer, Yves
<yves.scherrer at helsinki.fi> wrote:
>
> Hi,
>
>
>
> I did my OpenNMT-py experiments on both Abel and Taito.
>
> On Taito, I got training speeds of about 13000 tokens/s, on Abel it was about 4000 tokens/s.
>
> A colleague who used an independent OpenNMT-py module on Taito-GPU during the summer obtained about 9000 tokens/s with a different dataset.
>
> I also just started a CPU-only training run on Taito, which got around 1000 tokens/s.
>
> This leads me to believe that my experiments – at least those on Taito – did use the GPU…
>
>
>
> Best,
>
> Yves
>
>
>
> ________________________________
> From: Stephan Oepen <oe at ifi.uio.no>
> Sent: Wednesday, November 28, 2018 4:08:46 PM
> To: Scherrer, Yves
> Cc: Martin Matthiesen; infrastructure
> Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
>
> as for the OpenNMT-py experiments, did you do those on Abel or Taito,
> or both?  using gpus on Taito?  in other words, do you believe that
> OpenNMT-py (in contrast to PyTorch) works on Taito gpu nodes?
>
> oe
>
> On Wed, Nov 28, 2018 at 2:47 PM Scherrer, Yves
> <yves.scherrer at helsinki.fi> wrote:
> >
> > Hi,
> >
> >
> >
> > I’m following up on this one with a related issue. I am testing PyTorch independently of OpenNMT-py, but cannot get it to run on (Taito-)GPU.
> >
> >
> >
> > Specifically, even though I am logged in to Taito-GPU, I cannot get the test script described on the Wiki page to return True:
> >
> >
> >
> > [GPU-Env lstmtagger]$ srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t 15 --pty python3 /proj/nlpl/software/pytorch/0.4.1/test.py
> >
> > srun: job 32089470 queued and waiting for resources
> > srun: job 32089470 has been allocated resources
> > False
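> >
> > (test.py is not shown here, but it is presumably just the availability check that Stephan used further down in this thread, i.e. something along the lines of:)
> >
> > import torch
> > print(torch.cuda.is_available())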
> >
> >
> >
> > I also get ‘False’ when running the following script through sbatch:
> >
> >
> >
> > #!/bin/bash
> > #SBATCH -J cudatest
> > #SBATCH -o cudatest.%j.out
> > #SBATCH -e cudatest.%j.err
> > #SBATCH -t 0:05:00
> > #SBATCH -p gputest
> > #SBATCH -N 1
> > #SBATCH --gres=gpu:k80:1
> > #SBATCH --mem=1g
> >
> > module use -a /proj/nlpl/software/modulefiles/
> > module load nlpl-pytorch
> > srun python3 /proj/nlpl/software/pytorch/0.4.1/test.py
> >
> >
> >
> > Has there been any change lately? Or am I missing something obvious?
> >
> >
> >
> > Best,
> >
> > Yves
> >
> >
> >
> >
> >
> > ________________________________
> > From: Stephan Oepen <oe at ifi.uio.no>
> > Sent: Wednesday, September 26, 2018 11:10:12 PM
> > To: Scherrer, Yves
> > Cc: Martin Matthiesen; infrastructure
> > Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
> >
> > hi again,
> >
> > > i actually had a go at my own glibc and PyTorch installations on Taito, but
> > > so far gpu support has proven elusive.
> >
> > actually, with a little more tinkering, i now believe i might have a
> > working installation of PyTorch 0.4.1 and OpenNMT-py 0.2.1 on Taito
> > too, seemingly functional on both cpu and gpu nodes:
> >
> > [oe at taito-login4 ~]$ module purge
> > [oe at taito-login4 ~]$ module load nlpl-opennmt-py
> > Loading application python-3.5.3 environment with needed modules
> > [oe at taito-login4 ~]$ module list
> >
> > Currently Loaded Modules:
> >   1) gcc/5.4.0   2) intelmpi/5.1.3   3) mkl/11.3.2   4) python/3.5.3
> >   5) python-env/3.5.3   6) nlpl-pytorch/0.4.1   7) nlpl-opennmt-py/0.2.1
> >
> > [oe at taito-login4 ~]$ type -all python
> > python is /proj/nlpl/software/opennmt-py/0.2.1/bin/python
> > python is /proj/nlpl/software/pytorch/0.4.1/bin/python
> > python is /appl/opt/python/3.5.3-gnu540/bin/python
> > python is /usr/bin/python
> > [oe at taito-login4 ~]$ python -c "import torch; import onmt;
> > print(torch.cuda.is_available());"
> > False
> >
> > [oe at taito-login4 ~]$ srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t
> > 15 --pty \
> >   python -c "import torch; import onmt; print(torch.cuda.is_available());"
> > True
> >
> > —yves (or joerg), i would have a hard time testing things in much more
> > depth.  any chance you would have some time to try and replicate the
> > validation steps you are currently running on Abel, on Taito too?
> >
> > with a sense of accomplishment :-), oe
