[NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
Stephan Oepen
oe at ifi.uio.no
Mon Dec 17 22:12:14 UTC 2018
hi yves,
i am sorry i had lost track of this pending problem for the past week
or two! we were hosting TLT here in oslo this past week (and both my
kids have birthdays in december :-), so NLPL just had to sit on the
back-burner for a little while ...
i am currently trying to reproduce the problem, and it appears that
just ‘import onmt’ is not enough to get to the point of failure,
right?
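judging from the traceback further down, the failure seems to be in
torch.cuda.set_device(), so perhaps the point of failure can be reached
without the data at all? a minimal sketch, reusing the gputest invocation
from below and assuming device 0:

  srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t 15 --pty \
    python -c "import torch; torch.cuda.set_device(0)"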
could you make the complete data directory group- or world-readable,
so i can try running the ‘train.py’ script without creating my own
copy of the data?
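(something along these lines should be enough, assuming the data sits
under /wrk/yvessche/onmt_test3; the capital X adds execute/search
permission on the directories:

  chmod -R a+rX /wrk/yvessche/onmt_test3

or g+rX instead, if you prefer to keep it group-only.)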
best, oe
On Mon, Dec 17, 2018 at 2:36 PM Scherrer, Yves
<yves.scherrer at helsinki.fi> wrote:
>
> Hi Martin,
>
>
>
> Yes, the problem still occurs. Please have a look at the train.sh SLURM script in /wrk/yvessche/onmt_test3 – this script worked fine when Stephan first installed OpenNMT, but has been failing for the past couple of weeks.
>
>
>
> Best,
>
> Yves
>
>
>
> ________________________________
> From: Martin Matthiesen <martin.matthiesen at csc.fi>
> Sent: Monday, December 17, 2018 1:36:32 PM
> To: Scherrer, Yves
> Cc: Stephan Oepen; infrastructure
> Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
>
> Hi Yves,
>
> I am sorry for the long silence. I wanted to ask: is this problem still current? If so, could you send me a hint on how to reproduce it?
>
> Regards,
> Martin
>
> --
> Martin Matthiesen
> CSC - Tieteen tietotekniikan keskus
> CSC - IT Center for Science
> PL 405, 02101 Espoo, Finland
> +358 9 457 2376, martin.matthiesen at csc.fi
> Public key : https://pgp.mit.edu/pks/lookup?op=get&search=0x74B12876FD890704
> Fingerprint: AA25 6F56 5C9A 8B42 009F BA70 74B1 2876 FD89 0704
>
> ________________________________
>
> From: "Yves Scherrer" <yves.scherrer at helsinki.fi>
> To: "Stephan Oepen" <oe at ifi.uio.no>
> Cc: "Martin Matthiesen" <martin.matthiesen at csc.fi>, "infrastructure" <infrastructure at nlpl.eu>
> Sent: Wednesday, 28 November, 2018 16:55:27
> Subject: RE: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
>
> Curiously enough, when I run the OpenNMT training script that worked fine a month ago, I get this error now:
>
>
>
> THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=32 error=35 : CUDA driver version is insufficient for CUDA runtime version
> Traceback (most recent call last):
>   File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 40, in <module>
>     main(opt)
>   File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 27, in main
>     single_main(opt)
>   File "/wrk/project_nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 73, in main
>     opt = training_opt_postprocessing(opt)
>   File "/wrk/project_nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 60, in training_opt_postprocessing
>     torch.cuda.set_device(opt.device_id)
>   File "/proj/nlpl/software/pytorch/0.4.1/lib/python3.5/site-packages/torch/cuda/__init__.py", line 262, in set_device
>     torch._C._cuda_setDevice(device)
> RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at torch/csrc/cuda/Module.cpp:32
>
>
>
> I have no idea whether this is related to the PyTorch issue, but could it be that the CUDA installation (drivers or libraries) on Taito was updated in the meantime?
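>
> One way to check this (just a sketch, reusing the srun invocation from my test further down) would be to compare the driver version that nvidia-smi reports on a GPU node with the CUDA runtime version PyTorch was built against:
>
> srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t 15 --pty nvidia-smi
> python3 -c "import torch; print(torch.version.cuda)"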
>
>
>
> Yves
>
>
>
> ________________________________
> From: Stephan Oepen <oe at ifi.uio.no>
> Sent: Wednesday, November 28, 2018 4:43:47 PM
> To: Scherrer, Yves
> Cc: Martin Matthiesen; infrastructure
> Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
>
> well, then it should not be too hard to get the PyTorch installation
> on Taito to work on the gpu nodes :-).
>
> i will have a look now ...
>
> oe
>
> On Wed, Nov 28, 2018 at 3:38 PM Scherrer, Yves
> <yves.scherrer at helsinki.fi> wrote:
> >
> > Hi,
> >
> >
> >
> > I did my OpenNMT-py experiments on both Abel and Taito.
> >
> > On Taito, I got training speeds of about 13000 tokens/s; on Abel, it was about 4000 tokens/s.
> >
> > A colleague who used an independent OpenNMT-py module on Taito-GPU during the summer obtained about 9000 tokens/s with a different dataset.
> >
> > I also just started a CPU-only training run on Taito, which got around 1000 tokens/s.
> >
> > This leads me to believe that my experiments – at least those on Taito – did use the GPU…
> >
> >
> >
> > Best,
> >
> > Yves
> >
> >
> >
> > ________________________________
> > From: Stephan Oepen <oe at ifi.uio.no>
> > Sent: Wednesday, November 28, 2018 4:08:46 PM
> > To: Scherrer, Yves
> > Cc: Martin Matthiesen; infrastructure
> > Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
> >
> > as for the OpenNMT-py experiments, did you do those on Abel or Taito,
> > or both? using gpus on Taito? in other words, do you believe that
> > OpenNMT-py (in contrast to PyTorch) works on Taito gpu nodes?
> >
> > oe
> >
> > On Wed, Nov 28, 2018 at 2:47 PM Scherrer, Yves
> > <yves.scherrer at helsinki.fi> wrote:
> > >
> > > Hi,
> > >
> > >
> > >
> > > I’m following up on this one with a related issue. I am testing PyTorch independently of OpenNMT-py, but cannot get it to run on (Taito-)GPU.
> > >
> > >
> > >
> > > Specifically, even though I am logged in to Taito-GPU, I cannot get the test script described on the Wiki page to return True:
> > >
> > >
> > >
> > > [GPU-Env lstmtagger]$ srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t 15 --pty python3 /proj/nlpl/software/pytorch/0.4.1/test.py
> > > srun: job 32089470 queued and waiting for resources
> > > srun: job 32089470 has been allocated resources
> > > False
> > >
> > >
> > >
> > > I also get ‘False’ when running the following script through sbatch:
> > >
> > >
> > >
> > > #!/bin/bash
> > > #SBATCH -J cudatest
> > > #SBATCH -o cudatest.%j.out
> > > #SBATCH -e cudatest.%j.err
> > > #SBATCH -t 0:05:00
> > > #SBATCH -p gputest
> > > #SBATCH -N 1
> > > #SBATCH --gres=gpu:k80:1
> > > #SBATCH --mem=1g
> > >
> > > module use -a /proj/nlpl/software/modulefiles/
> > > module load nlpl-pytorch
> > >
> > > srun python3 /proj/nlpl/software/pytorch/0.4.1/test.py
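> > >
> > > (For what it's worth, I assume test.py essentially boils down to the following one-liner; that is just a guess based on its True/False output:)
> > >
> > > python3 -c "import torch; print(torch.cuda.is_available())"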
> > >
> > >
> > >
> > > Has there been any change lately? Or am I missing something obvious?
> > >
> > >
> > >
> > > Best,
> > >
> > > Yves
> > >
> > >
> > >
> > >
> > >
> > > ________________________________
> > > From: Stephan Oepen <oe at ifi.uio.no>
> > > Sent: Wednesday, September 26, 2018 11:10:12 PM
> > > To: Scherrer, Yves
> > > Cc: Martin Matthiesen; infrastructure
> > > Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
> > >
> > > hi again,
> > >
> > > > i actually had a go at my own glibc and PyTorch installations on Taito, but
> > > > so far gpu support has been elusive.
> > >
> > > actually, with a little more tinkering, i now believe i might have a
> > > working installation of PyTorch 0.4.1 and OpenNMT-py 0.2.1 on Taito
> > > too, seemingly functional on both cpu and gpu nodes:
> > >
> > > [oe at taito-login4 ~]$ module purge
> > > [oe at taito-login4 ~]$ module load nlpl-opennmt-py
> > > Loading application python-3.5.3 environment with needed modules
> > > [oe at taito-login4 ~]$ module list
> > >
> > > Currently Loaded Modules:
> > > 1) gcc/5.4.0 2) intelmpi/5.1.3 3) mkl/11.3.2 4) python/3.5.3
> > > 5) python-env/3.5.3 6) nlpl-pytorch/0.4.1 7) nlpl-opennmt-py/0.2.1
> > >
> > > [oe at taito-login4 ~]$ type -all python
> > > python is /proj/nlpl/software/opennmt-py/0.2.1/bin/python
> > > python is /proj/nlpl/software/pytorch/0.4.1/bin/python
> > > python is /appl/opt/python/3.5.3-gnu540/bin/python
> > > python is /usr/bin/python
> > > [oe at taito-login4 ~]$ python -c "import torch; import onmt; print(torch.cuda.is_available());"
> > > False
> > >
> > > [oe at taito-login4 ~]$ srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t 15 --pty \
> > >     python -c "import torch; import onmt; print(torch.cuda.is_available());"
> > > True
> > >
> > > yves (or joerg), i would have a hard time testing things in much more
> > > depth myself. any chance you would have some time to try to replicate
> > > on Taito the validation steps you are currently running on Abel?
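> > >
> > > as a starting point, a batch-job variant of the quick check above might
> > > look roughly like this (job name and limits arbitrary); a real validation
> > > run would of course also need your actual training setup:
> > >
> > > #!/bin/bash
> > > #SBATCH -J onmttest
> > > #SBATCH -t 0:05:00
> > > #SBATCH -p gputest
> > > #SBATCH -N 1
> > > #SBATCH --gres=gpu:k80:1
> > > #SBATCH --mem=1g
> > >
> > > module purge
> > > module use -a /proj/nlpl/software/modulefiles/
> > > module load nlpl-opennmt-py
> > > srun python -c "import torch; import onmt; print(torch.cuda.is_available())"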
> > >
> > > with a sense of accomplishment :-), oe
>