[NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)

Mon Dec 17 22:31:43 UTC 2018

just fyi: because simply re-installing the same version appeared to
resolve the PyTorch problem a few weeks ago, i have now run the
following (in my OpenNMT-py source directory):

python3 -m pip install -U $(python3 -m pip list | tail -n +3 | awk '{print $1}')
python setup.py install
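
for reference, a rough python sketch of what the shell pipeline above
selects for upgrading: it drops the first two lines of the `pip list`
output (as `tail -n +3` does) and keeps the first column (as
`awk '{print $1}'` does).  this assumes the columns format of `pip list`
with its two header lines; the function name `package_names` and the
sample listing are made up for illustration.

```python
def package_names(pip_list_output):
    """Return the first column of `pip list` output, skipping two header lines."""
    names = []
    # `tail -n +3` starts printing at the third line, which drops the
    # "Package Version" header and the dashed separator beneath it.
    for line in pip_list_output.splitlines()[2:]:
        fields = line.split()
        if fields:  # ignore blank lines instead of passing empty names on
            names.append(fields[0])
    return names


# made-up sample in the columns format (not an actual listing from Taito)
sample = """Package    Version
---------- -------
numpy      1.15.4
torch      0.4.1
torchtext  0.3.1
"""

print(package_names(sample))  # ['numpy', 'torch', 'torchtext']
```

if `pip list` printed no header lines (as the older legacy format did),
skipping two lines would silently drop two real packages, so the output
format is worth a quick look before relying on the one-liner.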

thinking (possibly over-)optimistically, maybe the problem has
magically disappeared already?
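
whether the problem is really gone can also be checked rather than hoped
for.  the error quoted further down (cuda runtime error 35, i.e.
cudaErrorInsufficientDriver) says that the installed driver supports an
older CUDA version than the one the runtime was built against.  a minimal
sketch of that comparison, assuming both versions are available as the
integer codes that the CUDA runtime API calls cudaDriverGetVersion and
cudaRuntimeGetVersion report (1000 * major + 10 * minor); the function
names and the example values are hypothetical:

```python
def decode_cuda_version(code):
    """Split a CUDA version code (1000 * major + 10 * minor) into (major, minor)."""
    return code // 1000, (code % 1000) // 10


def driver_is_sufficient(driver_code, runtime_code):
    """True if the driver supports at least the CUDA version of the runtime."""
    return decode_cuda_version(driver_code) >= decode_cuda_version(runtime_code)


# hypothetical values for illustration only, not measured on Taito or Abel
print(decode_cuda_version(9020))         # (9, 2)
print(driver_is_sufficient(9000, 9020))  # False: the situation error 35 reports
print(driver_is_sufficient(9020, 9000))  # True
```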

cheers, oe

On Mon, Dec 17, 2018 at 11:12 PM Stephan Oepen <oe at ifi.uio.no> wrote:
>
> hi yves,
>
> i am sorry i had lost track of this pending problem for the past week
> or two!  we were hosting TLT here in oslo this past week (and both my
> kids have birthdays in december :-), so NLPL just had to sit on the
> back-burner for a little while ...
>
> i am currently trying to reproduce the problem, and it appears that
> just ‘import onmt’ is not enough to get to the point of failure,
> right?
>
> could you make the complete data directory group- or world-readable,
> so i can try running the ‘train.py’ script without creating my own
> copy of the data?
>
> best, oe
>
> On Mon, Dec 17, 2018 at 2:36 PM Scherrer, Yves
> <yves.scherrer at helsinki.fi> wrote:
> >
> > Hi Martin,
> >
> >
> >
> > Yes, the problem still occurs. Please have a look at the train.sh SLURM script in /wrk/yvessche/onmt_test3 – this script worked fine when Stephan first installed OpenNMT, but has been failing in the last couple of weeks.
> >
> >
> >
> > Best,
> >
> > Yves
> >
> >
> >
> > ________________________________
> > From: Martin Matthiesen <martin.matthiesen at csc.fi>
> > Sent: Monday, December 17, 2018 1:36:32 PM
> > To: Scherrer, Yves
> > Cc: Stephan Oepen; infrastructure
> > Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
> >
> > Hi Yves,
> >
> > I am sorry for the long silence. I wanted to ask you: are your problems still current? If so, could you send me a hint on how to reproduce them?
> >
> > Regards,
> > Martin
> >
> > --
> > Martin Matthiesen
> > CSC - Tieteen tietotekniikan keskus
> > CSC - IT Center for Science
> > PL 405, 02101 Espoo, Finland
> > +358 9 457 2376, martin.matthiesen at csc.fi
> > Public key : https://pgp.mit.edu/pks/lookup?op=get&search=0x74B12876FD890704
> > Fingerprint: AA25 6F56 5C9A 8B42 009F  BA70 74B1 2876 FD89 0704
> >
> > ________________________________
> >
> > From: "Yves Scherrer" <yves.scherrer at helsinki.fi>
> > To: "Stephan Oepen" <oe at ifi.uio.no>
> > Cc: "Martin Matthiesen" <martin.matthiesen at csc.fi>, "infrastructure" <infrastructure at nlpl.eu>
> > Sent: Wednesday, 28 November, 2018 16:55:27
> > Subject: RE: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
> >
> > Curiously enough, when I run the OpenNMT training script that worked fine a month ago, I get this error now:
> >
> >
> >
> > THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=32 error=35 : CUDA driver version is insufficient for CUDA runtime version
> > Traceback (most recent call last):
> >   File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 40, in <module>
> >     main(opt)
> >   File "/proj/nlpl/software/opennmt-py/0.2.1/scripts/train.py", line 27, in main
> >     single_main(opt)
> >   File "/wrk/project_nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 73, in main
> >     opt = training_opt_postprocessing(opt)
> >   File "/wrk/project_nlpl/software/opennmt-py/0.2.1/lib/python3.5/site-packages/OpenNMT_py-0.2.1-py3.5.egg/onmt/train_single.py", line 60, in training_opt_postprocessing
> >     torch.cuda.set_device(opt.device_id)
> >   File "/proj/nlpl/software/pytorch/0.4.1/lib/python3.5/site-packages/torch/cuda/__init__.py", line 262, in set_device
> >     torch._C._cuda_setDevice(device)
> > RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at torch/csrc/cuda/Module.cpp:32
> >
> >
> >
> > I have no idea if this is related to the PyTorch issue, but could it be that some CUDA code got updated on Taito in the meantime?
> >
> >
> >
> > Yves
> >
> >
> >
> > ________________________________
> > From: Stephan Oepen <oe at ifi.uio.no>
> > Sent: Wednesday, November 28, 2018 4:43:47 PM
> > To: Scherrer, Yves
> > Cc: Martin Matthiesen; infrastructure
> > Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
> >
> > well, then it should not be too hard to get the PyTorch installation
> > on Taito to work on the gpu nodes :-).
> >
> > i will have a look now ...
> >
> > oe
> >
> > On Wed, Nov 28, 2018 at 3:38 PM Scherrer, Yves
> > <yves.scherrer at helsinki.fi> wrote:
> > >
> > > Hi,
> > >
> > >
> > >
> > > I did my OpenNMT-py experiments on both Abel and Taito.
> > >
> > > On Taito, I got training speeds of about 13000 tokens/s, on Abel it was about 4000 tokens/s.
> > >
> > > A colleague who used an independent OpenNMT-py module on Taito-GPU during the summer obtained about 9000 tokens/s with a different dataset.
> > >
> > > I also just started a CPU-only training run on Taito, which got around 1000 tokens/s.
> > >
> > > This leads me to believe that my experiments – at least those on Taito – did use the GPU…
> > >
> > >
> > >
> > > Best,
> > >
> > > Yves
> > >
> > >
> > >
> > > ________________________________
> > > From: Stephan Oepen <oe at ifi.uio.no>
> > > Sent: Wednesday, November 28, 2018 4:08:46 PM
> > > To: Scherrer, Yves
> > > Cc: Martin Matthiesen; infrastructure
> > > Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
> > >
> > > as for the OpenNMT-py experiments, did you do those on Abel or Taito,
> > > or both?  using gpus on Taito?  in other words, do you believe that
> > > OpenNMT-py (in contrast to PyTorch) works on Taito gpu nodes?
> > >
> > > oe
> > >
> > > On Wed, Nov 28, 2018 at 2:47 PM Scherrer, Yves
> > > <yves.scherrer at helsinki.fi> wrote:
> > > >
> > > > Hi,
> > > >
> > > >
> > > >
> > > > I’m following up on this one with a related issue. I am testing PyTorch independently of OpenNMT-py, but cannot get it to run on (Taito-)GPU.
> > > >
> > > >
> > > >
> > > > Specifically, although I am logged in to Taito-GPU, I cannot get the test script described on the wiki page to print True:
> > > >
> > > >
> > > >
> > > > [GPU-Env lstmtagger]$ srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t 15 --pty python3 /proj/nlpl/software/pytorch/0.4.1/test.py
> > > > srun: job 32089470 queued and waiting for resources
> > > > srun: job 32089470 has been allocated resources
> > > > False
> > > >
> > > >
> > > >
> > > > I also get ‘False’ when running the following script through sbatch:
> > > >
> > > >
> > > >
> > > > #SBATCH -J cudatest
> > > > #SBATCH -o cudatest.%j.out
> > > > #SBATCH -e cudatest.%j.err
> > > > #SBATCH -t 0:05:00
> > > > #SBATCH -p gputest
> > > > #SBATCH -N 1
> > > > #SBATCH --gres=gpu:k80:1
> > > > #SBATCH --mem=1g
> > > >
> > > > module use -a /proj/nlpl/software/modulefiles/
> > > > module load nlpl-pytorch
> > > >
> > > > srun python3 /proj/nlpl/software/pytorch/0.4.1/test.py
> > > >
> > > >
> > > >
> > > > Has there been any change lately? Or am I missing something obvious?
> > > >
> > > >
> > > >
> > > > Best,
> > > >
> > > > Yves
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > ________________________________
> > > > From: Stephan Oepen <oe at ifi.uio.no>
> > > > Sent: Wednesday, September 26, 2018 11:10:12 PM
> > > > To: Scherrer, Yves
> > > > Cc: Martin Matthiesen; infrastructure
> > > > Subject: Re: [NLPL Task Force (A)] OpenNMT installation for NLPL (on Abel)
> > > >
> > > > hi again,
> > > >
> > > > > > i actually had a go at my own glibc and PyTorch installations on Taito, but
> > > > > > so far gpu support is elusive.
> > > >
> > > > actually, with a little more tinkering, i now believe i might have a
> > > > working installation of PyTorch 0.4.1 and OpenNMT-py 0.2.1 on Taito
> > > > too, seemingly functional on both cpu and gpu nodes:
> > > >
> > > > [oe at taito-login4 ~]$ module purge
> > > > [oe at taito-login4 ~]$ module load nlpl-opennmt-py
> > > > Loading application python-3.5.3 environment with needed modules
> > > > [oe at taito-login4 ~]$ module list
> > > >
> > > > Currently Loaded Modules:
> > > > >   1) gcc/5.4.0   2) intelmpi/5.1.3   3) mkl/11.3.2   4) python/3.5.3
> > > > >   5) python-env/3.5.3   6) nlpl-pytorch/0.4.1   7) nlpl-opennmt-py/0.2.1
> > > >
> > > > [oe at taito-login4 ~]$ type -all python
> > > > python is /proj/nlpl/software/opennmt-py/0.2.1/bin/python
> > > > python is /proj/nlpl/software/pytorch/0.4.1/bin/python
> > > > python is /appl/opt/python/3.5.3-gnu540/bin/python
> > > > python is /usr/bin/python
> > > > > [oe at taito-login4 ~]$ python -c "import torch; import onmt; print(torch.cuda.is_available());"
> > > > False
> > > >
> > > > > [oe at taito-login4 ~]$ srun -n 1 -p gputest --gres=gpu:k80:1 --mem 1G -t 15 --pty \
> > > > >   python -c "import torch; import onmt; print(torch.cuda.is_available());"
> > > > True
> > > >
> > > > > yves (or joerg), i would have a hard time testing things in much more
> > > > > depth.  any chance you would have some time to try and replicate on
> > > > > Taito the validation steps you are currently running on Abel?
> > > >
> > > > with a sense of accomplishment :-), oe
> >
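
the login-node versus gpu-node check shown in the oldest message above can
also be kept as a small script.  a minimal sketch, assuming nothing beyond
the public `torch.cuda.is_available()` call; the function name
`cuda_report` is made up here, and this is not claimed to be the content of
the test.py mentioned in the thread:

```python
import importlib.util


def cuda_report():
    """Report whether torch is importable and, if so, whether it sees a GPU."""
    if importlib.util.find_spec("torch") is None:
        # torch is not on the module path at all (e.g. module not loaded)
        return {"torch_installed": False, "cuda_available": None}
    import torch
    return {"torch_installed": True,
            "cuda_available": torch.cuda.is_available()}


print(cuda_report())
```

run through srun on a gpu partition, a working installation should report
cuda_available as True; on a login node without a gpu, False is expected.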