[NLPL Task Force (A)] Trouble with using GPU in Cupy package

Sat May 11 20:59:01 UTC 2019

yes, if you are running some software that requires CUDA, typically
you need to 'module load' the right version of CUDA first.  but some
modules autoload their dependencies, including the CuPy installation i
made.  it appears you are missing the initial 'module use' command, to
actually activate the NLPL collection of modules.  please see:

http://wiki.nlpl.eu/index.php/Infrastructure/software/catalogue

so, to try out 'my' CuPy installation, you would have to

$ module use -a /proj*/nlpl/software/modulefiles/
$ module purge; module load nlpl-cupy
$ type python3
python3 is /projects/nlpl/software/cupy/5.4.0/bin/3.7/python3
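
once the module is loaded, a quick check along these lines may help
confirm that the python3 above can import CuPy and actually reach a
GPU.  this is only a minimal sketch: check_cupy is a made-up helper
name, and the only CuPy calls assumed are cupy.__version__,
cupy.cuda.runtime.getDeviceCount(), and a small cupy.arange() sum.

```python
def check_cupy():
    """Report whether CuPy imports and can see at least one GPU."""
    status = {"imported": False, "version": None, "gpus": 0, "sum_ok": False}
    try:
        import cupy
    except ImportError:
        # CuPy itself (or one of the CUDA libraries it links to) is missing
        return status
    status["imported"] = True
    status["version"] = cupy.__version__
    try:
        status["gpus"] = cupy.cuda.runtime.getDeviceCount()
        # tiny computation on the device: 0 + 1 + ... + 9 = 45
        status["sum_ok"] = int(cupy.arange(10).sum()) == 45
    except Exception as error:
        # typically a CUDA runtime error, e.g. on a node without a GPU
        status["error"] = repr(error)
    return status


if __name__ == "__main__":
    print(check_cupy())
```

on a node without a GPU one would expect either 'imported' to stay
False or an 'error' entry to show up; on a gpu node, 'sum_ok' should
come out True.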

but maybe your private installation actually works fine now, if you
pre-load the right CUDA version?  even if so, i would of course also
be curious to know whether the NLPL installation is functional, in
case others might want to use it ...
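
for what it is worth, the ValueError at the end of your slurm output
does not look CUDA-related: np.fromstring() parsed 125 values for a
row that was allocated with 300 columns, which usually points at a
truncated or malformed line in the embeddings file.  a minimal sketch
for locating such rows, assuming the usual word2vec text format (a
'<count> <dim>' header, then one 'word v1 v2 ...' row per line);
find_bad_rows is a made-up helper, not part of vecmap:

```python
import io


def find_bad_rows(lines, limit=10):
    """Return (line_number, word, n_values) for rows that do not match the header."""
    rows = iter(lines)
    header = next(rows).split()
    dim = int(header[1])  # assumption: header is "<count> <dim>"
    bad = []
    for number, line in enumerate(rows, start=2):
        parts = line.rstrip("\n").split(" ", 1)
        word = parts[0]
        values = parts[1].split() if len(parts) > 1 else []
        if len(values) != dim:
            bad.append((number, word, len(values)))
            if len(bad) >= limit:
                break  # a handful of offending rows is enough to diagnose
    return bad


if __name__ == "__main__":
    # toy stand-in for an embeddings file; the second row is one value short
    sample = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5\n")
    print(find_bad_rows(sample))
```

the function takes any iterable of lines, so an open file handle on
the real target embeddings can be passed in place of the toy sample.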

cheers, oe

On Sat, May 11, 2019 at 10:26 PM Andrew Dyer
<Andrew.Dyer.6854 at student.uu.se> wrote:
>
> Hi Stephan,
>
> Thanks for this.
>
> I hadn't module loaded CUDA in my scripts.  I'm going back and looking through some of the instructions now.  So I guess when using CUDA, it is required to module load cuda/[version]?  I've done as follows:
>
> module load cuda/8.0
> module load nlpl-cupy
>
> When I try module load nlpl-cupy I get the following message:
>
> ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for 'nlpl-cupy'
>
>
> When I use module -h avail I also don't see nlpl-cupy there.
>
> However, despite that error message it seems (going by my slurm output) that the problem is no longer CuPy failing to connect to CUDA:
>
> -bash-4.1$ cat slurm-26960146.out
> ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for 'nlpl-cupy'
> Traceback (most recent call last):
>   File "/usit/abel/u1/andidyer/vecmap/map_embeddings.py", line 422, in <module>
>     main()
>   File "/usit/abel/u1/andidyer/vecmap/map_embeddings.py", line 148, in main
>     trg_words, z = embeddings.read(trgfile, dtype=dtype)
>   File "/cluster/home/andidyer/vecmap/embeddings.py", line 35, in read
>     matrix[i] = np.fromstring(vec, sep=' ', dtype=dtype)
> ValueError: could not broadcast input array from shape (125) into shape (300)
>
>
> So something is obviously going right!
>
> Again, many thanks for your assistance.
>
> Best wishes,
>
> Andrew
>
> ________________________________
> From: Stephan Oepen <oe at ifi.uio.no>
> Sent: 09 May 2019 18:17
> To: Andrew Dyer
> Cc: infrastructure at nlpl.eu
> Subject: Re: [NLPL Task Force (A)] Trouble with using GPU in Cupy package
>
> hi andrew,
>
> it appears your local cupy installation is somewhat unable to find its
> external dependencies.  have you 'module load'ed the right CUDA
> version?  are you sure it ended up running on a gpu node?
>
> these things can be tricky to sort out, what with the many different
> (and mutually incompatible) module versions available on a large and
> old system like Abel.
>
> CuPy looks like a relevant tool for the NLPL software inventory, so i
> installed it as an NLPL module.  the following (when running on a gpu
> node) appears to work:
>
> [oe at compute-19-1 ~]$ module purge; module load nlpl-cupy
> [oe at compute-19-1 ~]$ module list
> Currently Loaded Modulefiles:
>   1) intel/2019.0             4) gcc/4.9.2                7) nlpl-cython/0.29.3/3.7
>   2) openssl.intel/1_1_1      5) cuda/9.0                 8) nlpl-scipy/201901/3.7
>   3) python3/3.7.0            6) nlpl-numpy/1.16.0/3.7    9) nlpl-cupy/5.4.0/3.7
> [oe at compute-19-1 ~]$ python3 -c "import cupy; print(cupy.__version__);"
> 5.4.0
>
> in general, i would suggest testing things interactively first, before
> you invest the time in putting a job in the queue.  these past few
> days, it appears there can be fairly long wait times for gpu nodes on
> Abel (we are really looking forward to transitioning to the new system
> after the summer).  but in principle, one can create an interactive
> session on a gpu node as follows:
>
> qlogin --account=nn9447k --time=00:30:00 --mem-per-cpu=2048M --partition=accel --gres=gpu:1
>
> please see whether the new NLPL version of CuPy works for you (but
> please make sure there are no unwanted interactions with your local
> virtualenv)?
>
> best wishes, oe
>
>
> On Mon, May 6, 2019 at 2:27 PM Andrew Dyer
> <Andrew.Dyer.6854 at student.uu.se> wrote:
> >
> > Hi,
> >
> > Apologies for the bother.  I'm currently trying to run an experiment using GPU nodes on Abel.  The Python program that I am using relies on CuPy, which I have installed in my venv with pip.  In my sbatch script, I set the GPU request as instructed on the Job Scripts page:
> >
> > #SBATCH --partition=accel --gres=gpu:1
> >
> > However, the program that I'm using seems to be having trouble connecting to the CUDA software.  I've checked that the versions match (8.0).  I'm at a loss for what else to do though, so any help you can provide would be appreciated.
> >
> > For reference, see attached my script and the error message in the slurm output.
> >
> > Many thanks,
> >
> > Andrew