[NLPL Task Force (A)] Trouble with using GPU in Cupy package

Stephan Oepen oe at ifi.uio.no
Thu May 9 16:17:26 UTC 2019


hi andrew,

it appears your local CuPy installation is unable to find its
external dependencies (the CUDA libraries).  have you 'module load'ed
the right CUDA version?  are you sure the job actually ran on a gpu node?

these things can be tricky to sort out, what with the many different
(and mutually incompatible) module versions available on a large and
old system like Abel.

CuPy looks like a relevant tool for the NLPL software inventory, so i
installed it as an NLPL module.  the following (when running on a gpu
node) appears to work:

[oe at compute-19-1 ~]$ module purge; module load nlpl-cupy
[oe at compute-19-1 ~]$ module list
Currently Loaded Modulefiles:
  1) intel/2019.0          4) gcc/4.9.2              7) nlpl-cython/0.29.3/3.7
  2) openssl.intel/1_1_1   5) cuda/9.0               8) nlpl-scipy/201901/3.7
  3) python3/3.7.0         6) nlpl-numpy/1.16.0/3.7  9) nlpl-cupy/5.4.0/3.7
[oe at compute-19-1 ~]$ python3 -c "import cupy; print(cupy.__version__);"
5.4.0
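
for a check that goes one step beyond importing the package, something
like the following can be used (a hypothetical diagnostic sketch, not
part of the NLPL module): it reports whether CuPy imports at all, and
whether a trivial GPU computation actually succeeds.

```python
# minimal CuPy sanity check (hypothetical diagnostic):
# distinguishes "package missing" from "package present but GPU unusable".
def check_cupy():
    try:
        import cupy
    except ImportError as exc:
        return "cupy not importable: %s" % exc
    try:
        x = cupy.arange(4)    # allocates a small array on the GPU
        total = int(x.sum())  # forces a kernel launch and a copy back to host
    except Exception as exc:  # e.g. CUDA libraries missing, no GPU visible
        return "cupy %s imports, but GPU use failed: %s" % (cupy.__version__, exc)
    return "cupy %s OK (sum of arange(4) = %d)" % (cupy.__version__, total)

if __name__ == "__main__":
    print(check_cupy())
```

whichever branch fires tells you whether the problem is the virtualenv
(import fails) or the CUDA environment (import works, GPU use fails).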

in general, i would suggest testing things interactively first, before
you invest the time in putting a job in the queue.  these past few
days, there appear to be fairly long wait times for gpu nodes on
Abel (we are really looking forward to transitioning to the new system
after the summer).  but in principle, one can create an interactive
session on a gpu node as follows:

qlogin --account=nn9447k --time=00:30:00 --mem-per-cpu=2048M \
  --partition=accel --gres=gpu:1
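
once the interactive session starts, it can be worth confirming that a
GPU is actually visible before running anything (a hypothetical quick
check; it assumes `nvidia-smi` is on the path of Abel's gpu nodes):

```shell
# report the visible GPU(s), or say clearly that none can be seen
gpu_visible() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
  else
    echo "nvidia-smi not found: this is probably not a gpu node"
  fi
}
gpu_visible
```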

could you please check whether the new NLPL version of CuPy works for
you (and make sure there are no unwanted interactions with your local
virtualenv)?

best wishes, oe


On Mon, May 6, 2019 at 2:27 PM Andrew Dyer
<Andrew.Dyer.6854 at student.uu.se> wrote:
>
> Hi,
>
> Apologies for the bother.  I'm currently trying to run an experiment using GPU nodes in Abel.  The Python program that I am using uses Cupy, which I have installed in my venv with pip.  On my sbatch script, I set the GPU request as instructed on the Job Scripts page:
>
> #SBATCH --partition=accel --gres=gpu:1
>
> However, the program that I'm using seems to be having trouble connecting to the CUDA software.  I've checked that the versions match (8.0).  I'm at a loss for what else to do though, so any help you can provide would be appreciated.
>
> For reference, see attached my script and the error message in the slurm output.
>
> Many thanks,
>
> Andrew
