[NLPL Task Force (A)] [uninett.no #196965] Tensorflow issues, pt. 2

Stephan Oepen oe at ifi.uio.no
Fri Oct 25 10:18:07 UTC 2019


hi all,

this problem seems like exactly the kind of challenge that would be
good to understand once and for all, and then make available as a
collection of modules (and usage recipe) for other users, probably
within and beyond the NLPL community.  even though this has been a
busy week for me, i am following closely (for the NLPL infrastructure
task force); thanks for keeping 'infrastructure at nlpl.eu' copied on
this thread!

> I gather that CUDA 10.0 isn’t actually available (it’s just 9 or 10.1)? Is it possible to (somehow) build TF with CUDA 10? As far as I’m aware, there’s no TF version that really supports 10.1 yet.

i had asked for CUDA 10.0 to be installed last week, so a few more of
your prerequisites should be available:

module load CUDA/10.0.130 cuDNN/7.4.2.24-CUDA-10.0.130

i am not sure that NCCL actually is dependent on a specific CUDA
version, but as henrik points out its module wrapper does load CUDA
versions that you probably do not want.  i wonder whether it might
work to do 'surgical' replacement of modules:

$ module purge; module load NCCL/2.4.8-gcccuda-2019a
$ module unload CUDA/10.1.105-GCC-8.2.0-2.31.1
$ module load CUDA/10.0.130 cuDNN/7.4.2.24-CUDA-10.0.130

regarding TensorFlow, is there a specific reason you need a slightly
older version (1.13.1)?  if not, you could try on top of the above (as
an alternative to TensorFlow/1.13.1-fosscuda-2018b-Python-3.6.6):

$ module load nlpl-tensorflow/1.15.0/3.7

the resulting environment at least looks plausible to me:

$ module list

Currently Loaded Modules:
  1) StdEnv                        (S)  11) ncurses/6.1-GCCcore-8.2.0     (H)
  2) GCCcore/8.2.0                      12) libreadline/8.0-GCCcore-8.2.0 (H)
  3) zlib/1.2.11-GCCcore-8.2.0     (H)  13) XZ/5.2.4-GCCcore-8.2.0        (H)
  4) binutils/2.31.1-GCCcore-8.2.0 (H)  14) GMP/6.1.2-GCCcore-8.2.0       (H)
  5) GCC/8.2.0-2.31.1                   15) libffi/3.2.1-GCCcore-8.2.0    (H)
  6) gcccuda/2019a                      16) Python/3.7.2-GCCcore-8.2.0
  7) NCCL/2.4.8-gcccuda-2019a           17) nlpl-python-candy/201910/3.7
  8) CUDA/10.0.130                      18) nlpl-numpy/1.17.3/3.7
  9) cuDNN/7.4.2.24-CUDA-10.0.130       19) nlpl-scipy/201910/3.7
 10) bzip2/1.0.6-GCCcore-8.2.0     (H)  20) nlpl-tensorflow/1.15.0/3.7
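
as a quick sanity check (an untested suggestion on my part), once you
are on a GPU node with these modules loaded, you could ask tensorflow
itself whether it was built with CUDA and actually sees a device:

```shell
# untested sketch: with the modules above loaded, on a GPU node, query
# tensorflow 1.15 about its CUDA build and visible GPUs
python -c 'import tensorflow as tf; print(tf.test.is_built_with_cuda())'
python -c 'import tensorflow as tf; print(tf.test.is_gpu_available())'
```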

next, i am a bit unsure which OpenMPI to throw into the mix.
OpenMPI/3.1.1-gcccuda-2018b seems to be the only available version
built on top of gcccuda, but this is a mildly outdated version.
OpenMPI/3.1.3-GCC-8.2.0-2.31.1 also looks tempting (because it matches
the GCC lineage behind our current Python 3.7 on Saga), but the
Horovod installation notes recommend against OpenMPI 3.1.3.  if i were
you, i would maybe try both, but in each case you will need to make
sure to reset your module environment properly ('module purge') and
pick all the right versions.
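
spelled out as one recipe, the full sequence for one of the two
candidates could look something like the following (untested; the
OpenMPI/3.1.1-gcccuda-2018b alternative would be set up the same way,
swapping the OpenMPI line):

```shell
# untested sketch of a clean environment built from scratch; start from
# a purged module state so no stale CUDA version lingers
module purge
module load NCCL/2.4.8-gcccuda-2019a
# NCCL's wrapper pulls in CUDA 10.1; swap in 10.0 and its matching cuDNN
module unload CUDA/10.1.105-GCC-8.2.0-2.31.1
module load CUDA/10.0.130 cuDNN/7.4.2.24-CUDA-10.0.130
# one of the two OpenMPI candidates discussed above
module load OpenMPI/3.1.3-GCC-8.2.0-2.31.1
module load nlpl-tensorflow/1.15.0/3.7
# inspect the result before building anything on top of it
module list
```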

to compile Horovod, i suspect you may need to point it to NCCL in a
non-standard location:

$ HOROVOD_NCCL_HOME=/cluster/software/NCCL/2.4.8-gcccuda-2019a ...
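
spelled out a bit more, the build command could then look something
like this (untested; the HOROVOD_* variables are standard horovod
build options, but the exact path is my guess from saga's layout):

```shell
# untested sketch of a horovod build against the non-standard NCCL
# location; HOROVOD_GPU_ALLREDUCE=NCCL requests the NCCL allreduce path
HOROVOD_NCCL_HOME=/cluster/software/NCCL/2.4.8-gcccuda-2019a \
HOROVOD_GPU_ALLREDUCE=NCCL \
HOROVOD_WITH_TENSORFLOW=1 \
  pip install --user --no-cache-dir horovod
```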

hth; keep up the good work!  oe

