[NLPL Task Force (A)] Tensorflow issues, pt. 2

Stephan Oepen oe at ifi.uio.no
Fri Oct 25 13:18:19 UTC 2019


back to the original thread ...

> I’ve been having more issues with multi-GPU OpenMPI/TensorFlow. I’m using Horovod as an OpenMPI wrapper to train (someone else’s) model, and it doesn’t work when I try running it multi-GPU: I get this:
>
> c7-3:24359:24475 [1] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'

now that we have sorted out the right selection of versions (i hope),
i managed to compile horovod; see:

http://wiki.nlpl.eu/index.php/Infrastructure/software/horovod

vinit, would you be in a position to try this module for me?  for the
test to be meaningful, i think you should set aside everything conda
and start from a clean environment:

$ module purge; module --ignore-cache load nlpl-tensorflow/1.15.0/3.7 nlpl-horovod/0.18.2/3.7
$ python3 -c "import horovod.tensorflow as hvd; hvd.init();"

the above seems to work okay for me, after i made the module file work
around the following issue:

https://github.com/horovod/horovod/issues/1341

i would be happy if you could try actually running something
(involving multiple gpus :-).
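
for a quick multi-gpu check, something along the following lines might
do.  this is only a sketch, assuming the TensorFlow 1.x session API (via
tf.compat.v1) and the horovod.tensorflow calls hvd.init(),
hvd.local_rank(), hvd.rank(), hvd.size(), and hvd.allreduce(); the file
name allreduce_check.py and the function name allreduce_check are made
up for illustration.  each process pins itself to one gpu and sums the
constant 1.0 across all processes, so every rank should report a total
equal to the number of processes.

```python
def allreduce_check():
    """Sum 1.0 over all Horovod processes; return (rank, size, total) or None."""
    try:
        import tensorflow as tf
        import horovod.tensorflow as hvd
    except ImportError as error:
        # report a broken environment instead of crashing with a traceback
        print("cannot import tensorflow or horovod:", error)
        return None

    hvd.init()

    # pin this process to the gpu matching its local rank (one process per gpu)
    config = tf.compat.v1.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())

    # average=False requests a plain sum rather than a mean
    total_op = hvd.allreduce(tf.constant(1.0), average=False)
    with tf.compat.v1.Session(config=config) as session:
        total = float(session.run(total_op))
    return hvd.rank(), hvd.size(), total


if __name__ == "__main__":
    result = allreduce_check()
    if result is not None:
        rank, size, total = result
        print("rank %d of %d: allreduce total = %.1f" % (rank, size, total))
```

launched on two gpus, e.g. with `horovodrun -np 2 python3
allreduce_check.py` (the right launcher for our cluster is an
assumption here; an equivalent mpirun call may be needed instead), each
of the two ranks should report a total of 2.0.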

cheers, oe



