[NLPL Task Force (A)] Tensorflow issues, pt. 2

Vinit Ravishankar vinitr at ifi.uio.no
Tue Oct 29 10:27:56 UTC 2019


It turns out that one issue with the NLPL Horovod module is that, although Horovod appears to have been installed, I cannot use the command-line utility ‘horovodrun’, which the code I’m trying to run uses as its launcher. Is there an easy fix for this?
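
In case it helps to pin down the problem, here is a minimal sketch of a launcher fallback. It relies on the Horovod documentation's description of ‘horovodrun -np N ...’ as equivalent to a plain Open MPI ‘mpirun’ call with a fixed set of flags. The function name build_launch_command and the script name train.py are hypothetical. The sketch also assumes that the loaded modules put ‘mpirun’ on the PATH, which I have not verified for the NLPL setup.

```python
import shutil
from typing import Callable, List, Optional


def build_launch_command(
    num_procs: int,
    script_args: List[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return an argv list that starts `script_args` on `num_procs` processes.

    Prefers `horovodrun` when it is on PATH. Otherwise falls back to the
    Open MPI command that the Horovod docs list as its equivalent.
    `which` is injectable so the lookup can be faked in tests.
    """
    if num_procs < 1:
        raise ValueError("num_procs must be at least 1")

    if which("horovodrun"):
        return ["horovodrun", "-np", str(num_procs)] + list(script_args)

    if which("mpirun"):
        # Flags as given in the Horovod "Run with Open MPI" documentation;
        # whether they suit this cluster is an assumption.
        return [
            "mpirun", "-np", str(num_procs),
            "-bind-to", "none", "-map-by", "slot",
            "-x", "NCCL_DEBUG=INFO", "-x", "LD_LIBRARY_PATH", "-x", "PATH",
            "-mca", "pml", "ob1", "-mca", "btl", "^openib",
        ] + list(script_args)

    raise FileNotFoundError("neither horovodrun nor mpirun found on PATH")
```

Calling build_launch_command(2, ["python3", "train.py"]) and passing the result to subprocess.run would then start the job with whichever launcher is present. This only sidesteps the missing entry point; it does not explain why the module lacks it.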

Cheers

– Vinit

> On 25 Oct 2019, at 15:18, Stephan Oepen <oe at ifi.uio.no> wrote:
> 
> back to the original thread ...
> 
>> I’ve been having more issues with multi-GPU OpenMPI/TensorFlow. I’m using Horovod as an OpenMPI wrapper to train (someone else’s) model, and it fails when I try running it on multiple GPUs, with this warning:
>> 
>> c7-3:24359:24475 [1] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
> 
> now that we have sorted out the right selection of versions (i hope),
> i managed to compile horovod; see:
> 
> http://wiki.nlpl.eu/index.php/Infrastructure/software/horovod
> 
> vinit, would you be in a position to try this module for me?  for this
> testing to be pure, i think you should put aside everything conda and
> aim for a clean environment:
> 
> $ module purge; module --ignore-cache load nlpl-tensorflow/1.15.0/3.7 nlpl-horovod/0.18.2/3.7
> $ python3 -c "import horovod.tensorflow as hvd; hvd.init();"
> 
> the above seems to work okay for me, after i made the module file work
> around the following issue:
> 
> https://github.com/horovod/horovod/issues/1341
> 
> i would be happy if you could try actually running something
> (involving multiple gpus :-).
> 
> cheers, oe
