[NLPL Task Force (A)] Tensorflow issues, pt. 2
Vinit Ravishankar
vinitr at ifi.uio.no
Tue Oct 29 10:27:56 UTC 2019
It turns out there is an issue with the NLPL Horovod module: although Horovod appears to be installed, I cannot use the command-line utility ‘horovodrun’, which the code I’m trying to run uses as its launcher. Is there an easy fix for this?
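One way to narrow this down is to check, from inside the loaded module environment, whether the Python package is importable and whether the launcher executable is visible on PATH. The sketch below uses only the standard library; `check_launcher` is a hypothetical helper name, and the idea that the launcher is installed but simply missing from PATH is an assumption, not something confirmed for this module.

```python
import importlib.util
import shutil
import sys


def check_launcher(command="horovodrun", package="horovod"):
    """Report whether `package` is importable and `command` is on PATH.

    Hypothetical helper for illustration; it only inspects the current
    environment and does not import or initialise Horovod itself.
    """
    # find_spec returns None when the top-level package cannot be found.
    spec = importlib.util.find_spec(package)
    return {
        "python": sys.executable,
        "package_location": spec.origin if spec is not None else None,
        # shutil.which returns None when the command is not on PATH.
        "command_path": shutil.which(command),
    }


if __name__ == "__main__":
    for key, value in check_launcher().items():
        print(f"{key}: {value}")
```

If `package_location` is set but `command_path` is None, the package is installed and the launcher script is either absent or in a directory that is not on PATH.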
Cheers
– Vinit
> On 25 Oct 2019, at 15:18, Stephan Oepen <oe at ifi.uio.no> wrote:
>
> back to the original thread ...
>
>> I’ve been having more issues with multi-GPU OpenMPI/TensorFlow. I’m using Horovod as an OpenMPI wrapper to train (someone else’s) model, and it doesn’t work when I try to run it on multiple GPUs; I get this:
>>
>> c7-3:24359:24475 [1] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
>
> now that we have sorted out the right selection of versions (i hope),
> i managed to compile horovod; see:
>
> http://wiki.nlpl.eu/index.php/Infrastructure/software/horovod
>
> vinit, would you be in a position to try this module for me? for this
> testing to be pure, i think you should put aside everything conda and
> aim for a clean environment:
>
> $ module purge; module --ignore-cache load nlpl-tensorflow/1.15.0/3.7 nlpl-horovod/0.18.2/3.7
> $ python3 -c "import horovod.tensorflow as hvd; hvd.init();"
>
> the above seems to work okay for me, after i made the module file work
> around the following issue:
>
> https://github.com/horovod/horovod/issues/1341
>
> i would be happy if you could try actually running something
> (involving multiple gpus :-).
>
> cheers, oe