[NLPL Task Force (A)] Tensorflow issues, pt. 2
Vinit Ravishankar
vinitr at ifi.uio.no
Fri Oct 25 18:56:13 UTC 2019
Not CCing metacentre here because it probably isn’t super relevant to them, but I ran into an issue with the module approach:
$ pip3 install --user -r requirements.txt
ERROR: Can not perform a '--user' install. User site-packages are not visible in this virtualenv.
I definitely remember seeing something similar pop up on Abel at some point, but unfortunately cannot seem to recall how I fixed it back then.
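
The error text suggests that pip believes it is running inside a virtualenv, where '--user' installs are refused. A minimal sketch of how one might check for that and drop the flag accordingly follows. The helper names in_virtualenv and pip_install_command are hypothetical, and whether installing without '--user' is permitted for the shared module environment is an assumption that has not been verified here.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv or virtualenv."""
    # Older virtualenv releases set sys.real_prefix; venv (and newer
    # virtualenv releases) make sys.base_prefix differ from sys.prefix.
    if hasattr(sys, "real_prefix"):
        return True
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


def pip_install_command(requirements, in_venv=None):
    """Build a pip command, adding --user only outside a virtual environment."""
    if in_venv is None:
        in_venv = in_virtualenv()
    command = [sys.executable, "-m", "pip", "install"]
    if not in_venv:
        # Outside a virtualenv, --user targets the per-user site-packages.
        command.append("--user")
    command += ["-r", requirements]
    return command


if __name__ == "__main__":
    # Print the command rather than running it, so it can be inspected first.
    print(" ".join(pip_install_command("requirements.txt")))
```

Inside a virtualenv the printed command has no '--user', so the packages would go into the virtualenv itself, which only works if that location is writable.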
– Vinit
> On 25 Oct 2019, at 15:18, Stephan Oepen <oe at ifi.uio.no> wrote:
>
> back to the original thread ...
>
>> I’ve been having more issues with multi-GPU OpenMPI/TensorFlow. I’m using Horovod as an OpenMPI wrapper to train (someone else’s) model, and it doesn’t work when I try running it multi-GPU: I get this:
>>
>> c7-3:24359:24475 [1] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
>
> now that we have sorted out the right selection of versions (i hope),
> i managed to compile horovod; see:
>
> http://wiki.nlpl.eu/index.php/Infrastructure/software/horovod
>
> vinit, would you be in a position to try this module for me? for this
> testing to be pure, i think you should put aside everything conda and
> aim for a clean environment:
>
> $ module purge; module --ignore-cache load nlpl-tensorflow/1.15.0/3.7 nlpl-horovod/0.18.2/3.7
> $ python3 -c "import horovod.tensorflow as hvd; hvd.init();"
>
> the above seems to work okay for me, after i made the module file work
> around the following issue:
>
> https://github.com/horovod/horovod/issues/1341
>
> i would be happy if you could try actually running something
> (involving multiple gpus :-).
>
> cheers, oe
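
For the multi-GPU test requested above, a small smoke test could go one step beyond hvd.init() and run an actual allreduce across workers. The following is only a sketch, assuming the TensorFlow 1.x session API and the hvd.allreduce(..., average=False) form from the Horovod 0.18 series; the function names run_smoke_test and expected_rank_sum are hypothetical, and it has not been run against the modules mentioned in this thread.

```python
def expected_rank_sum(size):
    """Sum of the ranks 0..size-1, i.e. what the allreduce should return."""
    return size * (size - 1) // 2


def run_smoke_test():
    """Pin each worker to one GPU and sum the worker ranks with an allreduce."""
    # Imported here so that the helper above can be used without the modules.
    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()

    # One GPU per process, selected by the rank local to this node.
    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())

    # Each worker contributes its own rank; average=False requests a sum.
    rank_tensor = tf.constant(float(hvd.rank()))
    total = hvd.allreduce(rank_tensor, average=False)

    with tf.Session(config=config) as sess:
        result = sess.run(total)

    expected = expected_rank_sum(hvd.size())
    status = "ok" if abs(result - expected) < 1e-6 else "MISMATCH"
    print("rank %d of %d: allreduce sum %.1f, expected %d: %s"
          % (hvd.rank(), hvd.size(), result, expected, status))
    return status == "ok"
```

Saved as a file (say smoke.py, a hypothetical name), this could be launched with one process per GPU, for example with mpirun -np 2 python3 -c "import smoke; smoke.run_smoke_test()". If the NCCL 'invalid device function' failure is still present, the expectation is that it would surface during the allreduce rather than at hvd.init().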