[NLPL Task Force (A)] Tensorflow issues, pt. 2

Vinit Ravishankar vinitr at ifi.uio.no
Fri Oct 25 18:56:13 UTC 2019


Not CCing metacentre here because it probably isn’t relevant to them, but I ran into an issue with the module method:

$ pip3 install --user -r requirements.txt
ERROR: Can not perform a '--user' install. User site-packages are not visible in this virtualenv.

I definitely remember seeing something similar pop up on Abel at some point, but unfortunately cannot seem to recall how I fixed it back then.
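
For reference, a minimal sketch of how one might check whether the python3 provided by the module is itself running as a virtualenv, which is the condition the pip message complains about. This assumes pip refuses '--user' when the interpreter sits inside a virtualenv that hides the user site-packages; the helper names in_virtualenv and diagnose are made up for illustration.

```python
import site
import sys


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv or a virtualenv."""
    # venv (and recent virtualenv) make sys.prefix differ from sys.base_prefix;
    # older virtualenv releases set sys.real_prefix instead.
    base = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base


def diagnose() -> str:
    """Return a one-line summary of the state relevant to '--user' installs."""
    return (
        f"prefix={sys.prefix} "
        f"in_virtualenv={in_virtualenv()} "
        f"user_site_enabled={site.ENABLE_USER_SITE} "
        f"user_site={site.getusersitepackages()}"
    )


if __name__ == "__main__":
    print(diagnose())
```

If in_virtualenv comes out True, dropping '--user' or pointing pip at a separate directory with '--target' may be worth trying, assuming the chosen location is writable.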

– Vinit

> On 25 Oct 2019, at 15:18, Stephan Oepen <oe at ifi.uio.no> wrote:
> 
> back to the original thread ...
> 
>> I’ve been having more issues with multi-GPU OpenMPI/TensorFlow. I’m using Horovod as an OpenMPI wrapper to train (someone else’s) model, and it fails when I run it on multiple GPUs with this warning:
>> 
>> c7-3:24359:24475 [1] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
> 
> now that we have sorted out the right selection of versions (i hope),
> i managed to compile horovod; see:
> 
> http://wiki.nlpl.eu/index.php/Infrastructure/software/horovod
> 
> vinit, would you be in a position to try this module for me?  for this
> testing to be pure, i think you should put aside everything conda and
> aim for a clean environment:
> 
> $ module purge; module --ignore-cache load nlpl-tensorflow/1.15.0/3.7 nlpl-horovod/0.18.2/3.7
> $ python3 -c "import horovod.tensorflow as hvd; hvd.init();"
> 
> the above seems to work okay for me, after i made the module file work
> around the following issue:
> 
> https://github.com/horovod/horovod/issues/1341
> 
> i would be happy if you could try actually running something
> (involving multiple gpus :-).
> 
> cheers, oe
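
The clean-environment check requested in the quoted message could be sketched roughly as below: confirm that no conda environment is active and that the modules from the loaded stack can be found. The function names conda_active and importable are hypothetical, and treating CONDA_PREFIX or CONDA_DEFAULT_ENV as the sign of an active conda environment is an assumption about how conda was set up.

```python
import importlib.util
import os


def conda_active(env=None) -> bool:
    """Guess whether a conda environment is active from environment variables."""
    # Assumption: an activated conda environment exports one of these variables.
    env = os.environ if env is None else env
    return bool(env.get("CONDA_PREFIX") or env.get("CONDA_DEFAULT_ENV"))


def importable(name: str) -> bool:
    """True if the named module can be located on the current sys.path."""
    try:
        # For dotted names, find_spec imports the parent package first and
        # raises ModuleNotFoundError when that parent is missing.
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


if __name__ == "__main__":
    print("conda active:", conda_active())
    for module in ("tensorflow", "horovod.tensorflow"):
        print(module, "importable:", importable(module))
```

This only checks that the modules can be located; it does not call hvd.init() or touch any GPU, so it complements rather than replaces the one-liner above.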




