[NLPL Task Force (A)] Tensorflow issues, pt. 2

Vinit Ravishankar vinitr at ifi.uio.no
Thu Oct 24 10:38:17 UTC 2019


Hi all,

I’ve been having more issues with multi-GPU OpenMPI/TensorFlow. I’m using Horovod as an OpenMPI wrapper to train someone else’s model, and it fails when I try to run it on multiple GPUs. I get this:

c7-3:24359:24475 [1] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
c7-3:24359:24475 [1] NCCL INFO enqueue.cc:438 -> 1

along with this:

ValueError: Operation name: "NoOp"
op: "NoOp"
 is not an element of this graph.

and, eventually, this:

tensorflow.python.framework.errors_impl.UnknownError: ncclAllReduce failed: unhandled cuda error
         [[{{node update/lazy_adam/aggregate/HorovodAllreduce_update_lazy_adam_gradients_counters_logits_MatMul_grad_tuple_control_dependency_1_0}}]]


(massively truncating here)

Any idea what the issue could be? Cheers.

– Vinit




