[NLPL Task Force (A)] Tensorflow issues, pt. 2
Vinit Ravishankar
vinitr at ifi.uio.no
Thu Oct 24 10:38:17 UTC 2019
Hi all,
I’ve been having more issues with multi-GPU OpenMPI/TensorFlow. I’m using Horovod (running over OpenMPI) to train someone else’s model, and it fails as soon as I run it on more than one GPU. I get this:
c7-3:24359:24475 [1] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
c7-3:24359:24475 [1] NCCL INFO enqueue.cc:438 -> 1
along with this:
ValueError: Operation name: "NoOp"
op: "NoOp"
is not an element of this graph.
and, eventually, this:
tensorflow.python.framework.errors_impl.UnknownError: ncclAllReduce failed: unhandled cuda error
[[{{node update/lazy_adam/aggregate/HorovodAllreduce_update_lazy_adam_gradients_counters_logits_MatMul_grad_tuple_control_dependency_1_0}}]]
(massively truncating here)
Any idea what the issue could be? Cheers.
– Vinit