[NLPL Task Force (A)] [uninett.no #196625] Issues with TensorFlow on Saga

Vinit Ravishankar via RT support at metacenter.no
Tue Oct 22 12:19:27 UTC 2019


Sure, let me resolve a couple of issues I’ve been having with OpenMPI and I’ll get right to it.

– Vinit

> On 21 Oct 2019, at 20:40, Stephan Oepen <oe at ifi.uio.no> wrote:
> 
> hi vinit,
> 
>> Right, turns out this was my fault — I was being extremely stupid and running this in a shell, which meant it was running on CPU :-) gave it a shot and it seems to work, thanks!
> 
> in case it makes you feel a little better: older installations of
> TensorFlow (out of the box) used to be either cpu-only or gpu-only.
> on Abel, i had tweaked the NLPL installations of TensorFlow to
> actually work in both environments, which i suspect might have given
> you a false sense of not having to think about where you test
> TensorFlow code.
> 
> as far as i understand it, newer versions have eliminated this
> inconvenience.  we have only just started to put NLPL modules on Saga,
> but it appears that both TensorFlow 1.15.0 and 2.0.0 (out of the box)
> work on either the cpu or gpu nodes.
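Because the same TensorFlow build now runs on either node type, it is easy to test on the wrong one without noticing, which is exactly what happened above. The following is a minimal sketch, not part of the NLPL setup. It reports whether the current shell can see a GPU before any TensorFlow code runs. It assumes that GPU nodes have `nvidia-smi` on the PATH and that login or CPU nodes do not; that assumption is not confirmed in this thread. `gpu_status` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether this shell appears to be on a GPU node.
# Assumption: GPU nodes provide nvidia-smi, and `nvidia-smi -L` succeeds and
# prints one line starting with "GPU" per visible device.
gpu_status() {
  if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L >/dev/null 2>&1; then
    echo "gpu: $(nvidia-smi -L | grep -c '^GPU') device(s) visible"
  else
    echo "cpu: no GPU visible, TensorFlow will run on the CPU"
  fi
}

gpu_status
```

Inside Python, `tf.test.is_gpu_available()` (present in both 1.15.0 and 2.0.0) answers the same question from TensorFlow's side once the module is loaded.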
> 
> even though i assume you are a happy conda user on Saga, could you
> give the NLPL TensorFlow modules a shot with your code?  we plan to
> put the exact same collection of modules and versions on Saga and
> Puhti, so hopefully the NLPL modules will soon aid mobility (and
> replicability) across systems.
> 
> $ module use -a /cluster/shared/nlpl/software/modules/etc
> $ module --ignore_cache load nlpl-tensorflow/1.15.0/3.7
> 
> unlike on Abel, the NLPL Python add-on modules should also allow
> virtual environments derived from them.  i have yet to confirm that,
> but at least what used to be a major obstacle (having to work around
> the system glibc version) is no longer an issue on Saga.
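As noted above, derived environments on top of the NLPL modules have not been confirmed to work yet. The following is a minimal sketch of what one could look like. It assumes the standard Python `venv` module with `--system-site-packages`, so that packages provided by the loaded NLPL module (such as TensorFlow) stay importable inside the environment. The path `$HOME/nlpl-venv`, the `VENV_DIR` override, and the commented-out package are hypothetical.

```shell
#!/usr/bin/env bash
# Stop if loading the module fails, so the venv is not built on the wrong Python.
set -e

# Load the NLPL TensorFlow module only where the module system exists (e.g. on Saga).
if type module >/dev/null 2>&1; then
  module use -a /cluster/shared/nlpl/software/modules/etc
  module --ignore_cache load nlpl-tensorflow/1.15.0/3.7
fi

# Hypothetical location for the derived environment.
venv_dir="${VENV_DIR:-$HOME/nlpl-venv}"

# --system-site-packages keeps the module's packages visible inside the venv.
python3 -m venv --system-site-packages "$venv_dir"
source "$venv_dir/bin/activate"

# Anything installed now goes into the venv, not the shared module tree.
# pip install some-extra-package   # hypothetical package name

python -c 'import sys; print(sys.prefix)'
```

Because the venv's interpreter comes from whichever `python3` was active when it was created, a job script would presumably need the same `module load` before activating it.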
> 
> cheers, oe
> 
> ps: for my own records:
> 
> for i in TensorFlow/1.13.1-fosscuda-2018b-Python-3.6.6 \
>          nlpl-tensorflow/1.15.0/3.7 nlpl-tensorflow/2.0.0/3.7; do
>   echo $i
>   module purge
>   module --ignore_cache load $i
>   python3 < /cluster/shared/nlpl/operation/python/test/tensorflow.py
> done > /tmp/cpu.log 2>&1