[NLPL Task Force (A)] [uninett.no #196625] Issues with TensorFlow on Saga

oe@ifi.uio.no via RT support at metacenter.no
Mon Oct 21 18:41:06 UTC 2019


hi vinit,

> Right, turns out this was my fault — I was being extremely stupid and running this in a shell, which meant it was running on CPU :-) gave it a shot and it seems to work, thanks!

in case it makes you feel a little better: older installations of
TensorFlow (out of the box) used to be either cpu-only or gpu-only.
on Abel, i had tweaked the NLPL installations of TensorFlow to
actually work in both environments, which i suspect might have given
you a false sense of not having to think about where you test
TensorFlow code.

as far as i understand it, newer versions have eliminated this
inconvenience.  we have only just started to put NLPL modules on Saga,
but it appears that both TensorFlow 1.15.0 and 2.0.0 (out of the box)
work on either the cpu or gpu nodes.

even though i assume you are a happy conda user on Saga, i would be
grateful if you could give the NLPL TensorFlow modules a shot with
your code.  we plan to put the exact same collection of modules and
versions on Saga and Puhti, so hopefully the NLPL modules will soon
aid mobility (and replicability) across systems.

$ module use -a /cluster/shared/nlpl/software/modules/etc
$ module --ignore_cache load nlpl-tensorflow/1.15.0/3.7
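
once a TensorFlow module (or conda environment) is active, a quick check
like the one below can tell whether the current shell actually sees a gpu.
this is only a minimal sketch: tf_where is a made-up helper name, not part
of the NLPL modules, and it assumes python3 on the path is the one provided
by the loaded module.  tf.test.is_gpu_available() exists in both 1.15 and
2.0 (later releases deprecate it).

```shell
# hypothetical helper (not part of the NLPL setup): print "gpu" or "cpu"
# depending on what the currently active TensorFlow can see
tf_where() {
  python3 - <<'EOF'
try:
    import tensorflow as tf
except ImportError:
    # no TensorFlow in this environment (e.g. module not loaded)
    print("no tensorflow")
else:
    print("gpu" if tf.test.is_gpu_available() else "cpu")
EOF
}
```

on a login shell this would be expected to print cpu; inside a gpu job it
should print gpu.  TensorFlow logs to stderr, so `tf_where 2>/dev/null`
keeps the output to the single word.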

unlike on Abel, the NLPL Python add-on modules should also allow
derived virtual environments.  i have yet to confirm that, but at
least what used to be a major obstacle there (having to work around
the system glibc version) is no longer an issue on Saga.
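
one way to try a derived virtual environment could look like the sketch
below.  it is untested against the NLPL modules; derive_venv is a
made-up helper name, and the assumption is that packages from the loaded
module stay visible either through --system-site-packages or through
PYTHONPATH, which a venv continues to honour.

```shell
# hypothetical helper: create a virtual environment on top of whatever
# python3 is currently active, keeping its installed packages visible.
# usage: derive_venv <target-dir> [extra flags for python3 -m venv]
derive_venv() {
  target="$1"
  shift
  python3 -m venv --system-site-packages "$@" "$target"
}
```

after loading nlpl-tensorflow/1.15.0/3.7 as above, something like
`derive_venv $HOME/venv/tf-test` followed by
`. $HOME/venv/tf-test/bin/activate` and `python3 -c 'import tensorflow'`
would show whether the module's TensorFlow is still importable from
inside the environment ($HOME/venv/tf-test is just an example path).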

cheers, oe

ps: for my own records:

for i in TensorFlow/1.13.1-fosscuda-2018b-Python-3.6.6 \
         nlpl-tensorflow/1.15.0/3.7 nlpl-tensorflow/2.0.0/3.7; do
  echo "$i"
  module purge
  module --ignore_cache load "$i"
  python3 < /cluster/shared/nlpl/operation/python/test/tensorflow.py
done > /tmp/cpu.log 2>&1




