[NLPL Task Force (A)] [uninett.no #196625] Issues with TensorFlow on Saga

Vinit Ravishankar vinitr at ifi.uio.no
Sun Oct 20 20:36:22 UTC 2019


Did it with HOROVOD_WITH_TENSORFLOW=1
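
A minimal sketch for checking the result, assuming the variable was set in the environment of the `pip install horovod` command (the thread does not show the exact command line). The helper names `can_load`, `can_import` and `environment_report` are hypothetical and only for illustration. The script reports whether the NVIDIA driver library from problem 2) can be loaded on the current host, and whether `tensorflow` and `horovod.tensorflow` can be imported; the latter import should fail if Horovod was built without TensorFlow support.

```python
import ctypes
import importlib
import socket


def can_load(libname="libcuda.so.1"):
    """Return True if the shared library can be loaded on this host."""
    try:
        ctypes.CDLL(libname)
        return True
    except OSError:
        # Typically the case on nodes without the NVIDIA driver installed.
        return False


def can_import(module_name):
    """Return True if the module imports without raising any exception."""
    try:
        importlib.import_module(module_name)
        return True
    except Exception:
        # Covers ImportError as well as failures raised while loading
        # native extensions during import.
        return False


def environment_report():
    """Collect the checks relevant to the errors discussed in this thread."""
    return {
        "host": socket.gethostname(),
        "libcuda_loadable": can_load("libcuda.so.1"),
        "tensorflow_importable": can_import("tensorflow"),
        "horovod_tensorflow_importable": can_import("horovod.tensorflow"),
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Running this through `srun` inside the allocation, rather than directly in the shell, shows which host the checks actually ran on.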

– Vinit

> On 20 Oct 2019, at 22:35, Thomas Röblitz via RT <support at metacenter.no> wrote:
> 
> Great! Just for the record, could you let me know what additional flags you needed?
> 
> Thomas
> 
> On Sun 20 Oct 22:34:16 2019, vinitr at ifi.uio.no wrote:
>> Yeah, I did try installing horovod myself and while it needed
>> additional flags, it looks like it’s working now. Thanks!
>> 
>> – Vinit
>> 
>>> On 20 Oct 2019, at 21:24, Thomas Röblitz via RT
>>> <support at metacenter.no> wrote:
>>> 
>>> No problem. Does this solve your problems 1) and 2)?
>>> 
>>> Concerning horovod, could you try to install this yourself? According
>>> to https://github.com/horovod/horovod#id7 it may be doable. I haven't
>>> tried it myself yet.
>>> 
>>> Enjoy your evening
>>> 
>>> Thomas
>>> 
>>> On Sun 20 Oct 20:56:58 2019, vinitr at ifi.uio.no wrote:
>>>> Right, turns out this was my fault: I was being extremely stupid and
>>>> running this in a shell, which meant it was running on CPU :-) Gave it
>>>> a shot and it seems to work, thanks!
>>>> 
>>>> – Vinit
>>>> 
>>>>> On 20 Oct 2019, at 20:45, Thomas Röblitz via RT
>>>>> <support at metacenter.no> wrote:
>>>>> 
>>>>> Hei Vinit,
>>>>> 
>>>>> I'm lacking a bit of detail here to reproduce the behaviour you are
>>>>> experiencing. For example, when I do
>>>>> 
>>>>> # module load TensorFlow/1.13.1-fosscuda-2018b-Python-3.6.6
>>>>> # which python
>>>>> /cluster/software/Python/3.6.6-fosscuda-2018b/bin/python
>>>>> # salloc --account=nn9999k --time=02:00:00 --nodes=1 \
>>>>>     --partition=accel --gres=gpu:1 --ntasks-per-node=1 --mem=32G
>>>>> 
>>>>> I get an interactive job on one of the GPU nodes. Then, when I start
>>>>> python via
>>>>> 
>>>>> # srun --pty python
>>>>> Python 3.6.6 (default, Aug  9 2019, 16:46:08)
>>>>> [GCC 7.3.0] on linux
>>>>> Type "help", "copyright", "credits" or "license" for more information.
>>>>> >>> import tensorflow as tf
>>>>> >>>
>>>>> 
>>>>> it seems to work (at least no error messages).
>>>>> 
>>>>> So you are likely doing something different. If you provide more
>>>>> details, e.g., the sequence of commands up to the error messages, I
>>>>> can have a look into the problem.
>>>>> 
>>>>> Best regards
>>>>> 
>>>>> Thomas
>>>>> 
>>>>> On Sun 20 Oct 13:45:02 2019, vinitr at ifi.uio.no wrote:
>>>>>> Hi all,
>>>>>> 
>>>>>> I’ve been having some issues getting (other people’s) TensorFlow
>>>>>> projects to run on GPU. There are two scenarios here:
>>>>>> 
>>>>>> 1. My own anaconda environment with TensorFlow installed manually
>>>>>> (this works fine for PyTorch, and, indeed, is my normal workflow):
>>>>>> ImportError: /lib64/libm.so.6: version `GLIBC_2.23' not found
>>>>>> 
>>>>>> 2. Using the tensorflow module:
>>>>>> ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
>>>>>> 
>>>>>> (2) is despite CUDA being loaded by the module (as far as I can tell,
>>>>>> anyway).
>>>>>> 
>>>>>> How do I solve this? Additionally, it would also be cool to get
>>>>>> multi-GPU support with Horovod (https://github.com/horovod/horovod),
>>>>>> something I don’t believe works at the moment.
>>>>>> 
>>>>>> Thanks!
>>>>>> 
>>>>>> – Vinit




