[NLPL Task Force (A)] [uninett.no #196965] Tensorflow issues, pt. 2

Vinit Ravishankar vinitr at ifi.uio.no
Fri Oct 25 11:06:21 UTC 2019


Incidentally, I’m still unable to even import TensorFlow when the version is 1.13.1.

Current setup: conda environment, Python 3.6, with TensorFlow installed via pip (not conda), i.e.:

pip install tensorflow-gpu==1.13.1

Modules:

module load CUDA/10.0.130 cuDNN/7.4.2.24-CUDA-10.0.130
module --ignore-cache load NCCL/2.4.8-CUDA-10.0

Error:

ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

I’ll try again with TensorFlow installed via conda, but seeing as I had the same error with that version yesterday, I’m not expecting anything to change. This is a fresh conda environment, so it shouldn’t contain any CUDA paths that would override the ones set by the modules, right?
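
In case it helps narrow this down: libcuda.so.1 normally ships with the NVIDIA driver rather than with the CUDA toolkit, so it may be worth checking whether the library is visible at all on the node where the import runs. Below is a minimal sketch of such a check, not something taken from the setup above. The helper names find_in_ld_library_path and can_dlopen are made up for illustration, and the script assumes the modules set LD_LIBRARY_PATH, which I haven't verified.

```python
import ctypes
import os


def find_in_ld_library_path(libname, env=None):
    """Return the LD_LIBRARY_PATH directories that contain `libname`.

    Hypothetical helper: it only inspects LD_LIBRARY_PATH, not the
    system linker cache or default library directories.
    """
    env = os.environ if env is None else env
    dirs = [d for d in env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if d]
    return [d for d in dirs if os.path.exists(os.path.join(d, libname))]


def can_dlopen(libname):
    """Return True if the dynamic loader can open `libname`, else False."""
    try:
        ctypes.CDLL(libname)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    lib = "libcuda.so.1"
    print("dlopen ok:", can_dlopen(lib))
    print("found in LD_LIBRARY_PATH:", find_in_ld_library_path(lib))
```

Running this in the same shell, after the module load lines and with the conda environment active, should show whether the loader can find the library independently of TensorFlow. If it cannot be opened here either, the problem is probably not specific to the pip or conda build.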

– Vinit

> On 25 Oct 2019, at 13:01, oe at ifi.uio.no via RT <support at metacenter.no> wrote:
> 
>> i am not sure that NCCL actually is dependent on a specific CUDA
>> version, but as henrik points out its module wrapper does load CUDA
>> versions that you probably do not want.  i wonder whether it might
>> work to do 'surgical' replacement of modules:
> 
> looking a little more at NCCL, it does sound as if it may be dependent
> on one specific CUDA version, at least there are different download
> packages for NCCL against CUDA 10.0 vs. 10.1.  so maybe we need to ask
> for an additional module to be installed, something like
> NCCL/2.4.8-CUDA-10.0?
> 
> henrik or thomas, if you agree that bifurcating NCCL according to CUDA
> versions will be required, could you see to the creation of such a
> module?
> 
> cheers, oe
> 
> 




