[NLPL Task Force (A)] [usit-vd-rt] [rt.uio.no #3396801] newer CUDA versions on Abel

Stephan Oepen oe at ifi.uio.no
Thu May 16 08:32:22 UTC 2019


good morning,

> Updating glibc is non-trivial; sometimes it works using LD_PRELOAD,
> sometimes it just goes down the drain (swearing omitted).

oh, that drain looks like a lot of fun to me :-).  i would be
reluctant to ask anyone but myself to go there!

NLPL has its own wrappers for newer glibc versions in place; please
see the shell script:

/projects/nlpl/software/tensorflow/1.13/bin/3.7/python3
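for anyone curious about the mechanism: the trick such a wrapper uses
is to re-execute the real interpreter under the newer glibc's dynamic
loader.  a rough sketch that only builds the command line (all paths
below are hypothetical; the actual script on Abel is the authority):

```python
# sketch of the command a glibc wrapper would exec; paths are made up
# for illustration, see the real script under /projects/nlpl/software/.
def build_wrapper_command(loader, libdir, interpreter, args):
    """Re-run `interpreter` under a newer glibc's dynamic loader.

    loader      -- the newer glibc's ld.so, e.g. '.../lib/ld-linux-x86-64.so.2'
    libdir      -- matching library directory, searched before system libraries
    interpreter -- the real python binary
    args        -- arguments passed through unchanged
    """
    return [loader, "--library-path", libdir, interpreter] + list(args)

cmd = build_wrapper_command(
    "/opt/glibc-2.27/lib/ld-linux-x86-64.so.2",  # hypothetical path
    "/opt/glibc-2.27/lib",                       # hypothetical path
    "/usr/bin/python3.7",
    ["script.py"],
)
print(" ".join(cmd))
```

the upside of the loader approach over LD_PRELOAD is that the whole
process tree runs against the newer glibc, not just selected symbols.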

so, we are nearing the completion of this thread: i can now install a
working TensorFlow 1.13 ('nlpl-tensorflow/1.13/3.7').  and it reveals
that the current driver version on the gpu nodes is not supported by
CUDA 10:

[oe@compute-19-5 ~]$ module purge; module load nlpl-tensorflow/1.13/3.7
[oe@compute-19-5 ~]$ python3 /projects/nlpl/software/tensorflow/1.13/test.py
[...]
2019-05-16 10:10:21.295841: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties:
name: Tesla K20Xm major: 3 minor: 5 memoryClockRate(GHz): 0.732
pciBusID: 0000:84:00.0
totalMemory: 5.57GiB freeMemory: 5.49GiB
2019-05-16 10:10:21.295936: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1
[...]
tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
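for reference, every CUDA runtime requires a minimum driver version,
and CUDA 10.0 wants 410.48 or newer on linux.  a small sanity check
one could run (minimum versions copied from NVIDIA's published
compatibility table; treat them as illustrative and double-check
against current NVIDIA documentation):

```python
# minimum linux driver versions per CUDA runtime, from NVIDIA's
# compatibility table (illustrative; verify against NVIDIA docs).
MIN_DRIVER = {
    "8.0": (375, 26),
    "9.0": (384, 81),
    "9.2": (396, 26),
    "10.0": (410, 48),
}

def parse_driver(version):
    """Turn a driver string like '410.48' into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split("."))

def driver_supports(driver_version, cuda_version):
    """True if the installed driver is new enough for the given CUDA runtime."""
    return parse_driver(driver_version) >= MIN_DRIVER[cuda_version]

# the K20Xm nodes presumably run something older than 410.48, hence the error:
print(driver_supports("410.48", "10.0"))  # True: just new enough
print(driver_supports("396.26", "10.0"))  # False: fine for CUDA 9.2, not 10.0
```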

seeing as you need not worry about that glibc drain on our behalf, is
there any chance the drivers on Abel gpu nodes could still be updated?
or should NLPL declare this latest TensorFlow installation cpu-only
and hope to gain trial access to Saga in the next few weeks?  the user
causing us all of this pain (NLPL project member andrey kutuzov,
copied) is signed up for the Saga trial, i think.
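
should the cpu-only route win out, users would not even need a
separate build: hiding the gpus before tensorflow is imported makes it
fall back to the cpu.  a minimal sketch using the standard CUDA
environment variable:

```python
import os

# an empty CUDA_VISIBLE_DEVICES hides all gpus from the CUDA runtime,
# so tensorflow silently falls back to the cpu.  it must be set before
# tensorflow (or anything else that initialises CUDA) is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# import tensorflow as tf   # would now see no gpu devices at all
```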

> Looking forward,

> 2)
> NIRD, the service platform, is a platform for services.
> NLPL, if I have understood correctly, is a service.

no, that feels like a misunderstanding.  NLPL is a user community in
denmark, finland, norway, and sweden doing increasingly data- and
computing-intensive research (a bit like the bio-informatics folks,
maybe, but we seem to be more 'AI-heavy').  our resource needs
currently are non-trivial but not scary: tens of terabytes of data and
millions of core hours per year.  we are reasonably technical; running
batch jobs on machines like Abel and Taito feels like a good paradigm.

there are 40-50 active NLPL users on Abel currently, but i hope you do
not see that much of this activity in terms of support requests?!
these users go to the NLPL infrastructure task force (copied) as their
first line of support.

> 4)
> It is possible to set up a local infrastructure for NLPL; it's mostly a
> question of funding. As long as it does not compete with Sigma2, it's ok.

thanks :-).  i like to describe NLPL as a self-help initiative:
enabling our user community to do a good part of the installation and
support work among ourselves.  for the time being at least, Sigma2
(who are a member of the NLPL consortium) seem supportive of that
idea.  the special twist in NLPL is the cross-border collaboration.
for example, some of the NLPL software modules on Abel are maintained
by researchers in finland and sweden; two of my doctoral students have
been running their gpu jobs on Taito for the past year or so.

greetings from the NeIC Conference :-), oe
