[NLPL Task Force (A)] [usit-vd-rt] [rt.uio.no #3396801] newer CUDA versions on Abel
Ole Saastad via RT
hpc-drift at usit.uio.no
Thu May 16 06:05:15 UTC 2019
Updating glibc is non-trivial; sometimes it works using LD_PRELOAD,
sometimes it just goes down the drain (swearing omitted).
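(For the record, the LD_PRELOAD route can be sketched roughly as below.
The /cluster/nlpl/glibc-2.23 path is hypothetical, and the directory
guard keeps the sketch harmless on systems without such a build.)

```shell
# Report which glibc the system currently runs (Abel's predates 2.23):
ldd --version 2>&1 | head -n1

# Hypothetical private glibc build; adjust to the real install path.
GLIBC=/cluster/nlpl/glibc-2.23
if [ -d "$GLIBC" ]; then
    # Option 1: preload just the newer libc. Fragile -- everything else
    # that gets loaded must stay ABI-compatible with it:
    LD_PRELOAD="$GLIBC/lib/libc.so.6" python -c 'print("ok")'

    # Option 2: run the program through the newer dynamic loader, which
    # swaps the whole glibc consistently and tends to fail less often:
    "$GLIBC/lib/ld-linux-x86-64.so.2" \
        --library-path "$GLIBC/lib:${LD_LIBRARY_PATH:-}" \
        "$(command -v python)" -c 'print("ok")'
fi
```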
Looking forward,
1)
Saga is up, but faces some issues on the storage side.
The GPUs are (from the documents) :
"Each Server will deploy 4x HPE
NVIDIA Tesla P100 PCIe 16GB
Computational Accelerator."
This is a major step up from Abel.
2)
NIRD, the service platform, is a platform for services.
NLPL, if I have understood correctly, is a service.
As NIRD is only supposed to run services (Kubernetes on bare metal, no
OpenStack virtual machines), it should be the perfect platform for
NLPL. If not, something is not right.
3)
We are looking into other forms of accelerators here at UiO.
Common to all alternatives is their ability to run TensorFlow
(if not, they're out). Some come from China, but there are also ideas
from EuroHPC, where ARM and RISC-V are on the roadmap.
4)
It is possible to set up a local infrastructure for NLPL; it's mostly
a question of funding. As long as it does not compete with Sigma2,
it's OK.
Regards,
Ole
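(As an aside: the glibc floor of a pre-built binary, such as the 2.23
requirement of the TensorFlow 1.13 wheel discussed below, can be read
straight off the file. The one-liner below demonstrates the check on
/bin/ls, since the exact path of TensorFlow's compiled .so inside the
wheel varies per install.)

```shell
# The highest GLIBC_x.y tag referenced by a binary is the minimum glibc
# version it will load against. Extract all tags and keep the highest:
grep -ao 'GLIBC_[0-9.]*' /bin/ls | sort -Vu | tail -n1

# objdump -T <file> | grep GLIBC shows the same tags per symbol, if
# binutils is available.
```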
On Thu, 2019-05-16 at 01:20 +0200, Stephan Oepen via RT wrote:
> <URL: https://rt.uio.no/Ticket/Display.html?id=3396801 >
>
> many thanks for yet another CUDA module, ole!
>
> CUDA 10.0 (combined with the cuDNN libraries from the CUDA 9.1 module)
> seems to satisfy TensorFlow 1.13. however, its pre-compiled wheel
> requires a minimum glibc version of 2.23, so i have started upgrading
> the NLPL-internal copy of glibc. just so you know this ticket is
> costing us non-trivial effort as well :-).
>
> we are hoping to start transitioning to Saga this summer; NLPL shall
> be among the pilot projects helping to 'burn-in' the new machine. my
> understanding was that Saga initially will have 32 V100 gpus, which
> would at least be an improvement over the current situation. my
> expectation is that the cpu-to-gpu ratio in future systems should
> increasingly tilt towards gpus. the Taito successor, for example,
> will come with an 'AI partition' comprising 320 V100 gpus ...
>
> we are also evaluating the NIRD Toolkit, but so far it appears to
> lack a transparent allocation and scheduling model. and there we
> would miss the accumulated NLPL project directory, both our data and
> software.
>
> cheers, oe
>
> On Tue, May 14, 2019 at 8:30 AM Ole Saastad via RT
> <hpc-drift at usit.uio.no> wrote:
> >
> >
> > Trying again,
> >
> >
> > [root at nielshenrik ~]# find /cluster/software/VERSIONS/cuda-10.0/
> > -name libcublas.so.10.0
> > /cluster/software/VERSIONS/cuda-10.0/lib64/libcublas.so.10.0
> >
> > Odd that the prehistoric cards support CUDA 10 :)
> >
> > Maybe it works now, but the drivers have not been updated. Updating
> > those is a rather bigger undertaking.
> >
> > In the long run you need to think about how and where you want to
> > run. If you depend on continuous access, Saga or the service
> > platform (SP) may not be the best fit. For services, though, SP is
> > of course the best solution. SP offers services only, via
> > Kubernetes.
> >
> >
> > Regards,
> > Ole
> >
> >
> >
> > On Mon, 2019-05-13 at 23:18 +0200, Stephan Oepen via RT wrote:
> > > many thanks, ole! i tried installing TensorFlow against the new
> > > CUDA module, but unfortunately TensorFlow and i appear a little
> > > difficult to please:
> > >
> > > ImportError: libcublas.so.10.0: cannot open shared object file:
> > > No such file or directory
> > >
> > > they do in fact specify CUDA 10.0 as the requirement for the
> > > latest TensorFlow release (1.13), so i guess they are just being
> > > cautious and stubborn.
> > >
> > > could you bring yourself to also put the CUDA 10.0 libraries into
> > > yet another module?
> > >
> > > with thanks in advance, oe
> > >
> > > On Mon, May 13, 2019 at 9:26 AM Ole Saastad via RT
> > > <hpc-drift at usit.uio.no> wrote:
> > > >
> > > > On Sun, 2019-05-12 at 13:17 +0200, Stephan Oepen via RT wrote:
> > > > >
> > > > > so, is it actually not so easy to just provision the CUDA 10
> > > > > libraries as another module, without touching the drivers?
> > > >
> > > > Done (10.1, not tested); this is just installing some software
> > > > in a directory. Updating the driver is a bit more intrusive: it
> > > > requires downtime and a reboot of the GPU compute nodes.
> > > >
> > > > > or are we running up against a matter of principle here, no
> > > > > more software updates on Abel?
> > > > >
> > > >
> > > > Yes, you are right; with only months left of its lifetime, we
> > > > spend most of our effort on installing and setting up the new
> > > > systems. Plans were to shut down Abel more than a year ago.
> > > >
> > > > Regards,
> > > > Ole
> > > >
> > > >
> > > > > cheers, oe
> > > > >
> > > > >
> > > >
> > > > --
> > > > Ole W. Saastad, Dr.Scient.
> > > > UiO/USIT/UVA/ITF/FI
> > > > Besøk: Kristen Nygaards hus - Rom 2315
> > > > Post: Gaustadalléen 23A, 0349 Oslo
> > > > USIT, Postboks 1059 Blindern, 0316 Oslo
> > > > Tel: +47-22840752
> > > >
> > > >
> > > >
> > >
> > >
> >
>
>
More information about the infrastructure mailing list