[NLPL Task Force (A)] [rt.uio.no #3396801] newer CUDA versions on Abel
Stephan Oepen via RT
hpc-drift at usit.uio.no
Wed May 15 23:20:33 UTC 2019
many thanks for yet another CUDA module, ole!
CUDA 10.0 (combined with the cuDNN libraries from the CUDA 9.1 module)
seems to satisfy TensorFlow 1.13. however, its pre-compiled wheel
requires a minimum glibc version of 2.23, so i have started upgrading
the NLPL-internal copy of glibc. just so you know this ticket is
costing us non-trivial effort as well :-).
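for what it is worth, both prerequisites can be probed from python before installing anything. the sketch below is only an illustration under stated assumptions: the helper names (parse_version, glibc_at_least, can_load) are made up here, and platform.libc_ver() reports the libc that the running interpreter is linked against, which need not be a private copy such as the NLPL-internal one.

```python
import ctypes
import platform


def parse_version(text):
    """Turn a dotted version string such as '2.23' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def glibc_at_least(required, found=None):
    """Return True if the libc version `found` is at least `required`.

    When `found` is None, ask the running interpreter which libc it is
    linked against; on non-glibc systems the version string is empty.
    """
    if found is None:
        _, found = platform.libc_ver()
    if not found:
        return False
    # Compare as integer tuples so that 2.3 sorts before 2.23.
    return parse_version(found) >= parse_version(required)


def can_load(library_name):
    """Return True if the dynamic loader can find and open the library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    # 2.23 and libcublas.so.10.0 are the values mentioned in this thread.
    print("glibc >= 2.23:", glibc_at_least("2.23"))
    print("libcublas.so.10.0 loadable:", can_load("libcublas.so.10.0"))
```

can_load() only succeeds if the library directory is on the loader's search path (e.g. via LD_LIBRARY_PATH after loading the module), which is the same condition under which the TensorFlow import would find it.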
we are hoping to start transitioning to Saga this summer; NLPL shall
be among the pilot projects helping to 'burn-in' the new machine. my
understanding was that Saga initially will have 32 V100 gpus, which
would at least be an improvement over the current situation. my
expectation is that the cpu-to-gpu ratio in future systems should
increasingly tilt towards gpus. the Taito successor, for example,
will come with an 'AI partition' comprising 320 V100 gpus ...
we are also evaluating the NIRD Toolkit, but so far it appears to lack
a transparent allocation and scheduling model. and there we would
miss the accumulated NLPL project directory, both our data and
software.
cheers, oe
On Tue, May 14, 2019 at 8:30 AM Ole Saastad via RT
<hpc-drift at usit.uio.no> wrote:
>
>
> Trying again,
>
>
> [root@nielshenrik ~]# find /cluster/software/VERSIONS/cuda-10.0/ -name libcublas.so.10.0
> /cluster/software/VERSIONS/cuda-10.0/lib64/libcublas.so.10.0
>
> Odd that the prehistoric cards support CUDA 10 :)
>
> Maybe it works now, but the drivers have not been updated. That is a
> rather bigger job.
>
> In the longer term you need to think about how and where you want to
> run. If you depend on continuous access, Saga or the service platform
> (SP) may not be the best choice. For services, though, SP is of course
> the best solution. SP offers services only (Kubernetes).
>
>
> Regards,
> Ole
>
>
>
> On Mon, 2019-05-13 at 23:18 +0200, Stephan Oepen via RT wrote:
> > many thanks, ole! i tried installing TensorFlow against the new CUDA
> > module, but unfortunately TensorFlow and i appear a little difficult
> > to please:
> >
> > ImportError: libcublas.so.10.0: cannot open shared object file: No such file or directory
> >
> > they do in fact specify CUDA 10.0 as the requirement for the latest
> > TensorFlow release (1.13), so i guess they are just being cautious
> > and stubborn.
> >
> > could you bring yourself to also putting the CUDA 10.0 libraries into
> > yet another module?
> >
> > with thanks in advance, oe
> >
> > On Mon, May 13, 2019 at 9:26 AM Ole Saastad via RT
> > <hpc-drift at usit.uio.no> wrote:
> > >
> > > On Sun, 2019-05-12 at 13:17 +0200, Stephan Oepen via RT wrote:
> > > > >
> > > >
> > > > so, is it actually not so easy to just provision the CUDA 10
> > > > libraries as another module, without touching the drivers?
> > >
> > > Done (10.1, not tested); this is just installing some software in a
> > > directory. Updating the driver is a bit more intrusive. It requires
> > > downtime and a reboot of the gpu compute nodes.
> > >
> > > > or are we running up
> > > > against a matter of principle here, no more software updates on
> > > > Abel?
> > > >
> > >
> > > Yes, you are right; with only months left of its lifetime we spend
> > > most of our effort on installing and setting up the new systems.
> > > The plan was to shut down Abel more than a year ago.
> > >
> > > Regards,
> > > Ole
> > >
> > >
> > > > cheers, oe
> > > >
> > > >
> > >
> > > --
> > > Ole W. Saastad, Dr.Scient.
> > > UiO/USIT/UVA/ITF/FI
> > > Visit: Kristen Nygaards hus - Room 2315
> > > Mail: Gaustadalléen 23A, 0349 Oslo
> > > USIT, Postboks 1059 Blindern, 0316 Oslo
> > > Tel: +47-22840752
> > >
> > >
> > >
> >
> >
> --
> Ole W. Saastad, Dr.Scient.
> UiO/USIT/UVA/ITF/FI
> Visit: Kristen Nygaards hus - Room 2315
> Mail: Gaustadalléen 23A, 0349 Oslo
> USIT, Postboks 1059 Blindern, 0316 Oslo
> Tel: +47-22840752
>
>
>