[NLPL Task Force (A)] [uninett.no #211080] NCCL on Saga (for use with CUDA 10.2)
oe@ifi.uio.no via RT
metacenter-software at metacenter.no
Tue May 12 11:08:10 UTC 2020
hi vegard,
> Are you sure about the CUDA dependency? It seems to me that for each version
> of NCCL at NVIDIA there are different packages for download for different CUDA
> versions?
no, i am not quite sure about interoperability in the complex world of
NVIDIA libraries :-). i recall i was looking at an NCCL compatibility
chart yesterday, where i thought they indicated 9.0 through 10.2 for
both NCCL versions 2.5 and 2.6, but i cannot find that page again
today. looking at the individual release notes and NCCL installation
guides, they seem to say CUDA 9.0 or higher for NCCL 2.5.4 and
CUDA 10.0 or higher for NCCL 2.6.5:
https://docs.nvidia.com/deeplearning/nccl/archives/nccl_256/nccl-release-notes/rel_2-5-6.html#rel_2-5-6
https://docs.nvidia.com/deeplearning/nccl/archives/nccl_264/nccl-release-notes/rel_2-6-x.html#rel_2-6-x
so, yes, my impression is still that NCCL is not tightly tied to
any specific minor CUDA release version, but there are most likely
interoperability constraints.
i think this raises interesting design questions for how to structure
the module hierarchy (i am working in parallel with colleagues under
the EOSC-Nordic umbrella towards a recipe for maintaining exactly
parallel software environments on multiple systems using EasyBuild; in
fact, i have been meaning for a while to reach out to you guys about
some of the choices).
if you were to install 2.6.5 without any specific dependencies on Saga
now, there is no safety net for users against combining it with CUDA
9.0, which at least officially is not a supported configuration. on
the other hand, tying NCCL to one specific CUDA version would require
cross-multiplication of modules, e.g. i think i would like to have
available at least NCCL {2.5, 2.6} x CUDA {10.0, 10.1, 10.2}, i.e. a total
of six separate modules. assuming we trust the NCCL release notes
above, my ideal solution would be one 'intelligent' NCCL module: for
example, NCCL 2.6.5 could test for the presence of a suitable
pre-loaded CUDA module (as of today: 10.0, 10.1, or 10.2) and refuse
to load (with an informative message :-) when this condition is not
met.
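
to make the 'intelligent' module idea concrete, here is a minimal sketch of the check in python. it is only an illustration of the logic, not an actual module file: the names MIN_CUDA, parse_version, and check_cuda are hypothetical, the minimum versions are simply the ones read off the release notes above, and keying the table by "major.minor" series (rather than by exact NCCL release) is an assumption. it only enforces a lower bound on the CUDA version.

```python
# minimum CUDA version per NCCL release series, as read from the release
# notes cited above; keying by "major.minor" series is an assumption
MIN_CUDA = {"2.5": (9, 0), "2.6": (10, 0)}


def parse_version(text):
    """turn a version string such as '10.2' into a comparable tuple (10, 2)."""
    return tuple(int(part) for part in text.split("."))


def check_cuda(nccl_version, loaded_cuda):
    """return (ok, message); loaded_cuda is the pre-loaded CUDA version or None."""
    series = ".".join(nccl_version.split(".")[:2])
    minimum = MIN_CUDA[series]
    wanted = ".".join(str(n) for n in minimum)
    if loaded_cuda is None:
        return False, f"NCCL {nccl_version}: please load a CUDA module ({wanted} or higher) first"
    if parse_version(loaded_cuda) < minimum:
        return False, f"NCCL {nccl_version} requires CUDA {wanted} or higher, but CUDA {loaded_cuda} is loaded"
    return True, f"NCCL {nccl_version} is compatible with CUDA {loaded_cuda}"


if __name__ == "__main__":
    # walk through the cases discussed above for NCCL 2.6.5
    for cuda in (None, "9.0", "10.0", "10.2"):
        ok, message = check_cuda("2.6.5", cuda)
        print("load" if ok else "refuse", "-", message)
```

a real module file would have to express the same test in the module system's own language, but the point is that a single NCCL module with one small table could replace the six cross-multiplied modules.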
what do you make of these choices (also as input to our EasyBuild
pilot under EOSC-Nordic)? seeing as there are several layers
of prerequisite dependencies already (e.g. the toolchain, MKL vs.
OpenBLAS, different MPI or CUDA versions), my general attitude has
been to try and not make module dependencies more specific than they
need to be ...
best wishes, oe