[NLPL Task Force (A)] NLPL modules python issue
Stephan Oepen
oe at ifi.uio.no
Thu Jan 30 18:28:24 UTC 2020
okay, please try again:
module purge; module load nlpl-tensorflow/1.15.2/3.7
oe
On Tue, Jan 28, 2020 at 11:00 AM Stephan Oepen <oe at ifi.uio.no> wrote:
>
> do you have a minimum test case that i could use to validate a new
> (forthcoming) TF installation? i use the script in
> $NLPLROOT/operation/python/test/tensorflow.py, which confirms basic
> CUDA functionality but apparently does not validate the parts that
> depend on cuDNN ... something involving convolutions, i suspect?
>
> i expect we will see cuDNN 7.6 and CUDA 10.1 decoupled during the day
> today, so then i should be able to produce a fresh install of TF
> 1.15.2.
>
> oe
>
>
> On Mon, Jan 27, 2020 at 4:05 PM Andrei Kutuzov <andreku at ifi.uio.no> wrote:
> >
> > Hi,
> >
> > This story will never end :)
> >
> > Even without the dependency on GCC 8.3, the new Saga modules
> > cuDNN/7.6.4.38-CUDA-10.1.243 and CUDA/10.1.243 conflict with
> > nlpl-tensorflow/1.15.0/3.7
> > The reason seems to be that the TF module is compiled using CUDA
> > 10.0.130, not CUDA 10.1.243. Thus, nlpl-tensorflow/1.15.0/3.7 and
> > CUDA/10.1.243 cannot be loaded together.
> >
> > 27.01.2020 11:04, Andrei Kutuzov wrote:
> > > Hi Stephan,
> > >
> > > The correct version of CuDNN is now installed on Saga, and I can load
> > > it, but there is another problem. It seems that all these three modules
> > > are dependent on GCCcore/8.2.0:
> > > nlpl-python-candy/201912/3.7 nlpl-scipy/201910/3.7
> > > nlpl-tensorflow/1.15.0/3.7
> > >
> > > As far as I can tell, this makes them incompatible with the new CUDA
> > > modules, which use GCCcore/8.3.0.
> > >
> > > Is it possible to re-compile the NLPL modules with GCC 8.3?
> > >
> > > 23.01.2020 18:10, Andrei Kutuzov wrote:
> > >> Hi Stephan,
> > >>
> > >> In the end, it was too early to celebrate. Indeed, I can run python3
> > >> from the nlpl-tensorflow/1.15.0/3.7 module now.
> > >>
> > >> But it seems that TF in this module was compiled against a different
> > >> CuDNN version than the one currently used on the Saga GPU nodes.
> > >>
> > >> The result is that it is impossible to run GPU jobs with this module. TF
> > >> first produces the following warnings:
> > >>
> > >> 2020-01-23 17:55:36.543187: E
> > >> tensorflow/stream_executor/cuda/cuda_dnn.cc:319] Loaded runtime CuDNN
> > >> library: 7.4.2 but source was compiled with: 7.6.0. CuDNN library major
> > >> and minor version needs to match or have higher minor version in case of
> > >> CuDNN 7.0 or later version. If using a binary install, upgrade your
> > >> CuDNN library. If building from sources, make sure the library loaded
> > >> at runtime is compatible with the version specified during compile
> > >> configuration.
> > >>
> > >> ...and then it fails like this:
> > >>
> > >> tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
> > >> (0) Unknown: Failed to get convolution algorithm. This is probably
> > >> because cuDNN failed to initialize, so try looking to see if a warning
> > >> log message was printed above.
> > >>
> > >>
> > >> I guess, TF should be compiled with CuDNN 7.4.2 in order to work
> > >> properly on Saga.
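The compatibility rule stated in the log message above can be written out as a short check. This is a sketch of the rule as worded in that message, not TensorFlow's own code; the function names are hypothetical.

```python
def parse_version(text):
    """Turn a dotted version string such as '7.4.2' into (7, 4, 2)."""
    return tuple(int(part) for part in text.split("."))


def cudnn_runtime_ok(loaded, compiled):
    """Apply the rule from the log message: major versions must match,
    and the minor version must match too, except that from CuDNN 7.0
    onward a higher runtime minor version is also accepted."""
    loaded_major, loaded_minor = parse_version(loaded)[:2]
    compiled_major, compiled_minor = parse_version(compiled)[:2]
    if loaded_major != compiled_major:
        return False
    if loaded_major >= 7:
        return loaded_minor >= compiled_minor
    return loaded_minor == compiled_minor


# The combination reported above: runtime 7.4.2, compiled with 7.6.0.
print(cudnn_runtime_ok("7.4.2", "7.6.0"))
```

Under this rule there are two ways out: rebuild TF against CuDNN 7.4.2 as suggested above, or provide a runtime CuDNN of 7.6 or later on the GPU nodes.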
> > >>
> > >>
> > >> On 1/13/20 10:10 AM, Sara Stymne wrote:
> > >>> Hi Stephan,
> > >>>
> > >>> Yes, it seems to work fine for me as well. I can load and run nlpl-uuparser/2.3.1, which also loads many of the other modules, and python seems to work fine.
> > >>>
> > >>> Thanks for resolving this so quickly!
> > >>>
> > >>> Best,
> > >>> Sara
> > >>>
> > >>>
> > >>> On 12 Jan 2020, at 20:34, Andrei Kutuzov <andreku at ifi.uio.no> wrote:
> > >>>
> > >>>> Hi Stephan,
> > >>>>
> > >>>> Yes, I can confirm that at least for me this works. I can now run
> > >>>> python3 from the nlpl-tensorflow/1.15.0/3.7 module.
> > >>>>
> > >>>> Thanks for resolving this!
> > >>>>
> > >>>> 12.01.2020 4:26, Stephan Oepen wrote:
> > >>>>> hi again, sara, andrey, all,
> > >>>>>
> > >>>>> i believe i managed to track down this problem and was relieved to see
> > >>>>> it is a recently introduced issue: the NLPL binaries for these Python
> > >>>>> add-on modules had inadvertently had their set-group-id bit ('g+s')
> > >>>>> set, which i am pretty sure was the result of a major recursive
> > >>>>> adjustment of file permissions right before the holidays. this bit
> > >>>>> (probably) should be set on directories, where it will cause the group
> > >>>>> owner to be inherited onto new sub-directories or files; but on
> > >>>>> executable files (run by anyone but me or root), it actually caused a
> > >>>>> loss of privileges that prevented the search for the base shared
> > >>>>> libraries. note to self: this was tedious to debug, because the
> > >>>>> problem goes away when running in the scope of strace(1); it turns
> > >>>>> out, strace(1) prevents setuid(2) and setgid(2) execution ...
> > >>>>>
> > >>>>> sara and andrey, please try again. i hope the NLPL add-on modules are
> > >>>>> back to normal now?
> > >>>>>
> > >>>>> all best, oe
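A sketch of how stray set-group-id bits on regular files could be located and cleared after a recursive permission change, leaving directories alone as described above. The function names are hypothetical, and this is an illustration rather than the procedure actually used. The likely mechanism behind the symptom, stated here as an assumption, is that the dynamic linker ignores LD_LIBRARY_PATH for set-group-id programs, so libpython3.7m.so.1.0 was no longer found.

```python
import os
import stat


def find_setgid_files(root):
    """Return regular files under root that have the set-group-id bit."""
    hits = []
    for directory, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(directory, name)
            mode = os.lstat(path).st_mode
            # Only regular files are reported; on directories g+s is
            # the wanted setting and is left untouched.
            if stat.S_ISREG(mode) and mode & stat.S_ISGID:
                hits.append(path)
    return sorted(hits)


def clear_setgid(path):
    """Drop only the set-group-id bit, keeping all other mode bits."""
    mode = stat.S_IMODE(os.lstat(path).st_mode)
    os.chmod(path, mode & ~stat.S_ISGID)
```

A cleanup would call clear_setgid on each path returned by find_setgid_files for the affected installation tree.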
> > >>>>>
> > >>>>> On Fri, Jan 10, 2020 at 11:52 AM Sara Stymne <sara.stymne at lingfil.uu.se> wrote:
> > >>>>>>
> > >>>>>> No, neither of us had tried it before. I think it might have worked for Ali, but I'm not sure.
> > >>>>>>
> > >>>>>>
> > >>>>>> Best,
> > >>>>>>
> > >>>>>> Sara
> > >>>>>>
> > >>>>>>
> > >>>>>> ________________________________
> > >>>>>> From: Stephan Oepen <oe at ifi.uio.no>
> > >>>>>> Sent: 10 January 2020 11:50:14
> > >>>>>> To: Sara Stymne
> > >>>>>> Cc: Martin Matthiesen; Ali Basirat; infrastructure
> > >>>>>> Subject: Re: [NLPL Task Force (A)] NLPL modules python issue
> > >>>>>>
> > >>>>>>> We tried it on both my and Artur's accounts here, and had the same issue.
> > >>>>>>
> > >>>>>>>> python: error while loading shared libraries: libpython3.7m.so.1.0: cannot open shared object file: No such file or directory
> > >>>>>>
> > >>>>>> had either of you tried before (in other words, is this a recent
> > >>>>>> problem)? i installed most of these modules last november, but cannot
> > >>>>>> know how many people have tried using them (i believe i know for sure
> > >>>>>> several of them work for vinit and yves) ...
> > >>>>>>
> > >>>>>> oe
> > >>>>>>
> > >>>>>> E-mailing Uppsala University means that we will process your personal data. For more information on how this is performed, please read here: http://www.uu.se/en/about-uu/data-protection-policy
> > >>>>
> > >>>>
> > >>>> --
> > >>>> Andrei
> > >>>> PhD Candidate at Language Technology Group (LTG)
> > >>>> University of Oslo
> > >>>
> > >>
> > >>
> > >
> > >
> >
> >
> > --
> > Andrei
> > PhD Candidate at Language Technology Group (LTG)
> > University of Oslo