[NLPL Task Force (A)] rolling your own BERT (and maybe ELMo) on Saga

Antti Virtanen sajvir at utu.fi
Mon Jan 27 15:11:01 UTC 2020


Here's a (quick and dirty) repo for the code we used to train FinBERT: https://github.com/haamis/DeepLearningExamples_FinBERT/tree/master/TensorFlow/LanguageModeling/BERT_nonscaling. This one has the sbatch files used: https://github.com/haamis/BERT-pretraining


-Antti


________________________________
From: Stephan Oepen <oe at ifi.uio.no>
Sent: Monday, January 27, 2020 4:48 PM
To: Antti Virtanen
Cc: Andrei Kutuzov; Filip Ginter; infrastructure
Subject: Re: rolling your own BERT (and maybe ELMo) on Saga

thanks, antti!  we had previously put NLPL versions of TF and Horovod on Saga, so possibly the trickier parts actually are already in place :-).  once you have something resembling a sample invocation (preferably on a smallish test case, say targeting 6 gpus on two nodes), i will be eager to test it for you!  we have Puhti access too, so if need be i can try there or look up specific version numbers ...

cheers, oe
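A multi-node invocation of the kind requested above might be sketched as follows. This is only a sketch of a Slurm + Horovod job: the module name, account, and the run_pretraining.py flags are assumptions, not taken from the thread or the linked repos.

```shell
#!/bin/bash
# Sketch of a 2-node / 6-GPU Horovod pretraining job on a Slurm cluster
# like Saga. Module name, time limit, and script flags are assumptions.
#SBATCH --job-name=bert-pretrain
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=3        # one MPI rank per GPU
#SBATCH --gres=gpu:3               # 3 GPUs per node -> 6 in total
#SBATCH --time=02:00:00

module load nlpl-tensorflow        # hypothetical NLPL TF + Horovod module

# srun launches one process per task; Horovod discovers the ranks via MPI,
# so the training script itself needs no node list.
srun python run_pretraining.py --horovod \
    --input_files_dir=data/tfrecords \
    --output_dir=checkpoints \
    --do_train=True
```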


On Mon, 27 Jan 2020 at 15:39 Antti Virtanen <sajvir at utu.fi> wrote:
Hi,

We used the tensorflow/1.13.1-hvd module on Puhti. As you might figure out from the name, it includes TensorFlow 1.13 and Horovod 0.16.4, plus their dependencies (https://docs.csc.fi/apps/tensorflow/). I can give you a list of the packages in that module from Puhti if you wish. Also worth noting: we had to create symlinks in the code directory to the CUDA files `libdevice.10.bc` and `ptxas` to get XLA working correctly, although I believe this is down to Puhti's environment being misconfigured.
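The XLA workaround described above amounts to something like the following sketch. The exact CUDA paths on Puhti may differ; `$CUDA_HOME` standing for the CUDA root and the code-directory path are assumptions.

```shell
# Sketch of the symlink workaround for XLA. In a standard CUDA install,
# libdevice.10.bc sits under nvvm/libdevice/ and ptxas under bin/;
# $CUDA_HOME pointing at the CUDA root is an assumption here.
module load tensorflow/1.13.1-hvd
cd /path/to/BERT_nonscaling          # the directory the training runs from
ln -sf "$CUDA_HOME/nvvm/libdevice/libdevice.10.bc" .
ln -sf "$CUDA_HOME/bin/ptxas" .
```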

-Antti
________________________________________
From: Andrei Kutuzov <andreku at ifi.uio.no>
Sent: Monday, January 27, 2020 4:15 PM
To: Stephan Oepen
Cc: Antti Virtanen; Filip Ginter; infrastructure
Subject: Re: rolling your own BERT (and maybe ELMo) on Saga

No, I tried only multiple GPUs (up to 4) within the same node.

27.01.2020 15:14, Stephan Oepen wrote:
> across multiple nodes?  oe
>
>
> On Mon, 27 Jan 2020 at 15:07 Andrei Kutuzov <andreku at ifi.uio.no> wrote:
>
>     27.01.2020 14:29, Stephan Oepen wrote:
>     >> Antti can tell about the exact GPU stuff needed. We will run the
>     >> tutorial on puhti since this is a tried and tested environment for
>     >> us, and we have little time to prepare, so we play it safe. But
>     >> Antti can tell what it takes to run the BERT code.
>     > yes, if possible, i could see myself try and replicate your software
>     > environment on Saga ... the multi-gpu part sounds like an interesting
>     > new challenge :-)!
>     Hi all,
>
>     Well, at least TensorFlow has no problems with multi-GPU training on
>     Saga, works more or less out of the box.
>
>
>     --
>     Andrei
>     PhD Candidate at Language Technology Group (LTG)
>     University of Oslo
>


--
Andrei
PhD Candidate at Language Technology Group (LTG)
University of Oslo

