[NLPL Infrastructure] Fwd: [NLPL Board] collaboration on monolingual BERT training
Stephan Oepen
oe at ifi.uio.no
Wed Jan 13 09:44:50 UTC 2021
colleagues:
how about we hold a virtual Nordic BERT summit sometime in early
february, as part of our NLPL use case in EOSC-Nordic. for example,
12:00-15:00 CET on a friday, february 5 or 12?
we could invite participants to review their experience so far in
building BERT-like models, exchange best practices regarding data
preparation and training, offer to help provide a uniform software
environment for these efforts on Puhti and Saga, and gather input on
what more would be useful to coordinate (and maybe also which
allocations our danish and swedish partners would want, on which
systems)?
oe
---------- Forwarded message ---------
From: Daniel Hershcovich <dh at di.ku.dk>
Date: Wed, Jan 13, 2021 at 10:20 AM
Subject: Re: [NLPL Board] collaboration on monolingual BERT training
To: Barbara Plank <bapl at itu.dk>, Tiedemann, Jörg
<jorg.tiedemann at helsinki.fi>, Filip Ginter <figint at utu.fi>
Cc: erikve at ifi.uio.no <erikve at ifi.uio.no>, Andrei Kutuzov
<andreku at ifi.uio.no>, Lilja Øvrelid <liljao at ifi.uio.no>, board
<board at nlpl.eu>, Jeremy Claude Barnes <jeremycb at ifi.uio.no>, Sampo
Pyysalo <sampo.pyysalo at gmail.com>
We haven't looked into training much yet, but are interested in
principle, especially since we've observed mixed results for Danish
BERT (and Nordic BERT) over mBERT (worse for summarisation and
coreference, but slightly better for NER).
Best,
Daniel
____________________
Daniel Hershcovich, PhD
Tenure-Track Assistant Professor
Department of Computer Science
University of Copenhagen
Universitetsparken 1, 2100 Copenhagen, Denmark
https://danielhers.github.io/
________________________________
From: board <board-bounces at nlpl.eu> on behalf of Barbara Plank <bapl at itu.dk>
Sent: 13 January 2021 09:25:33
To: Tiedemann, Jörg; Filip Ginter
Cc: erikve at ifi.uio.no; Andrei Kutuzov; Lilja Øvrelid; board; Jeremy
Claude Barnes; Sampo Pyysalo
Subject: Re: [NLPL Board] collaboration on monolingual BERT training
For Danish, there are also a few existing ones around, created by an
SME (Danish BERT, which was originally available only in an uncased
version). An ELECTRA model appeared recently, too (student from
Aarhus).
I am currently looking into Danish BERT varieties trained on more
varied data, and the infrastructure would indeed be of great help.
Best,
Barbara
________________________________
From: board <board-bounces at nlpl.eu> on behalf of Tiedemann, Jörg
<jorg.tiedemann at helsinki.fi>
Sent: Wednesday, January 13, 2021 9:11 AM
To: Filip Ginter <figint at utu.fi>
Cc: erikve at ifi.uio.no <erikve at ifi.uio.no>; Andrei Kutuzov
<andreku at ifi.uio.no>; Lilja Øvrelid <liljao at ifi.uio.no>; board
<board at nlpl.eu>; Jeremy Claude Barnes <jeremycb at ifi.uio.no>; Sampo
Pyysalo <sampo.pyysalo at gmail.com>
Subject: Re: [NLPL Board] collaboration on monolingual BERT training
I also tested with a combined Finnish / Estonian ELECTRA model and
would definitely also be interested in that (maybe including some
smaller Uralic languages as well?). I already had names for all kinds
of models, e.g. festBERT for the Finnish/Estonian one and LARS for the
Scandinavian one (don’t exactly remember the reasoning behind it -
maybe it was Languages AcRoss Scandinavia?) - and then the obvious
ones like IceBERT, DanBERT, SweBERT, NorBERT ...
(FestBERT is my favourite …)
All the best,
Jörg
*****************************************************************
Jörg Tiedemann
Language Technology
https://blogs.helsinki.fi/language-technology/
University of Helsinki
On 13. Jan 2021, at 9.53, Filip Ginter <figint at utu.fi> wrote:
Hi
I cc: this to Sampo. He has by now trained - was it 40? - monolingual
berts and seems to always want to make some more of those. :)
...Sampo? Scandinavian bert sounds fun. I guess Finnish is not welcome
there. :D (which is fine! couldn't help the attempt at a joke)
- Filip
On Wed, Jan 13, 2021 at 9:48 AM Tiedemann, Jörg
<jorg.tiedemann at helsinki.fi> wrote:
I am definitely interested especially in the Scandinavian model.
I did some preliminary experiments with multilingual ELECTRA models
but didn’t finish anything in the end. Joining forces for getting
things done would be great!
All the best,
Jörg
*****************************************************************
Jörg Tiedemann
Language Technology
https://blogs.helsinki.fi/language-technology/
University of Helsinki
On 13. Jan 2021, at 0.38, Stephan Oepen <oe at ifi.uio.no> wrote:
colleagues:
as part of the NLPL use case in EOSC-Nordic (and in joint work with
the norwegian SANT project), our group has been working on an
infrastructure for training BERT-like models (following in the
footsteps of the FinBERT pioneers at turku :-); please see:
http://wiki.nlpl.eu/Vectors/norlm
in talking earlier today, the idea came up of training similar models
for danish and swedish and possibly also looking into the utility of a
joint scandinavian model. could any of you be interested in
collaborating on this?
please pardon my ignorance: what is the (monolingual) BERT-like
situation for the other scandinavian languages?
in terms of technical infrastructure, we believe the software is ready
to run on Saga. we have put significant effort into automating the
installation (using EasyBuild) and will want to replicate the training
environment on Puhti (and possibly other scandinavian systems) soon.
doing this jointly with NLPL partners would be very much in the spirit
of our EOSC-Nordic use case ...
best wishes, oe