[NLPL Users] [NLPL Team] updates from the NLPL vectors repository
oe at ifi.uio.no
Sun Dec 29 20:58:05 CET 2019
many thanks for the updates, andrey!
one addition from my side: the vectors repository (as well as the NLPL
corpora collection and parsing and translation data sets) is automatically
replicated among all NLPL systems, i.e. as of today Abel and Saga in norway
and Puhti and Taito in finland. Abel and Taito, however, are about to be
decommissioned. the much larger OPUS parallel corpora used to not be
replicated outside finland, but will also become available on Saga starting
best wishes, oe
On Sun, 29 Dec 2019 at 20:09 Andrei Kutuzov <andreku at ifi.uio.no> wrote:
> Dear colleagues,
> At the end of 2019, we announce the release of the NLPL vectors
> repository version 2.0.
> The NLPL vectors repository is a large collection of distributional
> embeddings for multiple languages, stored in the NLPL project directory
> below '/cluster/shared/nlpl/data/vectors/' (on Saga) and
> '/proj/nlpl/data/vectors/' (on Taito). There is also an online service
> for model exploring and downloading, available at
> See more info about the repository at
> Main changes in the 2.0 release:
> - The files moved from Abel to Saga.
> - It includes Finnish BERT models and more ELMo models for different
> languages; the new ELMo models are trained by us on much larger datasets
> than those used in the ELMoForManyLangs models which were present in the
> repository before.
> - More 'static' word embeddings (word2vec, fastText, GloVe, etc.) added;
> we now feature more than 200 models overall.
> - All ZIP archives with static word embeddings now contain not only
> the plain-text model but also the same model in binary format, for
> faster loading.
> - Metadata has been made more consistent and is now documented at
> 16.01.2019 16:52, Stephan Oepen wrote:
> > dear colleagues,
> > at the end of last year, we had several project-internal deliverables,
> > including:
> > E.2 Updated Embeddings, including additional languages
> > i am happy to report that there have been continuous updates to our
> > embeddings repository throughout the year. since the end of 2017
> > (repository version 1.1), we have added some 100 models for about 45
> > languages. we have also added a new type of models (ELMo for 44
> > languages, contributed by the HIT team in china, trained on samples
> > from the NLPL Common Crawl corpora that filip and colleagues compiled
> > in 2017); and we agreed to collaborate with the
> > ‘http://rusvectores.org’ project and act as their distribution channel
> > for a variety of russian models.
> > some general documentation on the repository and an interactive search
> > tool among available models are available on-line:
> > http://wiki.nlpl.eu/index.php/Vectors/home
> > http://vectors.nlpl.eu/repository/
> > we now expect to freeze the current repository version (1.2) and start
> > work towards a successor release. the reason for minting a new
> > version number in 2019 is not only that it nicely reflects the NLPL
> > schedule of deliverables, but more contentfully that we have
> > identified a number of non–backwards-compatible revisions that we want
> > to apply to the metadata and internal structure of each repository
> > entry (i.e. the ‘.zip’ archives containing the actual models). we
> > plan on presenting the current state of affairs during the upcoming
> > winter school and ask for input on our plans for redesign.
> > andrey kutuzov (at UiO) continues to be the main developer of the
> > repository; please do not hesitate to share your feedback or
> > suggestions for improvement with andrey and me!
> > best wishes, oe
> PhD Candidate at Language Technology Group (LTG)
> University of Oslo
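
The 2.0 release notes quoted above say that each ZIP archive with static word embeddings now holds both a plain-text and a binary copy of the same model. As a rough illustration only, the sketch below picks the binary copy when one is present and falls back to the text copy otherwise. The file suffixes ('.bin', '.txt', '.vec') and the "largest matching file is the model" heuristic are assumptions made for this example, not something the announcement specifies; check them against the listing of a real archive.

```python
import io
import zipfile

# Assumed suffixes (not stated in the announcement); adjust to the real archives.
BINARY_SUFFIXES = (".bin",)
TEXT_SUFFIXES = (".txt", ".vec")


def pick_model_member(archive):
    """Return the archive member to load, preferring the binary format.

    `archive` is a path or a file-like object. Among members with a
    matching suffix, the largest one is assumed to be the model, so that
    small text files (for example a README) are not picked by mistake.
    Returns None if no member matches.
    """
    with zipfile.ZipFile(archive) as zf:
        members = zf.infolist()
    for suffixes in (BINARY_SUFFIXES, TEXT_SUFFIXES):
        candidates = [m for m in members
                      if m.filename.lower().endswith(suffixes)]
        if candidates:
            return max(candidates, key=lambda m: m.file_size).filename
    return None


def make_demo_archive(names_and_sizes):
    """Build a small in-memory ZIP with dummy members (demo data only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, size in names_and_sizes:
            zf.writestr(name, b"x" * size)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    # Hypothetical member names, chosen only for the demonstration.
    demo = make_demo_archive(
        [("model.txt", 50), ("model.bin", 20), ("README.txt", 5)]
    )
    print(pick_model_member(demo))
```

If the selected member is in word2vec format, it could then be extracted and passed to a loader such as gensim's `KeyedVectors.load_word2vec_format` with `binary=True`; whether that holds for a given model depends on the archive's own metadata.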