[NLPL Users] [NLPL Team] updates from the NLPL vectors repository

Andrei Kutuzov andreku at ifi.uio.no
Sun Dec 29 20:08:34 CET 2019


Dear colleagues,

At the end of 2019, we announce the release of the NLPL vectors
repository version 2.0.

The NLPL vectors repository is a large collection of distributional
embeddings for multiple languages, stored in the NLPL project directory
below '/cluster/shared/nlpl/data/vectors/' (on Saga) and
'/proj/nlpl/data/vectors/' (on Taito). There is also an online service
for exploring and downloading the models, available at
http://vectors.nlpl.eu/repository/.

See more info about the repository at
http://wiki.nlpl.eu/index.php/Vectors/home

Main changes in the 2.0 release:

- The files have been moved from Abel to Saga.
- The repository now includes Finnish BERT models and more ELMo models
for different languages; the new ELMo models are trained by us on much
larger datasets than those used in the ELMoForManyLangs models that were
previously in the repository.
- More 'static' word embeddings (word2vec, fastText, GloVe, etc.) have
been added; we now feature more than 200 models overall.
- All ZIP archives with static word embeddings now contain not only the
plain-text model, but also the same model in the binary format, for
faster loading.
- The metadata has been made more consistent and is now documented at
http://wiki.nlpl.eu/index.php/Vectors/metadata
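
Since the archives now carry both formats, a loader can prefer the binary
file and fall back to plain text. Below is a minimal sketch of that choice
using only the Python standard library. The file suffixes ('.bin', '.txt',
'.vec') and the helper names are assumptions made for illustration, not the
documented archive layout; please check the metadata page above for the
actual file names.

```python
import io
import zipfile

# Assumed suffixes, for illustration only; the real archive layout may differ.
BINARY_SUFFIXES = (".bin",)
TEXT_SUFFIXES = (".txt", ".vec")


def pick_model_member(names):
    """Return the preferred model file name: binary if present, else text."""
    for suffixes in (BINARY_SUFFIXES, TEXT_SUFFIXES):
        for name in names:
            if name.lower().endswith(suffixes):
                return name
    return None


def open_model(archive):
    """Return (member name, in-memory byte stream) for the preferred model.

    `archive` is a path or a file-like object holding a ZIP archive.
    The whole member is read into memory, which is fine for a sketch
    but may be too heavy for very large models.
    """
    with zipfile.ZipFile(archive) as zf:
        member = pick_model_member(zf.namelist())
        if member is None:
            raise ValueError("no model file found in archive")
        return member, io.BytesIO(zf.read(member))
```

The returned stream can then be handed to whichever embedding library you
use. Note that this sketch treats any '.txt' member as a model file, so an
archive that also ships a text README under that suffix would need a
stricter rule.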


16.01.2019 16:52, Stephan Oepen wrote:
> dear colleagues,
> 
> at the end of last year, we had several project-internal deliverables,
> including:
> 
>   E.2 Updated Embeddings, including additional languages
> 
> i am happy to report that there have been continuous updates to our
> embeddings repository throughout the year.  since the end of 2017
> (repository version 1.1), we have added some 100 models for about 45
> languages.  we have also added a new type of models (ELMo for 44
> languages, contributed by the HIT team in china, trained on samples
> from the NLPL Common Crawl corpora that filip and colleagues compiled
> in 2017); and we agreed to collaborate with the
> ‘http://rusvectores.org’ project and act as their distribution channel
> for a variety of russian models.
> 
> some general documentation on the repository and an interactive search
> tool among available models are available on-line:
> 
>   http://wiki.nlpl.eu/index.php/Vectors/home
>   http://vectors.nlpl.eu/repository/
> 
> we now expect to freeze the current repository version (1.2) and start
> work towards a successor release.  the reason for minting a new
> version number in 2019 is not only that it nicely reflects the NLPL
> schedule of deliverables, but more contentfully that we have
> identified a number of non–backwards-compatible revisions that we want
> to apply to the metadata and internal structure of each repository
> entry (i.e. the ‘.zip’ archives containing the actual models).  we
> plan on presenting the current state of affairs during the upcoming
> winter school and ask for input on our plans for redesign.
> 
> andrey kutuzov (at UiO) continues to be the main developer of the
> repository; please do not hesitate to share your feedback or
> suggestions for improvement with andrey and me!
> 
> best wishes, oe
> 


-- 
Andrei
PhD Candidate at Language Technology Group (LTG)
University of Oslo

