[norlm] NorLM updates

Andrey Kutuzov andreku at ifi.uio.no
Fri Feb 11 15:24:26 UTC 2022


Dear NorLM subscribers,

The Norwegian Large-scale Language Models (NorLM) project is pleased to
release a new Norwegian model.

This is _NorBERT 2_: a BERT model trained from scratch on Norwegian data.
Unlike NorBERT 1, it is trained on a much larger corpus consisting of
two parts:

1) Norwegian Colossal Corpus (NCC), non-copyrighted part; 5 billion words;
2) C4 web-crawled corpus, Norwegian part; a random sample of about 9.5
billion words.

We have also enlarged the vocabulary size (50 000 compared to 30 000
subword tokens in NorBERT 1) and applied whole word masking during the
training. More details at the NorBERT web page
(http://wiki.nlpl.eu/Vectors/norlm/norbert).

Notably, the resulting model outperforms both NorBERT 1 and
NB-BERT-Base by the National Library on the binary sentiment analysis
task. It also fixes a long-standing bug
(https://github.com/ltgoslo/NorBERT/issues/1) that prevented using
NorBERT with the Simpletransformers library.

If you use NorBERT in any way in your work, we highly recommend
downloading the new version.

NorBERT 2 is available at the HuggingFace Model Hub
(https://huggingface.co/ltgoslo/norbert2) or as a direct download from
the NLPL Vector Repository
(http://vectors.nlpl.eu/repository/20/221.zip). Those using the Saga
cluster can access the model locally from
`/cluster/shared/nlpl/data/vectors/latest/221`.
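For those loading the model from the Hub, a minimal sketch using the
standard Hugging Face `transformers` API might look as follows (the
`load_norbert2` helper is our own illustrative name, not part of any
library):

```python
# Model id as published on the HuggingFace Model Hub (see URL above).
MODEL_NAME = "ltgoslo/norbert2"

def load_norbert2():
    """Load the NorBERT 2 tokenizer and masked-LM model from the Hub.

    Illustrative helper: the import is done lazily so this sketch only
    requires the `transformers` package when actually called.
    """
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
    return tokenizer, model
```

Calling `load_norbert2()` downloads and caches the weights on first
use; the same model id also works with `pipeline("fill-mask", ...)`.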

As usual, http://norlm.nlpl.eu is the main source of information about
our NorLM models.
Please feel free to ask anything on this mailing list as well.

-- 
Andrey
Language Technology Group (LTG)
University of Oslo
