From andreku at ifi.uio.no Fri Feb 11 15:24:26 2022
From: andreku at ifi.uio.no (Andrey Kutuzov)
Date: Fri, 11 Feb 2022 16:24:26 +0100
Subject: [norlm] NorLM updates
Message-ID: <6c62f67c-72ad-e102-642c-1c2912fa657a@ifi.uio.no>

Dear NorLM subscribers,

The Norwegian Large-scale Language Models (NorLM) project is pleased to
release a new Norwegian model. This is _NorBERT 2_: a BERT model trained
from scratch on Norwegian data. Unlike NorBERT 1, it is trained on a much
larger corpus consisting of two parts:

1) Norwegian Colossal Corpus (NCC), non-copyrighted part; 5 billion words;
2) C4 web-crawled corpus, Norwegian part; a random sample of about
9.5 billion words.

We have also enlarged the vocabulary (50 000 subword tokens, compared to
30 000 in NorBERT 1) and applied whole-word masking during training. More
details are available on the NorBERT web page
(http://wiki.nlpl.eu/Vectors/norlm/norbert).

Among other improvements, the resulting model outperforms both NorBERT 1
and NB-BERT-Base from the National Library on the binary sentiment
analysis task. It also fixes a long-standing bug
(https://github.com/ltgoslo/NorBERT/issues/1) which prevented using
NorBERT with the Simpletransformers library. If you use NorBERT in your
work in any way, we highly recommend downloading the new version.

NorBERT 2 is available at the HuggingFace Model Hub
(https://huggingface.co/ltgoslo/norbert2) or as a direct download from
the NLPL Vector Repository (http://vectors.nlpl.eu/repository/20/221.zip).
Those using the Saga cluster can access the model locally at
`/cluster/shared/nlpl/data/vectors/latest/221`.

As usual, http://norlm.nlpl.eu is the main source of information about
our NorLM models. Please feel free to ask anything in this mailing list
as well.

--
Andrey
Language Technology Group (LTG)
University of Oslo
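
P.S. For those who want to try the model right away, here is a minimal
sketch of loading NorBERT 2 from the HuggingFace Model Hub with the
`transformers` library. The model ID `ltgoslo/norbert2` is from the link
above; the masked-language-modelling usage and the Norwegian example
sentence are just illustrations, not an official recipe:

    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    # Load NorBERT 2 from the HuggingFace Model Hub
    tokenizer = AutoTokenizer.from_pretrained("ltgoslo/norbert2")
    model = AutoModelForMaskedLM.from_pretrained("ltgoslo/norbert2")

    # Predict a masked token in a Norwegian sentence
    # (example sentence is illustrative: "Oslo is a ___ in Norway.")
    text = f"Oslo er en {tokenizer.mask_token} i Norge."
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Find the mask position and print the top prediction
    mask_index = (inputs["input_ids"][0]
                  == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    predicted_id = logits[0, mask_index].argmax(dim=-1)
    print(tokenizer.decode(predicted_id))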
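
And since the Simpletransformers bug mentioned above is now fixed,
NorBERT 2 can also be plugged directly into that library, for example for
the kind of binary sentiment classification we evaluated on. A minimal
sketch, where the two training sentences are invented for illustration
(any labelled Norwegian sentiment data would do):

    import pandas as pd
    from simpletransformers.classification import ClassificationModel

    # Tiny illustrative training set: text + binary sentiment label
    # (these two examples are made up; use a real dataset in practice)
    train_df = pd.DataFrame(
        [["Denne filmen var fantastisk!", 1],
         ["Jeg likte den ikke i det hele tatt.", 0]],
        columns=["text", "labels"],
    )

    # NorBERT 2 is a standard BERT checkpoint, so model_type is "bert"
    model = ClassificationModel("bert", "ltgoslo/norbert2",
                                num_labels=2, use_cuda=False)
    model.train_model(train_df)

    predictions, raw_outputs = model.predict(["En strålende opplevelse."])
    print(predictions)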