[NLPL Task Force (A)] automated synchronization of NLPL vectors repository

Andrei Kutuzov andreku at ifi.uio.no
Thu Apr 12 12:38:55 UTC 2018


We probably could indeed introduce subdirectories for languages. But
in fact this would not help much: in the hypothetical 'Norwegian Bokmål'
directory, we would already see 7 different models trained on different
corpora and with different hyperparameters (note that several of these
models are trained on one and the same corpus, not just one model per
corpus). For English, this number is 32, and growing.

The problem is indeed that there are several dimensions along which one
can compare embedding models, and language is not always the most
important dimension among those. Some users would like to select all
German models trained on lemmatized corpora; others would like to select
only fastText models trained on word forms with context window size
between 5 and 10. These queries are equally valid.
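To make the point concrete, here is a minimal sketch of how such
cross-cutting queries look against a flat JSON catalogue, independent of
any directory hierarchy. The catalogue entries and field names below are
made up for illustration; the actual schema of the repository metadata
may well differ:

```python
import json

# Toy catalogue with hypothetical field names (not the real schema).
catalogue = json.loads("""
[
  {"id": 40, "language": "eng", "algorithm": "word2vec",
   "corpus": "CoNLL17", "lemmatized": false, "window": 10, "dimensions": 100},
  {"id": 41, "language": "deu", "algorithm": "fastText",
   "corpus": "wikipedia", "lemmatized": false, "window": 5, "dimensions": 300},
  {"id": 42, "language": "deu", "algorithm": "word2vec",
   "corpus": "wikipedia", "lemmatized": true, "window": 2, "dimensions": 100}
]
""")

# "all German models trained on lemmatized corpora"
german_lemma = [m for m in catalogue
                if m["language"] == "deu" and m["lemmatized"]]

# "fastText models trained on word forms (not lemmas)
#  with context window size between 5 and 10"
fasttext_forms = [m for m in catalogue
                  if m["algorithm"] == "fastText"
                  and not m["lemmatized"]
                  and 5 <= m["window"] <= 10]

print([m["id"] for m in german_lemma])    # [42]
print([m["id"] for m in fasttext_forms])  # [41]
```

Neither query maps onto a single path prefix, which is why no one
choice of directory nesting serves all users equally well.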

But as I've said, technically it is of course possible to add language
subdirectories. Considering everything said above, Filip and Jörg, do
you think it would make it easier for you to deal with the repository?


On 04/11/2018 09:19 PM, Stephan Oepen wrote:
>> Just imagine I would organise all files in OPUS with sequential numbers and provide a huge file that explains what they contain.
> 
> i believe an important difference is that OPUS inherently has a
> natural tree-shaped hierarchical structure.  for the internal
> organization of the vectors repository, on the other hand, there are
> several mostly independent dimensions that combine freely and, thus,
> can give rise to different hierarchical relations.
> 
> originally, we had considered a directory structure like the following
> 
>   language / corpus / framework / ... / identifier
> 
> but what about the language family ‘nor’ relative to its two sub-types
> ‘nob’ and ‘nno’ (bokmål and nynorsk), and where would a combined model
> for, say, danish, both written variants of norwegian, and swedish go?
> for english, some of the relevant corpora are tipster, gigaword,
> wikipedia, encow, and engc3.  it is customary to train models on
> various combinations of these; in which directory should these reside?
> wikipedia-derived corpora (and some others) naturally evolve over
> time, so they should be versioned; the same applies to different
> frameworks, reflecting updates and new releases of the software.
> 
> once we have committed to one ‘language’, ‘corpus’, and ‘framework’
> (e.g. glove), we need to pre-process the text: for english, we
> currently sentence-split, tokenize, and PoS tag using CoreNLP.  but
> one could imagine other tools, with different lemmatization
> conventions (and, of course, even different tokenization schemes, in
> principle), so a user needs to be able to select on this basis.
> furthermore, each framework has a custom set of hyper-parameters, e.g.
> window size and dimensions, and for different tasks you might prefer
> vectors of dimension 50 over ones of dimension 300.
> 
> before we decided to encode the meta information in JSON, we were
> trying to package the above into the file names, e.g. the ‘identifier’
> could be something like ‘corenlp350_lemma+pos_10_100_16’.  at this
> point, we were looking at fairly long and unpredictable paths, which i
> believe would not really afford the kind of straightforward scripting
> you have in mind.  equally importantly, we wanted to have a unique URI
> scheme for each model, somewhat like a persistent handle.  in the
> current universe, the identifier for the english CoNLL 2017 model, for
> example, is ‘http://vectors.nlpl.eu/repository/11/40.zip’.  this fits
> nicely into a table or footnote, much more so than the much longer
> path names sketched above (trying to encode all meta information in a
> human-readable string).
> 
> —i am grateful for the feedback and discussion, but i suspect that
> what you have in mind may be similar to where we started?  so the
> above is an attempt to spell out more of the thought process that led
> us towards the current scheme.  at least some of this, we should
> probably add to the wiki page for the vectors repository.  and i would
> absolutely not want to claim that the current design cannot be
> improved on further.  but the above considerations, in my view, should
> provide the starting point for that search for catalogue improvements.
> 
> best wishes, oe
> 


-- 
Andrei
PhD Candidate at Language Technology Group (LTG)
University of Oslo


