[NLPL Task Force (A)] automated synchronization of NLPL vectors repository

Wed Apr 11 19:19:55 UTC 2018

> Just imagine I would organise all files in OPUS with sequential numbers and provide a huge file that explains what they contain.

i believe an important difference is that OPUS inherently has a
natural tree-shaped hierarchical structure.  for the internal
organization of the vectors repository, on the other hand, there are
several mostly independent dimensions that combine freely and, thus,
can give rise to different hierarchical relations.

originally, we had considered a directory structure like the following:

  language / corpus / framework / ... / identifier

but what about the language family ‘nor’ relative to its two sub-types
‘nob’ and ‘nno’ (bokmål and nynorsk), and where would a combined model
for, say, danish, both written variants of norwegian, and swedish go?
for english, some of the relevant corpora are tipster, gigaword,
wikipedia, encow, and engc3.  it is customary to train models on
various combinations of these; in which directory should these reside?
wikipedia-derived corpora (and some others) naturally evolve over
time, so they should be versioned; the same applies to different
frameworks, reflecting updates and new releases of the software.
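
to make the placement problem concrete, here is a minimal python sketch
that lists every ‘language / corpus / framework’ directory a model could
claim.  the function name, the codes ‘eng’, ‘dan’, and ‘swe’, and the
choice of wikipedia as the corpus for the combined model are assumptions
made for illustration only; they do not describe the actual repository.

```python
from itertools import product


def candidate_directories(languages, corpora, framework):
    """List every language/corpus/framework directory a model could claim."""
    return sorted(
        "/".join(parts)
        for parts in product(sorted(languages), sorted(corpora), [framework])
    )


# an english model trained on two corpora fits two directories equally well
english = candidate_directories({"eng"}, {"gigaword", "wikipedia"}, "glove")
print(english)

# a combined model for danish, bokmål, nynorsk, and swedish fits four
scandinavian = candidate_directories(
    {"dan", "nob", "nno", "swe"}, {"wikipedia"}, "glove"
)
print(len(scandinavian))
```

as soon as a model spans more than one language or corpus, the tree
offers several equally plausible homes and no principled way to pick
one of them.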

once we have committed to one ‘language’, ‘corpus’, and ‘framework’
(e.g. glove), we need to pre-process the text: for english, we
currently sentence-split, tokenize, and PoS tag using CoreNLP.  but
one could imagine other tools, with different lemmatization
conventions (and, of course, even different tokenization schemes, in
principle), so a user needs to be able to select on this basis.
furthermore, each framework has a custom set of hyper-parameters, e.g.
window size and dimensions, and for different tasks you might prefer
vectors of dimension 50 over ones of dimension 300.
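
selection along several independent dimensions is easier to picture over
flat records than over a directory tree.  the following python sketch
assumes a hypothetical record layout; the field names, the values, and
the second framework (‘word2vec’) are invented for illustration and are
not taken from the repository.

```python
def select(models, **criteria):
    """Return the models whose metadata match every given criterion."""
    return [
        model
        for model in models
        if all(model.get(key) == value for key, value in criteria.items())
    ]


# hypothetical records; field names and values are illustrative only
MODELS = [
    {"id": 1, "language": "eng", "framework": "glove",
     "lemmatized": True, "window": 5, "dimensions": 50},
    {"id": 2, "language": "eng", "framework": "glove",
     "lemmatized": True, "window": 5, "dimensions": 300},
    {"id": 3, "language": "eng", "framework": "word2vec",
     "lemmatized": False, "window": 10, "dimensions": 300},
]

# any combination of dimensions can serve as the selection key
small = select(MODELS, framework="glove", dimensions=50)
large = select(MODELS, dimensions=300)
print([model["id"] for model in small])
print([model["id"] for model in large])
```

no dimension is privileged here: a user can filter on the framework, on
the pre-processing, or on a hyper-parameter without knowing in which
order a path would have encoded them.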

before we decided to encode the meta information in JSON, we were
trying to package the above into the file names, e.g. the ‘identifier’
could be something like ‘corenlp350_lemma+pos_10_100_16’.  at this
point, we were looking at fairly long and unpredictable paths, which i
believe would not really afford the kind of straightforward scripting
you have in mind.  equally importantly, we wanted to have a unique URI
scheme for each model, somewhat like a persistent handle.  in the
current universe, the identifier for the english CoNLL 2017 model, for
example, is ‘http://vectors.nlpl.eu/repository/11/40.zip’.  this fits
nicely into a table or footnote, much more so than the much longer
path names sketched above (trying to encode all meta information in a
human-readable string).
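
the division of labour between a short identifier and a separate
catalogue can be sketched as follows (python).  the URL is the one quoted
above; what the two numbers in it stand for is not spelled out here, so
the parameter names ‘release’ and ‘model_id’ are guesses, and the
catalogue layout is a hypothetical stand-in for the actual JSON file.

```python
import json
from pathlib import PurePosixPath
from urllib.parse import urlparse

BASE = "http://vectors.nlpl.eu/repository"


def model_url(release, model_id):
    """Build a short URL; the meaning of both numbers is an assumption."""
    return f"{BASE}/{release}/{model_id}.zip"


def describe(url, catalogue):
    """Look up the catalogue entry for a model URL via its numeric name."""
    model_id = PurePosixPath(urlparse(url).path).stem
    return catalogue[model_id]


# hypothetical catalogue; the real JSON layout may differ
CATALOGUE = json.loads("""
{
  "40": {"description": "english CoNLL 2017 model"}
}
""")

url = model_url(11, 40)
print(url)
print(describe(url, CATALOGUE)["description"])
```

the identifier stays short and stable, while everything a human or a
script needs to know about the model lives in the catalogue entry.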

—i am grateful for the feedback and discussion, but i suspect that
what you have in mind may be similar to where we started?  so the
above is an attempt to spell out more of the thought process that led
us towards the current scheme.  at least some of this, we should
probably add to the wiki page for the vectors repository.  and i would
absolutely not want to claim that the current design cannot be
improved on further.  but the above considerations, in my view, should
provide the starting point for that search for catalogue improvements.

best wishes, oe