[NLPL Task Force (A)] automated synchronization of NLPL vectors repository

Andrei Kutuzov andreku at ifi.uio.no
Wed Apr 11 14:48:42 UTC 2018


The "/dataset/version/eng.vectors.xz" naming scheme will only work if
you have precisely one model for each language. That is already not
the case for this repository, and it will become even more
problematic as more models are added.

I also like plain and simple file and directory names very much :-) But
in this case the models have many different parameters, and
unfortunately one can't cover all of them in file names.
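
To illustrate the point, here is a minimal sketch in Python of how a
catalogue lookup can tell apart several models for the same language.
The field names, the parameter values, and every entry except the
"11/40.zip" location mentioned in this thread are assumptions made up
for the example; this is not the actual schema of the NLPL catalogue.

```python
import json

# Hypothetical catalogue excerpt. The field names and values are invented
# for illustration and do NOT reflect the real NLPL catalogue schema.
# Only the id 40 (archive '11/40.zip') is taken from this thread.
CATALOGUE_JSON = """
[
  {"id": 40, "language": "eng", "corpus": "CoNLL17", "dimensions": 100},
  {"id": 7,  "language": "eng", "corpus": "ExampleCorpus", "dimensions": 300},
  {"id": 8,  "language": "fin", "corpus": "CoNLL17", "dimensions": 100}
]
"""


def find_models(catalogue, **criteria):
    """Return all entries whose fields match every given criterion."""
    return [
        entry
        for entry in catalogue
        if all(entry.get(key) == value for key, value in criteria.items())
    ]


def archive_path(entry, version="11"):
    """Build the relative archive path for an entry, e.g. '11/40.zip'."""
    return f"{version}/{entry['id']}.zip"


catalogue = json.loads(CATALOGUE_JSON)

# Language alone is ambiguous: two English models match.
english = find_models(catalogue, language="eng")

# Adding a second parameter narrows the lookup down to one model.
conll = find_models(catalogue, language="eng", corpus="CoNLL17")

for entry in conll:
    print(archive_path(entry))
```

A path like "eng.vectors.xz" could name only one of the two English
entries above, whereas the lookup can filter on any combination of
fields that the catalogue happens to record.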

Still, I think it would be great to discuss this further and maybe
find a solution that suits everyone.


On 04/11/2018 03:38 PM, Tiedemann, Jörg wrote:
> 
> Well, the file system itself is a catalogue and I don't see why we cannot use it to organize our data. I agree with Filip that parsing a potentially large JSON file to find the location of a resource I'm looking for is not very convenient. Using standards is fine but can't that be achieved by using path and file name?
> 
> .../dataset/version/eng.vectors.xz
> 
> No one prevents us from also providing a json file or anything else to find that resource as well but having unintuitive file paths and names is not very convenient from my point of view.
> 
> Maybe something to discuss with the whole team at some point.
> 
> Best,
> Jörg 
> 
> 
> 
>> On 11 Apr 2018, at 14.21, Andrei Kutuzov <andreku at ifi.uio.no> wrote:
>>
>> Hi all,
>>
>> By the way, one can use the alias data/vectors/latest.json and not
>> bother about the version of the JSON catalog file (11 or any other). The same
>> is true for the name of the directory with the actual models.
>>
>>> On 04/11/2018 01:02 PM, Stephan Oepen wrote:
>>> hi filip,
>>>
>>> thanks for your feedback.  i am happy for the opportunity to discuss
>>> usability aspects!  i am copying andrey, the maintainer of the vectors
>>> repository.
>>>
>>> casting the catalogue in JSON (with standard language codes) actually
>>> was intended to make script-based access easier while improving
>>> scalability.  consider the analogy of maintaining a small number of
>>> books on a shelf in your office vs. curating a much larger collection
>>> in a library.  ultimately, your code needs to be informed about the
>>> path to the model.  before, you happened to know and remember
>>> ‘.../conll17/English/en.vectors.xz’, whereas now the corresponding
>>> string is something like ‘.../11/40.zip’, which you first had to look
>>> up in the catalogue and may find harder to remember.  but either way,
>>> each model corresponds to one location on disk, so users and scripts
>>> need to be able to look these up.  before, one had to either know or
>>> browse the directory structure and interpret non-standard language
>>> names; this would have been harder to script than loading the JSON
>>> file and looking up by ISO language code, i would argue.  also, the
>>> old directory collection only provided limited metadata, there was no
>>> principled way to add entries (we have another 31 english models in
>>> the current repository, besides the CoNLL 2017 one), and there was no
>>> support for versioning (other than duplicating directories).  our hope
>>> is that these benefits and the expected growth over time (while
>>> maintaining stability and replicability for older entries) will
>>> outweigh the one-time inconvenience for expert users like you to learn
>>> the new naming scheme.
>>>
>>> finally, we recommend to not unzip the archive (which would lead to
>>> data duplication) but rather read the vectors from the archive
>>> in-situ.  there is some example code for how to do that in python
>>> here:
>>>
>>> http://wiki.nlpl.eu/index.php/Vectors/home
>>>
>>> —seriously, are you not at all swayed by at least some of these
>>> arguments?  i think one could discuss two fundamental questions, at
>>> least: (a) ‘local’ shelf vs. public library, and (b) how best to
>>> organize the library, in particular its catalogue.  i would like to
>>> think that NLPL is about the library way of thinking, so would be
>>> happiest to hear specific suggestions for how to better catalogue it.
>>> what format would you suggest for the catalogue, provided that it
>>> needs to afford multiple ways of looking for individual models?
>>>
>>> best wishes, oe
>>>
>>>
>>>> On Wed, Apr 11, 2018 at 11:48 AM, Filip Ginter <figint at utu.fi> wrote:
>>>> Hi guys
>>>>
>>>> Before, when I needed English vectors from CoNLL17, I read the file
>>>> vectors/CoNLL17/en.vectors.xz on taito. Stephan was unhappy about this and
>>>> asked me to delete the vectors from /proj/nlpl. Now, to achieve the same, I
>>>> need to parse a json file helpfully named 11.json, which eventually tells me
>>>> the vectors are in the file 11/40.zip, which I then need to unzip and then I
>>>> get my vectors. That is not what I would call an improvement over
>>>> vectors/CoNLL17/en.vectors.xz . :D  ...not to complain or anything, I will
>>>> grab my own copy of these vectors from CoNLL and ignore the /proj/nlpl
>>>> version, but maybe you want to be aware that this choice of layered,
>>>> numbered files does not suit a script-driven workflow. :)
>>>>
>>>> Cheers
>>>>
>>>> F
>>>>
>>>>
>>>>> On Thu, Mar 8, 2018 at 10:06 AM, Filip Ginter <figint at utu.fi> wrote:
>>>>>
>>>>> Hi Stephan
>>>>>
>>>>> Wiped.
>>>>>
>>>>> I have my own copy to run my scripts, which pays off now. :D
>>>>>
>>>>> Cheers
>>>>>
>>>>> F
>>>>>
>>>>>
>>>>>> On Tue, Mar 6, 2018 at 4:07 PM, Stephan Oepen <oe at ifi.uio.no> wrote:
>>>>>>
>>>>>> hi filip,
>>>>>>
>>>>>> andrey and i have a new release candidate of the NLPL vectors
>>>>>> repository ready for public announcement; we presented the emerging
>>>>>> structure at the NLPL walk-through during the winter school.  for some
>>>>>> general information, please see:
>>>>>>
>>>>>>  http://wiki.nlpl.eu/index.php/Vectors/home
>>>>>>
>>>>>> to actually look at the current set of files, you will have to log in
>>>>>> to Abel.  but i would like to change that and make
>>>>>> ‘/proj/nlpl/data/vectors/’ a replica of the corresponding Abel
>>>>>> directory, using rsync(1).
>>>>>>
>>>>>> to accomplish that, could i ask you to remove the contents of the
>>>>>> current ‘/proj/nlpl/data/vectors/’ on Taito please?  we have included
>>>>>> the CoNLL 2017 embeddings in version 1.1 of the NLPL repository, so
>>>>>> putting a copy of the NLPL vectors repository on Taito will still make
>>>>>> those models available (albeit using a different naming scheme, of
>>>>>> course).
>>>>>>
>>>>>> with thanks in advance, oe
>>>>>
>>>>>
>>>>
>>
>>
>> -- 
>> Andrei
>> PhD Candidate at Language Technology Group (LTG)
>> University of Oslo


-- 
Andrei
PhD Candidate at Language Technology Group (LTG)
University of Oslo


