[NLPL Task Force (A)] automated synchronization of NLPL vectors repository
Stephan Oepen
oe at ifi.uio.no
Wed Apr 11 11:02:44 UTC 2018
hi filip,
thanks for your feedback. i am happy for the opportunity to discuss
usability aspects! i am copying andrey, the maintainer of the vectors
repository.
casting the catalogue in JSON (with standard language codes) actually
was intended to make script-based access easier while improving
scalability. consider the analogy of maintaining a small number of
books on a shelf in your office vs. curating a much larger collection
in a library. ultimately, your code needs to be informed about the
path to the model. before, you happened to know and remember
‘.../conll17/English/en.vectors.xz’, whereas now the corresponding
string is something like ‘.../11/40.zip’, which you first have to look
up in the catalogue and may find harder to remember. but either way,
each model corresponds to one location on disk, so users and scripts
need to be able to look these up. before, one had to either know or
browse the directory structure and interpret non-standard language
names; this would have been harder to script than loading the JSON
file and looking up by ISO language code, i would argue. also, the
old directory collection provided only limited metadata, offered no
principled way to add entries (we have another 31 english models in
the current repository, besides the CoNLL 2017 one), and had no
support for versioning (other than duplicating directories). our hope
is that these benefits and the expected growth over time (while
maintaining stability and replicability for older entries) will
outweigh the one-time inconvenience for expert users like you to learn
the new naming scheme.
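to illustrate the kind of scripted lookup meant above, here is a minimal
sketch in python. it is not the actual catalogue schema: the top-level
‘models’ list and the per-entry keys ‘id’, ‘language’, and ‘corpus’ are
assumptions made for the example, as are the function name and the
language code ‘eng’. the only thing taken from this thread is the
convention that the archives listed in 11.json sit in the sibling
directory 11/.

```python
import json
from pathlib import Path


def find_models(catalogue_path, language, corpus=None):
    """Return archive paths for all catalogue entries matching the criteria.

    Hypothetical schema (NOT the confirmed NLPL format):
        {"models": [{"id": 40, "language": "eng", "corpus": "CoNLL17"}, ...]}
    """
    catalogue_path = Path(catalogue_path)
    with catalogue_path.open(encoding="utf-8") as handle:
        catalogue = json.load(handle)

    # Convention from the thread: archives for 11.json live in 11/<id>.zip.
    archive_dir = catalogue_path.with_suffix("")

    matches = []
    for entry in catalogue["models"]:
        if entry["language"] != language:
            continue
        # The corpus filter is optional; None means "any corpus".
        if corpus is not None and entry.get("corpus") != corpus:
            continue
        matches.append(archive_dir / f"{entry['id']}.zip")
    return matches
```

under these assumptions, a call such as
find_models(".../11.json", "eng", corpus="CoNLL17") would return the
path to the one matching archive, and dropping the corpus argument would
list every english model in that version of the repository.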
finally, we recommend not unzipping the archive (which would lead to
data duplication) but rather reading the vectors from the archive
in situ. there is some example code for how to do that in python
here:
http://wiki.nlpl.eu/index.php/Vectors/home
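independently of the wiki example, reading in situ can be sketched with
the python standard library alone. the following is only an
illustration under stated assumptions: that the archive contains a
member called ‘model.txt’, and that this member is in the word2vec text
format (one header line with count and dimensionality, then one word
and its values per line). both the member name and the function name
are hypothetical.

```python
import io
import zipfile


def iter_vectors(archive_path, member="model.txt"):
    """Yield (word, vector) pairs straight from the archive, without unpacking it.

    Assumes `member` is a UTF-8 text file in word2vec text format; the
    default member name is an assumption, not a documented guarantee.
    """
    with zipfile.ZipFile(archive_path) as archive:
        # ZipFile.open() streams the member, so nothing is written to disk.
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8")
            next(text, None)  # skip the assumed "<count> <dimensions>" header
            for line in text:
                parts = line.rstrip().split(" ")
                if len(parts) < 2:
                    continue  # ignore blank or malformed lines
                yield parts[0], [float(value) for value in parts[1:]]
```

because this is a generator, a script can stop after the first few
thousand words or filter by vocabulary without ever holding the whole
model in memory, and no second copy of the data ends up on disk.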
seriously, are you not at all swayed by at least some of these
arguments? i think one could discuss two fundamental questions, at
least: (a) ‘local’ shelf vs. public library, and (b) how best to
organize the library, in particular its catalogue. i would like to
think that NLPL is about the library way of thinking, so would be
happiest to hear specific suggestions for how to better catalogue it.
what format would you suggest for the catalogue, provided that it
needs to afford multiple ways of looking for individual models?
best wishes, oe
On Wed, Apr 11, 2018 at 11:48 AM, Filip Ginter <figint at utu.fi> wrote:
> Hi guys
>
> Before, when I needed English vectors from CoNLL17, I read the file
> vectors/CoNLL17/en.vectors.xz on taito. Stephan was unhappy about this and
> asked me to delete the vectors from /proj/nlpl. Now, to achieve the same, I
> need to parse a json file helpfully named 11.json, which eventually tells me
> the vectors are in the file 11/40.zip, which I then need to unzip and then I
> get my vectors. That is not what I would call an improvement over
> vectors/CoNLL17/en.vectors.xz . :D ...not to complain or anything, I will
> grab my own copy of these vectors from CoNLL and ignore the /proj/nlpl
> version, but maybe you want to be aware that this choice of layered,
> numbered files does not suit a script-driven workflow. :)
>
> Cheers
>
> F
>
>
> On Thu, Mar 8, 2018 at 10:06 AM, Filip Ginter <figint at utu.fi> wrote:
>>
>> Hi Stephan
>>
>> Wiped.
>>
>> I have my own copy to run my scripts, which pays off now. :D
>>
>> Cheers
>>
>> F
>>
>>
>> On Tue, Mar 6, 2018 at 4:07 PM, Stephan Oepen <oe at ifi.uio.no> wrote:
>>>
>>> hi filip,
>>>
>>> andrey and i have a new release candidate of the NLPL vectors
>>> repository ready for public announcement; we presented the emerging
>>> structure at the NLPL walk-through during the winter school. for some
>>> general information, please see:
>>>
>>> http://wiki.nlpl.eu/index.php/Vectors/home
>>>
>>> to actually look at the current set of files, you will have to log in
>>> to Abel. but i would like to change that and make
>>> ‘/proj/nlpl/data/vectors/’ a replica of the corresponding Abel
>>> directory, using rsync(1).
>>>
>>> to accomplish that, could i ask you to remove the contents of the
>>> current ‘/proj/nlpl/data/vectors/’ on Taito please? we have included
>>> the CoNLL 2017 embeddings in version 1.1 of the NLPL repository, so
>>> putting a copy of the NLPL vectors repository on Taito will still make
>>> those models available (albeit using a different naming scheme, of
>>> course).
>>>
>>> with thanks in advance, oe
>>
>>
>