[NLPL Task Force (A)] automated synchronization of NLPL vectors repository
Stephan Oepen
oe at ifi.uio.no
Wed Apr 11 15:03:05 UTC 2018
for human users, there is an interactive catalogue browser at
http://vectors.nlpl.eu/repository/
the identifier on each entry corresponds to the file name on disk. one
could maybe imagine a find(1)-like command-line tool that supports querying
the catalogue?
the filesystem as a catalogue is restricted to a tree, so one needs to
impose a total order on indexing criteria, e.g. something like language /
corpus / framework / tokenizer / lemma type / dimensions / ...
in this scheme, it would be hard work finding, say, all models over
wikipedia-derived corpora, or all fasttext models using pos-disambiguated
lemmas.
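with a flat JSON catalogue, by contrast, any metadata field can serve as a query key. here is a minimal python sketch of such a find(1)-like query, over a made-up catalogue snippet (the field names and values below are my assumptions for illustration; consult the actual latest.json for the real schema):

```python
import json

# illustrative snippet of the catalogue; the real file lives at
# data/vectors/latest.json, and its exact field names may differ
CATALOGUE = json.loads("""
[
  {"id": 40, "language": "eng", "corpus": "English CoNLL17 corpus",
   "algorithm": "Word2Vec Continuous Skipgram", "dimensions": 100},
  {"id": 41, "language": "eng", "corpus": "English Wikipedia Dump 02/2017",
   "algorithm": "fastText Skipgram", "dimensions": 300},
  {"id": 42, "language": "rus", "corpus": "Russian Wikipedia Dump 02/2017",
   "algorithm": "Word2Vec Continuous Skipgram", "dimensions": 300}
]
""")

def find_models(catalogue, **criteria):
    """find(1)-style query: keep entries whose metadata matches every
    criterion (case-insensitive substring match for strings, equality
    otherwise)."""
    def matches(entry):
        for key, wanted in criteria.items():
            value = entry.get(key)
            if isinstance(value, str) and isinstance(wanted, str):
                if wanted.lower() not in value.lower():
                    return False
            elif value != wanted:
                return False
        return True
    return [entry for entry in catalogue if matches(entry)]

# all models over wikipedia-derived corpora, regardless of directory order
print([e["id"] for e in find_models(CATALOGUE, corpus="wikipedia")])
# all fastText models
print([e["id"] for e in find_models(CATALOGUE, algorithm="fasttext")])
```

note that neither query could be answered by walking a fixed directory tree unless the corresponding criterion happened to be the top-level split.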
yes, we would like to hear experience reports and suggestions for improved
usability. but i do believe the library analogy is informative, and it is
worth investing effort in flexibility and scalability for the future.
best, oe
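ps: regarding the in-situ reading recommended further down in the thread, here is a minimal stdlib-only sketch, assuming each repository entry packages its vectors as a model.txt in word2vec text format inside the zip (an assumption on my part; the wiki page cited in the thread remains the authoritative example):

```python
import io
import zipfile

# build a tiny stand-in archive in memory; a real entry such as 11/40.zip
# is assumed to package its vectors as model.txt in word2vec text format
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("model.txt",
                     "2 3\n"             # header: vocabulary size, dimensions
                     "cat 0.1 0.2 0.3\n"
                     "dog 0.4 0.5 0.6\n")

def load_vectors(archive_file, member="model.txt"):
    """read word2vec text-format vectors straight out of a zip archive,
    without unpacking anything to disk."""
    vectors = {}
    with zipfile.ZipFile(archive_file) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8")
            vocab_size, dimensions = map(int, text.readline().split())
            for line in text:
                word, *values = line.rstrip("\n").split(" ")
                vectors[word] = [float(v) for v in values]
    return vectors

vectors = load_vectors(buffer)
print(vectors["cat"])            # [0.1, 0.2, 0.3]
```

for a real entry one would pass the path to the downloaded archive, e.g. load_vectors("40.zip"), instead of the in-memory stand-in.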
On Wed, 11 Apr 2018 at 16:49 Andrei Kutuzov <andreku at ifi.uio.no> wrote:
> The "/dataset/version/eng.vectors.xz" naming scheme will only work if
> you have precisely one model per language. That is already not the case
> for this repository, and it will become even more problematic as more
> models are added.
>
> I also like plain and simple file and directory names very much :-) But
> in this case there are many different parameters to the models, and one
> unfortunately can't cover all of them in file names.
>
> Still, I think it would be great to discuss it further and maybe to
> find a solution suitable for all.
>
>
> On 04/11/2018 03:38 PM, Tiedemann, Jörg wrote:
> >
> > Well, the file system itself is a catalogue and I don't see why we
> > cannot use it to organize our data. I agree with Filip that parsing a
> > potentially large JSON file to find the location of a resource I'm
> > looking for is not very convenient. Using standards is fine, but can't
> > that be achieved by using path and file name?
> >
> > .../dataset/version/eng.vectors.xz
> >
> > No one prevents us from also providing a json file or anything else to
> > find that resource as well, but having unintuitive file paths and names
> > is not very convenient from my point of view.
> >
> > Maybe something to discuss with the whole team at some point.
> >
> > Best,
> > Jörg
> >
> >
> >
> >> On 11 Apr 2018, at 14.21, Andrei Kutuzov <andreku at ifi.uio.no> wrote:
> >>
> >> Hi all,
> >>
> >> By the way, one can use the alias data/vectors/latest.json and not
> >> bother about the version of the JSON catalog file (11 or else). The same
> >> is true for the name of the directory with the actual models.
> >>
> >>> On 04/11/2018 01:02 PM, Stephan Oepen wrote:
> >>> hi filip,
> >>>
> >>> thanks for your feedback. i am happy for the opportunity to discuss
> >>> usability aspects! i am copying andrey, the maintainer of the vectors
> >>> repository.
> >>>
> >>> casting the catalogue in JSON (with standard language codes) actually
> >>> was intended to make script-based access easier while improving
> >>> scalability. consider the analogy of maintaining a small number of
> >>> books on a shelf in your office vs. curating a much larger collection
> >>> in a library. ultimately, your code needs to be informed about the
> >>> path to the model. before, you happened to know and remember
> >>> ‘.../conll17/English/en.vectors.xz’, whereas now the corresponding
> >>> string is something like ‘.../11/40.zip’, which you first had to look
> >>> up in the catalogue and may find harder to remember. but either way,
> >>> each model corresponds to one location on disk, so users and scripts
> >>> need to be able to look these up. before, one had to either know or
> >>> browse the directory structure and interpret non-standard language
> >>> names; this would have been harder to script than loading the JSON
> >>> file and looking up by ISO language code, i would argue. also, the
> >>> old directory collection only provided limited metadata, there was no
> >>> principled way to add entries (we have another 31 english models in
> >>> the current repository, besides the CoNLL 2017 one), and there was no
> >>> support for versioning (other than duplicating directories). our hope
> >>> is that these benefits and the expected growth over time (while
> >>> maintaining stability and replicability for older entries) will
> >>> outweigh the one-time inconvenience for expert users like you to learn
> >>> the new naming scheme.
> >>>
> >>> finally, we recommend not unzipping the archive (which would lead to
> >>> data duplication) but rather reading the vectors from the archive
> >>> in situ. there is some example code for how to do that in python
> >>> here:
> >>>
> >>> http://wiki.nlpl.eu/index.php/Vectors/home
> >>>
> >>> —seriously, are you not at all swayed by at least some of these
> >>> arguments? i think one could discuss two fundamental questions, at
> >>> least: (a) ‘local’ shelf vs. public library, and (b) how best to
> >>> organize the library, in particular its catalogue. i would like to
> >>> think that NLPL is about the library way of thinking, so would be
> >>> happiest to hear specific suggestions for how to better catalogue it.
> >>> what format would you suggest for the catalogue, provided that it
> >>> needs to afford multiple ways of looking for individual models?
> >>>
> >>> best wishes, oe
> >>>
> >>>
> >>>> On Wed, Apr 11, 2018 at 11:48 AM, Filip Ginter <figint at utu.fi> wrote:
> >>>> Hi guys
> >>>>
> >>>> Before, when I needed English vectors from CoNLL17, I read the file
> >>>> vectors/CoNLL17/en.vectors.xz on taito. Stephan was unhappy about this
> >>>> and asked me to delete the vectors from /proj/nlpl. Now, to achieve the
> >>>> same, I need to parse a json file helpfully named 11.json, which
> >>>> eventually tells me the vectors are in the file 11/40.zip, which I then
> >>>> need to unzip and then I get my vectors. That is not what I would call
> >>>> an improvement over vectors/CoNLL17/en.vectors.xz . :D ...not to
> >>>> complain or anything, I will grab my own copy of these vectors from
> >>>> CoNLL and ignore the /proj/nlpl version, but maybe you want to be aware
> >>>> that this choice of layered, numbered files does not suit a
> >>>> script-driven workflow. :)
> >>>>
> >>>> Cheers
> >>>>
> >>>> F
> >>>>
> >>>>
> >>>>> On Thu, Mar 8, 2018 at 10:06 AM, Filip Ginter <figint at utu.fi> wrote:
> >>>>>
> >>>>> Hi Stephan
> >>>>>
> >>>>> Wiped.
> >>>>>
> >>>>> I have my own copy to run my scripts, which pays off now. :D
> >>>>>
> >>>>> Cheers
> >>>>>
> >>>>> F
> >>>>>
> >>>>>
> >>>>>> On Tue, Mar 6, 2018 at 4:07 PM, Stephan Oepen <oe at ifi.uio.no> wrote:
> >>>>>>
> >>>>>> hi filip,
> >>>>>>
> >>>>>> andrey and i have a new release candidate of the NLPL vectors
> >>>>>> repository ready for public announcement; we presented the emerging
> >>>>>> structure at the NLPL walk-through during the winter school. for some
> >>>>>> general information, please see:
> >>>>>>
> >>>>>> http://wiki.nlpl.eu/index.php/Vectors/home
> >>>>>>
> >>>>>> to actually look at the current set of files, you will have to log
> >>>>>> in to Abel. but i would like to change that and make
> >>>>>> ‘/proj/nlpl/data/vectors/’ a replica of the corresponding Abel
> >>>>>> directory, using rsync(1).
> >>>>>>
> >>>>>> to accomplish that, could i ask you to remove the contents of the
> >>>>>> current ‘/proj/nlpl/data/vectors/’ on Taito please? we have included
> >>>>>> the CoNLL 2017 embeddings in version 1.1 of the NLPL repository, so
> >>>>>> putting a copy of the NLPL vectors repository on Taito will still make
> >>>>>> those models available (albeit using a different naming scheme, of
> >>>>>> course).
> >>>>>>
> >>>>>> with thanks in advance, oe
> >>>>>
> >>>>>
> >>>>
> >>
> >>
> >> --
> >> Andrei
> >> PhD Candidate at Language Technology Group (LTG)
> >> University of Oslo
>
>
> --
> Andrei
> PhD Candidate at Language Technology Group (LTG)
> University of Oslo
>