<div><div dir="auto">for human users, there is an interactive catalogue browser at</div><div dir="auto"><br></div><div dir="auto"><div><a href="http://vectors.nlpl.eu/repository/">http://vectors.nlpl.eu/repository/</a></div><br></div><div dir="auto">the identifier on each entry corresponds to the file name on disk. one could maybe imagine a find(1)-like command-line tool that supports querying the catalogue?</div><div dir="auto"><br></div><div dir="auto">the filesystem as a catalogue is restricted to a tree, so one needs to impose a total order on indexing criteria, e.g. something like language / corpus / framework / tokenizer / lemma type / dimensions / ...</div><div dir="auto"><br></div><div dir="auto">in this scheme, it would be hard work to find, say, all models over wikipedia-derived corpora, or all fasttext models using pos-disambiguated lemmas, because those criteria cut across the branches of the tree.</div><div dir="auto"><br></div><div dir="auto">yes, we would like to hear experience reports and suggestions for increased usability. but i do believe the library analogy is informative, and it is worth putting effort into flexibility and scalability for the future.</div><div dir="auto"><br></div><div dir="auto">best, oe</div><div dir="auto"><br></div><br><div class="gmail_quote"><div>On Wed, 11 Apr 2018 at 16:49 Andrei Kutuzov <<a href="mailto:andreku@ifi.uio.no">andreku@ifi.uio.no</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">The "/dataset/version/eng.vectors.xz" naming scheme will only work if<br> you have precisely one model for each language. 
This is certainly not<br> the case for this repository already, and will become even more<br> problematic in the future, when more models arrive.<br> <br> I also like plain and simple file and directory names very much :-) But<br> in this case there are a lot of different model parameters, and one<br> unfortunately can't cover all of them in filenames.<br> <br> Still, I think it would be great to discuss it further and maybe to<br> find a solution suitable for all.<br> <br> <br> On 04/11/2018 03:38 PM, Tiedemann, Jörg wrote:<br> > <br> > Well, the file system itself is a catalogue and I don't see why we cannot use it to organize our data. I agree with Filip that parsing a potentially large JSON file to find the location of a resource I'm looking for is not very convenient. Using standards is fine but can't that be achieved by using path and file name?<br> > <br> > .../dataset/version/eng.vectors.xz<br> > <br> > Nothing prevents us from also providing a json file or anything else to find that resource as well but having unintuitive file paths and names is not very convenient from my point of view.<br> > <br> > Maybe something to discuss with the whole team at some point.<br> > <br> > Best,<br> > Jörg <br> > <br> > <br> > <br> >> On 11 Apr 2018, at 14.21, Andrei Kutuzov <<a href="mailto:andreku@ifi.uio.no" target="_blank">andreku@ifi.uio.no</a>> wrote:<br> >><br> >> Hi all,<br> >><br> >> By the way, one can use the alias data/vectors/latest.json and not<br> >> bother about the version of the JSON catalog file (11 or any other). The same<br> >> is true for the name of the directory with the actual models.<br> >><br> >>> On 04/11/2018 01:02 PM, Stephan Oepen wrote:<br> >>> hi filip,<br> >>><br> >>> thanks for your feedback. i am happy for the opportunity to discuss<br> >>> usability aspects! 
i am copying andrey, the maintainer of the vectors<br> >>> repository.<br> >>><br> >>> casting the catalogue in JSON (with standard language codes) actually<br> >>> was intended to make script-based access easier while improving<br> >>> scalability. consider the analogy of maintaining a small number of<br> >>> books on a shelf in your office vs. curating a much larger collection<br> >>> in a library. ultimately, your code needs to be informed about the<br> >>> path to the model. before, you happened to know and remember<br> >>> ‘.../conll17/English/en.vectors.xz’, whereas now the corresponding<br> >>> string is something like ‘.../11/40.zip’, which you first have to look<br> >>> up in the catalogue and may find harder to remember. but either way,<br> >>> each model corresponds to one location on disk, so users and scripts<br> >>> need to be able to look these up. before, one had to either know or<br> >>> browse the directory structure and interpret non-standard language<br> >>> names; this would have been harder to script than loading the JSON<br> >>> file and looking up by ISO language code, i would argue. also, the<br> >>> old directory collection only provided limited metadata, there was no<br> >>> principled way to add entries (we have another 31 english models in<br> >>> the current repository, besides the CoNLL 2017 one), and there was no<br> >>> support for versioning (other than duplicating directories). our hope<br> >>> is that these benefits and the expected growth over time (while<br> >>> maintaining stability and replicability for older entries) will<br> >>> outweigh the one-time inconvenience for expert users like you to learn<br> >>> the new naming scheme.<br> >>><br> >>> finally, we recommend not unzipping the archive (which would lead to<br> >>> data duplication) but rather reading the vectors from the archive<br> >>> in situ. 
there is some example code for how to do that in python<br> >>> here:<br> >>><br> >>> <a href="http://wiki.nlpl.eu/index.php/Vectors/home" rel="noreferrer" target="_blank">http://wiki.nlpl.eu/index.php/Vectors/home</a><br> >>><br> >>> —seriously, are you not at all swayed by at least some of these<br> >>> arguments? i think one could discuss two fundamental questions, at<br> >>> least: (a) ‘local’ shelf vs. public library, and (b) how best to<br> >>> organize the library, in particular its catalogue. i would like to<br> >>> think that NLPL is about the library way of thinking, so would be<br> >>> happiest to hear specific suggestions for how to better catalogue it.<br> >>> what format would you suggest for the catalogue, provided that it<br> >>> needs to afford multiple ways of looking for individual models?<br> >>><br> >>> best wishes, oe<br> >>><br> >>><br> >>>> On Wed, Apr 11, 2018 at 11:48 AM, Filip Ginter <<a href="mailto:figint@utu.fi" target="_blank">figint@utu.fi</a>> wrote:<br> >>>> Hi guys<br> >>>><br> >>>> Before, when I needed English vectors from CoNLL17, I read the file<br> >>>> vectors/CoNLL17/en.vectors.xz on taito. Stephan was unhappy about this and<br> >>>> asked me to delete the vectors from /proj/nlpl. Now, to achieve the same, I<br> >>>> need to parse a json file helpfully named 11.json, which eventually tells me<br> >>>> the vectors are in the file 11/40.zip, which I then need to unzip and then I<br> >>>> get my vectors. That is not what I would call an improvement over<br> >>>> vectors/CoNLL17/en.vectors.xz . :D ...not to complain or anything, I will<br> >>>> grab my own copy of these vectors from CoNLL and ignore the /proj/nlpl<br> >>>> version, but maybe you want to be aware that this choice of layered,<br> >>>> numbered files does not suit a script-driven workflow. 
:)<br> >>>><br> >>>> Cheers<br> >>>><br> >>>> F<br> >>>><br> >>>><br> >>>>> On Thu, Mar 8, 2018 at 10:06 AM, Filip Ginter <<a href="mailto:figint@utu.fi" target="_blank">figint@utu.fi</a>> wrote:<br> >>>>><br> >>>>> Hi Stephan<br> >>>>><br> >>>>> Wiped.<br> >>>>><br> >>>>> I have my own copy to run my scripts, which pays off now. :D<br> >>>>><br> >>>>> Cheers<br> >>>>><br> >>>>> F<br> >>>>><br> >>>>><br> >>>>>> On Tue, Mar 6, 2018 at 4:07 PM, Stephan Oepen <<a href="mailto:oe@ifi.uio.no" target="_blank">oe@ifi.uio.no</a>> wrote:<br> >>>>>><br> >>>>>> hi filip,<br> >>>>>><br> >>>>>> andrey and i have a new release candidate of the NLPL vectors<br> >>>>>> repository ready for public announcement; we presented the emerging<br> >>>>>> structure at the NLPL walk-through during the winter school. for some<br> >>>>>> general information, please see:<br> >>>>>><br> >>>>>> <a href="http://wiki.nlpl.eu/index.php/Vectors/home" rel="noreferrer" target="_blank">http://wiki.nlpl.eu/index.php/Vectors/home</a><br> >>>>>><br> >>>>>> to actually look at the current set of files, you will have to log in<br> >>>>>> to Abel. but i would like to change that and make<br> >>>>>> ‘/proj/nlpl/data/vectors/’ a replica of the corresponding Abel<br> >>>>>> directory, using rsync(1).<br> >>>>>><br> >>>>>> to accomplish that, could i ask you to remove the contents of the<br> >>>>>> current ‘/proj/nlpl/data/vectors/’ on Taito please? 
we have included<br> >>>>>> the CoNLL 2017 embeddings in version 1.1 of the NLPL repository, so<br> >>>>>> putting a copy of the NLPL vectors repository on Taito will still make<br> >>>>>> those models available (albeit using a different naming scheme, of<br> >>>>>> course).<br> >>>>>><br> >>>>>> with thanks in advance, oe<br> >>>>><br> >>>>><br> >>>><br> >><br> >><br> >> -- <br> >> Andrei<br> >> PhD Candidate at Language Technology Group (LTG)<br> >> University of Oslo<br> <br> <br> -- <br> Andrei<br> PhD Candidate at Language Technology Group (LTG)<br> University of Oslo<br> </blockquote></div></div>
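<div dir="auto">as a rough illustration of the find(1)-like catalogue query mentioned at the top of this message, here is a minimal python sketch. the field names (language, corpus, framework, lemmas, dimensions) and the three entries are invented for the example and are an assumption; they do not reflect the actual schema or contents of the NLPL catalogue file.</div>

```python
import json

# Hypothetical catalogue: the field names and all three entries are made up
# for illustration and are NOT the real NLPL catalogue schema or contents.
CATALOGUE_JSON = """
[
  {"id": 1, "language": "eng", "corpus": "wikipedia", "framework": "word2vec",
   "lemmas": "pos", "dimensions": 300},
  {"id": 2, "language": "eng", "corpus": "web", "framework": "fasttext",
   "lemmas": "none", "dimensions": 100},
  {"id": 3, "language": "nob", "corpus": "wikipedia", "framework": "fasttext",
   "lemmas": "pos", "dimensions": 100}
]
"""


def find(entries, **criteria):
    """Return the entries whose fields equal every given criterion.

    Criteria can be combined in any order, which is what a fixed
    directory hierarchy cannot offer.
    """
    return [
        entry
        for entry in entries
        if all(entry.get(key) == value for key, value in criteria.items())
    ]


entries = json.loads(CATALOGUE_JSON)

# All models trained on a wikipedia-derived corpus, in any language.
wikipedia_models = find(entries, corpus="wikipedia")

# All fasttext models that use pos-disambiguated lemmas.
fasttext_pos_models = find(entries, framework="fasttext", lemmas="pos")

print([entry["id"] for entry in wikipedia_models])
print([entry["id"] for entry in fasttext_pos_models])
```

<div dir="auto">because each entry is a flat record, neither query depends on the order in which the criteria are listed, whereas a directory tree would require walking every branch above the level where the criterion lives.</div>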