[NLPL Task Force (A)] Versioning PIDs and Metadata

Martin Matthiesen martin.matthiesen at csc.fi
Mon Nov 13 10:08:08 UTC 2017


Hello,

First, it was nice, after all, to meet you in person in Oslo, even though I dreaded the long day a bit. Joakim, I have put you in cc, hoping that you might be interested in the issues discussed.

I keep bringing up versioning, PIDs and metadata in our meetings, and I felt I needed to be a bit more concrete about what I mean by that.

Consider for example this dataset:

The Suomi 24 Sentences Corpus (2016H2)

http://urn.fi/urn:nbn:fi:lb-2017021505

We make use of the Relation field to describe the relation to similar datasets and the Documentation field to give attribution information.

We also have a policy on how updates happen:

http://urn.fi/urn:nbn:fi:lb-201710212

Now I know that this all sounds very bureaucratic, but it is actually no more bureaucratic than, say, citation rules[1]. I can say from experience that we tried to cut some corners with dataset PIDs and versioning, and it bit us later.
 
I am not trying to say that NLPL should copy the model outlined above, but if we do not adopt all or parts of it, I would like that to be a conscious decision. (I am also not saying that our model is the last word on the matter; comments are welcome.) My thinking with these processes and formalities is that they should free people to do creative work while still maintaining usability for others.

As to data and software integrity, I propose that we indeed look into ways to have "unit test"-style processes that can ensure uniformity. I mentioned this in Oslo, but I am not sure now who was present then.
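
To make the "unit test" idea a bit more concrete, here is a minimal sketch in Python of what I have in mind. It assumes that each dataset version ships with a manifest mapping relative file names to SHA-256 checksums; the manifest format and the function names (sha256_of, verify_manifest) are made up for illustration and are not something we currently use.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root: Path, manifest: Dict[str, str]) -> List[str]:
    """Compare files under root against a hypothetical checksum manifest.

    Returns a list of problems; an empty list means every listed file
    exists and matches its recorded checksum.
    """
    problems = []
    for relative_name, expected in sorted(manifest.items()):
        target = root / relative_name
        if not target.is_file():
            problems.append(f"missing: {relative_name}")
        elif sha256_of(target) != expected:
            problems.append(f"checksum mismatch: {relative_name}")
    return problems
```

A check like this could run whenever a dataset version is installed or copied, so that a given PID always refers to exactly the same bytes.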

Cheers,
Martin


[1] https://writing.wisc.edu/Handbook/PDF/Acknowledging_Sources.pdf


-- 
Martin Matthiesen
CSC - Tieteen tietotekniikan keskus
CSC - IT Center for Science
PL 405, 02101 Espoo, Finland
+358 9 457 2376, martin.matthiesen at csc.fi
Public key : https://pgp.mit.edu/pks/lookup?op=get&search=0x74B12876FD890704
Fingerprint: AA25 6F56 5C9A 8B42 009F  BA70 74B1 2876 FD89 0704


