[NLPL Task Force (A)] CoNLL-2017 raw data on taito

Fri Dec 1 09:08:54 UTC 2017

Hi Filip! 

> From: "Filip Ginter" <figint at utu.fi>
> To: "Martin Matthiesen" <martin.matthiesen at csc.fi>
> Cc: "infrastructure" <infrastructure at nlpl.eu>
> Sent: Thursday, 30 November, 2017 21:38:11
> Subject: Re: [NLPL Task Force (A)] CoNLL-2017 raw data on taito

> Hi Martin

> Not sure about the hashes. If you think it's useful, then go for it. :)

That is the thing: I am not sure myself, and I don't want to over-engineer things.

> Myself, I'm thinking this is not Space Shuttle plans nor bank records and if
> something goes haywire then xz will crash and we will redownload. :D I guess
> I'm not much of a precision guy with these things. :)

Well, we just got bitten by data getting lost in a conversion process without us noticing, so I guess it is part of my job to run after those pesky bits. I would like to get to some kind of process where you indeed would not need to be a precision guy concerning data storage, but can simply rely on it working. As soon as we want to have one data set in two places (say, for performance reasons at both sites), I think we need to think about this. The other thing is that you personally can always re-download, since you have full control, but someone using your dataset cannot. Unless we have a super fast process in place to reliably fix such missing or corrupted data, this might slow down someone else's research.
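
To make the "one data set in two places" case concrete, here is a minimal sketch of a per-file hash manifest plus a single grand-total digest over an extracted directory tree. It is only an illustration under assumptions: Python 3 and SHA-256 are my choices, and the function names (file_sha256, tree_manifest, tree_digest, compare_manifests) are hypothetical, not an existing NLPL tool.

```python
import hashlib
import os


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_sha256(full)
    return manifest


def tree_digest(manifest):
    """Fold a manifest into one grand-total digest, independent of walk order."""
    total = hashlib.sha256()
    for rel_path in sorted(manifest):
        total.update(os.fsencode(rel_path))
        total.update(b"\0")
        total.update(manifest[rel_path].encode("ascii"))
        total.update(b"\n")
    return total.hexdigest()


def compare_manifests(reference, copy):
    """Return (missing, changed) paths of copy, judged against reference."""
    missing = sorted(p for p in reference if p not in copy)
    changed = sorted(
        p for p in reference if p in copy and copy[p] != reference[p]
    )
    return missing, changed
```

Building the manifest per unpacked tar would allow checking a partial corpus; comparing tree_digest values answers "is everything in place?" in one step, while compare_manifests names the files that are missing or differ.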

> The tars are built the way they are in Prague. They actually contain a bunch of
> .xz files, so compression of the whole tar wouldn't anymore make any
> difference. I will unpack the tars, but of course not decompress the xz files.

OK, I thought there was a reason for using plain tar. And indeed compressed files have some built-in integrity checking. Crashing xz/zip files are less of a problem than silently missing ones. I am curious to hear about your experiences with compression in the first place; some time ago I decided against it for data on Taito, since you pay in processing time. For compressed data I think making sure that the xz files are all there and work would be enough. My concern is accidental data loss, not deliberate loss.
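
A check that the xz files "are all there and work" could look roughly like the sketch below. It assumes Python 3 with the standard lzma module and a list of expected relative paths taken from somewhere trusted (for instance the tar listing); that list and the function names xz_is_intact and check_xz_files are hypothetical.

```python
import lzma
import os


def xz_is_intact(path, chunk_size=1 << 20):
    """Decompress the whole file, discarding the output; True if no error."""
    try:
        with lzma.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass
    except (lzma.LZMAError, EOFError, OSError):
        # LZMAError: corrupt data; EOFError: truncated file.
        return False
    return True


def check_xz_files(root, expected):
    """Return (missing, broken) among the relative .xz paths in expected."""
    missing, broken = [], []
    for rel_path in expected:
        full = os.path.join(root, rel_path)
        if not os.path.isfile(full):
            missing.append(rel_path)
        elif not xz_is_intact(full):
            broken.append(rel_path)
    return missing, broken
```

This decompresses every file once, so it costs processing time, but nothing is written to disk. It catches both the silently missing file and the truncated or corrupted one.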

Maybe this is an item for our infrastructure meeting. This is indeed not rocket science, but even there rather trivial problems can have severe consequences:
https://edition.cnn.com/TECH/space/9909/30/mars.metric.02/ 

Regards, 
Martin 

> Cheers

> F

> On Thu, Nov 30, 2017 at 6:24 PM, Martin Matthiesen <martin.matthiesen at csc.fi> wrote:

>> Hi Filip,

>> This sounds good to me. It raises some interesting infra questions (to me at
>> least):

>> Could we compute a grand total hash that ensures that the whole thing is
>> correctly in place (eg [1])?
>> Would we want that on a per-tar file basis (to be able to use only a partial
>> corpus)?
>> And here I do not mean to hash the tar-file itself, but to make sure that the
>> extracted tar is in place correctly.

>> I am curious: Why did you not compress the tar files? Too slow?

>> Cheers,
>> Martin

>> [1] https://stackoverflow.com/questions/4830089/how-to-checksum-an-entire-folder-structure

>> --
>> Martin Matthiesen
>> CSC - Tieteen tietotekniikan keskus
>> CSC - IT Center for Science
>> PL 405, 02101 Espoo, Finland
>> +358 9 457 2376, martin.matthiesen at csc.fi
>> Public key: https://pgp.mit.edu/pks/lookup?op=get&search=0x74B12876FD890704
>> Fingerprint: AA25 6F56 5C9A 8B42 009F BA70 74B1 2876 FD89 0704

>>> From: "Filip Ginter" <ginter at cs.utu.fi>
>>> To: "infrastructure" <infrastructure at nlpl.eu>
>>> Sent: Thursday, 30 November, 2017 10:15:06
>>> Subject: [NLPL Task Force (A)] CoNLL-2017 raw data on taito

>>> Hi guys

>>> Is it okay for me to stick this data
>>> https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1989 into the nlpl
>>> directory on taito? We actually have this data in one of our researchers' work
>>> directories on taito, so the total space usage on taito stays the same: 522GB.
>>> This is a useful dataset for parser training etc.

>>> - Filip