[NLPL Task Force (A)] CoNLL-2017 raw data on taito
Filip Ginter
figint at utu.fi
Fri Dec 1 09:17:34 UTC 2017
Good points, Martin. Making sure we don't lose them bits is what I trust
CSC with. I guess I can be non-precision because I can rely on you
precision people to run the systems. :)
The original download page lists MD5 hashes
(https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1989), so I
suppose we can check those. I'll take care of it.
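A minimal sketch of the check I have in mind, assuming the hashes from
that page are saved into a file md5sums.txt in the usual
"<hash>  <filename>" format (the file and directory names here are
placeholders, nothing that exists yet):

    cd /proj/nlpl/data/conll17   # placeholder path on taito
    md5sum -c md5sums.txt        # prints OK/FAILED for each tar

md5sum -c exits non-zero if any file fails, so it is easy to script.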
I am used to compressing. My rationale here is that the processing applied
to the data is typically heavy, so the relative proportion of time spent
decompressing is tiny. For example right now, when you look at "squeue -u
ginter", you see that a lot is happening with these .xz files as I am
parsing them. The decompression is maybe 1% of the work, if even that. And
I think decompressed these are easily 1.5TB, so keeping them compressed
saves 1TB of space on taito. So I figured that is the way to go. :)
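To illustrate why the decompression cost stays marginal: the .xz files can
be streamed straight into whatever does the heavy lifting, roughly like
this (the file name and my_parser are placeholders, not my actual job
script):

    # stream-decompress; nothing is ever unpacked to disk
    xzcat fi-web-part01.txt.xz | my_parser > fi-web-part01.parsed

xz decompression is much faster than compression, so the pipeline time is
dominated by the parser, not by xzcat.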
Cheers
F
On Fri, Dec 1, 2017 at 11:08 AM, Martin Matthiesen <martin.matthiesen at csc.fi> wrote:
> Hi Filip!
>
> ------------------------------
>
> *From: *"Filip Ginter" <figint at utu.fi>
> *To: *"Martin Matthiesen" <martin.matthiesen at csc.fi>
> *Cc: *"infrastructure" <infrastructure at nlpl.eu>
> *Sent: *Thursday, 30 November, 2017 21:38:11
> *Subject: *Re: [NLPL Task Force (A)] CoNLL-2017 raw data on taito
>
> Hi Martin
>
> Not sure about the hashes. If you think it's useful, then go for it. :)
>
> That is the thing: I am not sure myself, and I don't want to
> over-engineer things.
>
> Myself, I'm thinking this is not Space Shuttle plans nor bank records,
> and if something goes haywire then xz will crash and we will redownload.
> :D I guess I'm not much of a precision guy with these things. :)
>
> Well, we just got bitten by data getting lost in a conversion process
> without us noticing, so I guess it is part of my job to run after those
> pesky bits. I would like to get to some kind of process where you indeed
> would not need to be a precision guy concerning data storage, but could
> just rely on it working. As soon as we want to have one data set in two
> places (be it for performance reasons), I think we need to think about
> this. The other thing is that you personally can always re-download,
> since you have full control, but someone using your dataset cannot, and
> unless we have a super fast process in place to reliably fix such
> missing/corrupted-data errors, this might slow down someone else's
> research.
>
>
> The tars are built the way they are in Prague. They actually contain a
> bunch of .xz files, so compressing the whole tar wouldn't make any
> difference anymore. I will unpack the tars, but of course not decompress
> the .xz files.
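>
> In other words, per tar this is nothing more than (the file name is a
> placeholder):
>
>     tar -xf conll17_fi.tar   # the inner .xz members stay compressed
>
> No -J or -z flag is needed, since the tar itself is uncompressed; only
> its members are.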
>
> Ok, I thought there was a reason for just tar. And indeed compressed
> files have some built-in integrity checking. xz/zip files that crash on
> decompression are indeed less of a problem than silently missing ones. I
> am curious to hear about your experiences with compression in the first
> place; some time ago I decided against it for data on Taito, since you
> pay in processing time. For compressed data, I think making sure that the
> .xz files are all there and work would be enough. My concern is
> accidental data loss, not a deliberate one.
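>
> A sketch of the check I have in mind ("xz -t" verifies each archive's
> built-in checksums without writing any output; the path is a
> placeholder):
>
>     find /proj/nlpl/data/conll17 -name '*.xz' -print0 | xargs -0 xz -t
>
> Together with a file listing compared against the source, that should
> catch both corrupted and silently missing files.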
>
> Maybe this is an item for our infrastructure meeting. It is indeed not
> rocket science, but even there, rather trivial problems can have severe
> consequences:
> https://edition.cnn.com/TECH/space/9909/30/mars.metric.02/
>
> Regards,
> Martin
>
>
>
> Cheers
>
> F
>
>
> On Thu, Nov 30, 2017 at 6:24 PM, Martin Matthiesen <martin.matthiesen at csc.fi> wrote:
>
>> Hi Filip,
>>
>> This sounds good to me. It raises some interesting infra questions (to
>> me at least):
>>
>> Could we compute a grand-total hash that ensures the whole thing is
>> correctly in place (e.g. [1])? Would we want that on a per-tar-file
>> basis (to be able to use only a partial corpus)? And here I do not mean
>> hashing the tar file itself, but making sure that the extracted tar is
>> in place correctly; a sketch follows below.
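>>
>> Along the lines of [1], a grand-total hash could be as simple as this
>> sketch (the directory name is a placeholder; running it inside each
>> extracted tar would give the per-tar variant):
>>
>>     cd extracted_conll17   # placeholder
>>     find . -type f -exec sha256sum {} + | LC_ALL=C sort -k 2 | sha256sum
>>
>> Sorting by file name keeps the result independent of the order in which
>> find happens to visit the files.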
>>
>> I am curious: why did you not compress the tar files? Too slow?
>>
>> Cheers,
>> Martin
>>
>>
>> [1] https://stackoverflow.com/questions/4830089/how-to-checksum-an-entire-folder-structure
>>
>> --
>> Martin Matthiesen
>> CSC - Tieteen tietotekniikan keskus
>> CSC - IT Center for Science
>> PL 405, 02101 Espoo, Finland
>> +358 9 457 2376, martin.matthiesen at csc.fi
>> Public key : https://pgp.mit.edu/pks/lookup?op=get&search=0x74B12876FD890704
>> Fingerprint: AA25 6F56 5C9A 8B42 009F BA70 74B1 2876 FD89 0704
>>
>> ------------------------------
>>
>> *From: *"Filip Ginter" <ginter at cs.utu.fi>
>> *To: *"infrastructure" <infrastructure at nlpl.eu>
>> *Sent: *Thursday, 30 November, 2017 10:15:06
>> *Subject: *[NLPL Task Force (A)] CoNLL-2017 raw data on taito
>>
>> Hi guys
>>
>> Is it okay for me to stick this data
>> https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1989 into the
>> nlpl directory on taito? We actually have this data in one of our
>> researchers' work directories on taito, so the total space usage on
>> taito stays the same: 522GB. This is a useful dataset for parser
>> training etc.
>>
>> - Filip
>>
>>
>