[NLPL Task Force (A)] CoNLL-2017 raw data on taito
Martin Matthiesen
martin.matthiesen at csc.fi
Fri Dec 1 09:29:49 UTC 2017
Hi,
I hope I am not spamming "infrastructure" here, but this exchange is precisely the reason why I said in Oslo that I benefit from NLPL. It is very difficult to reach the conclusions below just by thinking about them. I made some measurements comparing compressed vs. uncompressed data and found uncompressed significantly faster. But of course, it is all relative, as you point out. As discussed below, compression solves some integrity issues as well. So I am all for it. And thanks for the trust; we do try not to lose bits, but in this case we first need to know we have the right kind of stuff :)
Martin
--
Martin Matthiesen
CSC - Tieteen tietotekniikan keskus
CSC - IT Center for Science
PL 405, 02101 Espoo, Finland
+358 9 457 2376, martin.matthiesen at csc.fi
Public key : https://pgp.mit.edu/pks/lookup?op=get&search=0x74B12876FD890704
Fingerprint: AA25 6F56 5C9A 8B42 009F BA70 74B1 2876 FD89 0704
> From: "Filip Ginter" <figint at utu.fi>
> To: "Martin Matthiesen" <martin.matthiesen at csc.fi>
> Cc: "infrastructure" <infrastructure at nlpl.eu>
> Sent: Friday, 1 December, 2017 11:17:34
> Subject: Re: [NLPL Task Force (A)] CoNLL-2017 raw data on taito
> Good points, Martin. Making sure we don't lose them bits is what I trust CSC
> with. I guess I can be non-precision because I can rely on you precision people
> running the systems. :)
> The original page with the downloads has MD5 hashes
> ( https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1989 ), so I suppose
> we can check those. I'll take care of it.
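A minimal sketch of such a check, assuming the published MD5 list is saved next to the downloaded tars (the file names below are invented for illustration; the real list comes from the LINDAT page):

```shell
# Create a stand-in file and checksum list, then verify it the same way
# the downloaded tars would be verified against the published MD5 list.
printf 'demo data\n' > sample.tar
md5sum sample.tar > MD5SUMS      # LINDAT provides the real checksum list
md5sum -c MD5SUMS                # one "OK" line per file; non-zero exit on any mismatch
```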
> I am used to compressing. My rationale here is that since the processing applied
> to the data is typically heavy, the relative proportion of time spent decompressing
> is tiny. For example right now, if you look at "squeue -u ginter", you see that
> a lot is happening with these .xz files as I am parsing them. The
> decompression is maybe 1% of the work, if even that. And I think decompressed
> these are easily 1.5TB, so this saves 1TB of space on taito. So I figured that
> is the way to go. :)
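The pattern Filip describes can be sketched like this, with `wc -l` standing in for the actual parser (file names are invented):

```shell
# Stream-decompress with "xz -dc" and pipe straight into the processing
# step, so decompression cost overlaps with the much heavier parsing work
# and no decompressed copy ever touches the disk.
printf 'one sentence\nanother sentence\n' | xz > corpus.txt.xz
xz -dc corpus.txt.xz | wc -l     # counts the 2 lines of the stream
```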
> Cheers
> F
> On Fri, Dec 1, 2017 at 11:08 AM, Martin Matthiesen <martin.matthiesen at csc.fi> wrote:
>> Hi Filip!
>>> From: "Filip Ginter" <figint at utu.fi>
>>> To: "Martin Matthiesen" <martin.matthiesen at csc.fi>
>>> Cc: "infrastructure" <infrastructure at nlpl.eu>
>>> Sent: Thursday, 30 November, 2017 21:38:11
>>> Subject: Re: [NLPL Task Force (A)] CoNLL-2017 raw data on taito
>>> Hi Martin
>>> Not sure about the hashes. If you think it's useful, then go for it. :)
>> That is the thing: I am not sure myself, and I don't want to over-engineer things.
>>> Myself, I'm thinking this is not Space Shuttle plans or bank records, and if
>>> something goes haywire then xz will crash and we will re-download. :D I guess
>>> I'm not much of a precision guy with these things. :)
>> Well, we just got bitten by data getting lost in a conversion process without us
>> noticing, so I guess it is part of my job to run after those pesky bits. I
>> would like to get to some kind of process where you indeed would not need to be
>> a precision guy concerning data storage, but could just rely on it working OK. As
>> soon as we want to have one data set in two places (be it for performance
>> reasons in both places) I think we need to think about this. The other thing is
>> that you personally can always re-download, since you have full control, but
>> someone using your dataset cannot, and unless we have a super fast process in
>> place to reliably fix such missing/corrupted-data errors, this might slow down
>> someone else's research.
>>> The tars are built the way they are in Prague. They actually contain a bunch of
>>> .xz files, so compressing the whole tar wouldn't make any
>>> difference anymore. I will unpack the tars, but of course not decompress the xz files.
>> Ok, I thought there was a reason for just tar. And indeed compressed files have
>> some built-in integrity checking. xz/zip files that fail to decompress are indeed
>> less of a problem than silently missing ones. I am curious to hear your experiences
>> with compression in the first place; some time ago I decided against it for data on
>> Taito, since you pay in processing time. For compressed data I think making
>> sure that the xz files are all there and work would be enough. My concern is
>> accidental data loss, not deliberate loss.
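That built-in check can be exercised explicitly; a minimal sketch (the batch-check path in the last comment is an assumption):

```shell
# xz stores a CRC per block, so "xz -t" detects corruption without
# writing any decompressed output.
printf 'some data\n' | xz > part.xz
xz -t part.xz && echo "integrity OK"
# Batch-check a whole tree, e.g.:
#   find /proj/nlpl/data -name '*.xz' -exec xz -t {} +
```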
>> Maybe this is an item for our infrastructure meeting. This is indeed not rocket
>> science, but even there rather trivial problems can have severe consequences:
>> https://edition.cnn.com/TECH/space/9909/30/mars.metric.02/
>> Regards,
>> Martin
>>> Cheers
>>> F
>>> On Thu, Nov 30, 2017 at 6:24 PM, Martin Matthiesen <martin.matthiesen at csc.fi> wrote:
>>>> Hi Filip,
>>>> This sounds good to me. It raises some interesting infra questions (to me at
>>>> least):
>>>> Could we compute a grand-total hash that ensures the whole thing is
>>>> correctly in place (e.g. [1])?
>>>> Would we want that on a per-tar-file basis (to be able to use only a partial
>>>> corpus)?
>>>> And here I do not mean hashing the tar file itself, but making sure that the
>>>> extracted tar is in place correctly.
>>>> I am curious: why did you not compress the tar files? Too slow?
>>>> Cheers,
>>>> Martin
>>>> [1] https://stackoverflow.com/questions/4830089/how-to-checksum-an-entire-folder-structure
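One way to compute such a grand-total hash over an extracted tree, along the lines of the StackOverflow thread in [1] (directory and file names below are invented):

```shell
# Hash every file, sort the list by path so the result is deterministic,
# then hash the sorted list itself: one checksum for the whole tree.
mkdir -p demo/sub
printf 'a\n' > demo/f1
printf 'b\n' > demo/sub/f2
find demo -type f -exec md5sum {} + | sort -k 2 | md5sum
```

Note that this hash covers file contents and paths, but not permissions or timestamps, which seems sufficient for corpus-integrity purposes.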
>>>>> From: "Filip Ginter" <ginter at cs.utu.fi>
>>>>> To: "infrastructure" <infrastructure at nlpl.eu>
>>>>> Sent: Thursday, 30 November, 2017 10:15:06
>>>>> Subject: [NLPL Task Force (A)] CoNLL-2017 raw data on taito
>>>>> Hi guys
>>>>> Is it okay for me to stick this data
>>>>> ( https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1989 ) into the nlpl
>>>>> directory on taito? We actually have this data in one of our researchers' work
>>>>> directories on taito, so the total space usage on taito stays the same:
>>>>> 522GB. This is a useful dataset for parser training etc.
>>>>> - Filip