[NLPL Task Force (A)] CoNLL-2017 raw data on taito
Filip Ginter
figint at utu.fi
Thu Nov 30 19:38:11 UTC 2017
Hi Martin
Not sure about the hashes. If you think it's useful, then go for it. :)
Myself, I'm thinking these are neither Space Shuttle plans nor bank records,
and if something goes haywire then xz will crash and we will redownload. :D I
guess I'm not much of a precision guy with these things. :)
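
For anyone who does want the hash, a grand-total checksum over an extracted
tree of the kind Martin asks about could be sketched roughly as below. This
is only an illustration, not something that exists on taito: the function
name tree_digest, the choice of SHA-256 and the way paths are mixed into the
digest are all assumptions made for the example.

```python
import hashlib
import os


def tree_digest(root, chunk_size=1 << 20):
    """Return one SHA-256 hex digest covering every file under root.

    Hypothetical helper: the digest depends on each file's relative path
    and content, so a missing, renamed or altered file changes the result.
    """
    total = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # fix the traversal order so the digest is reproducible
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            file_hash = hashlib.sha256()
            with open(path, "rb") as handle:
                # Read in chunks so large .xz files never sit in memory whole.
                for chunk in iter(lambda: handle.read(chunk_size), b""):
                    file_hash.update(chunk)
            total.update(rel.encode("utf-8") + b"\0")
            total.update(file_hash.digest())
    return total.hexdigest()
```

Calling it once per unpacked tar directory would give the per-tar digests,
and calling it on the parent directory would give the grand total.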
The tars are built the way they are in Prague. They actually contain a
bunch of .xz files, so compressing the whole tar would no longer make
any difference. I will unpack the tars, but of course not decompress the xz
files.
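
As a rough illustration of the "xz will crash" safety net: the .xz files can
be checked for damage after unpacking without writing the decompressed text
anywhere. The sketch below uses Python's standard lzma module; the helper
names xz_is_intact and find_damaged_xz are made up for the example, and it
assumes the files of interest end in .xz.

```python
import lzma
import os


def xz_is_intact(path, chunk_size=1 << 20):
    """Stream-decompress one .xz file and discard the output.

    Returns False if the file is truncated or fails its integrity check.
    """
    try:
        with lzma.open(path, "rb") as stream:
            while stream.read(chunk_size):
                pass  # only decoding matters here, the data is thrown away
    except (lzma.LZMAError, EOFError):
        return False
    return True


def find_damaged_xz(root):
    """Return a sorted list of .xz files under root that fail to decode."""
    damaged = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".xz"):
                path = os.path.join(dirpath, name)
                if not xz_is_intact(path):
                    damaged.append(path)
    return sorted(damaged)
```

Only the files reported as damaged would then need to be downloaded again.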
Cheers
F
On Thu, Nov 30, 2017 at 6:24 PM, Martin Matthiesen <martin.matthiesen at csc.fi> wrote:
> Hi Filip,
>
> This sounds good to me. It raises some interesting infra questions (to
> me at least):
>
> Could we compute a grand total hash that ensures that the whole thing is
> correctly in place (e.g. [1])?
> Would we want that on a per-tar file basis (to be able to use only a
> partial corpus)?
> And here I do not mean to hash the tar-file itself, but to make sure that
> the extracted tar is in place correctly.
>
> I am curious: Why did you not compress the tar files? Too slow?
>
> Cheers,
> Martin
>
>
> [1] https://stackoverflow.com/questions/4830089/how-to-checksum-an-entire-folder-structure
>
> --
> Martin Matthiesen
> CSC - Tieteen tietotekniikan keskus
> CSC - IT Center for Science
> PL 405, 02101 Espoo, Finland
> +358 9 457 2376, martin.matthiesen at csc.fi
> Public key : https://pgp.mit.edu/pks/lookup?op=get&search=0x74B12876FD890704
> Fingerprint: AA25 6F56 5C9A 8B42 009F BA70 74B1 2876 FD89 0704
>
> ------------------------------
>
> *From: *"Filip Ginter" <ginter at cs.utu.fi>
> *To: *"infrastructure" <infrastructure at nlpl.eu>
> *Sent: *Thursday, 30 November, 2017 10:15:06
> *Subject: *[NLPL Task Force (A)] CoNLL-2017 raw data on taito
>
> Hi guys
>
> Is it okay for me to put this data
> https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1989
> in the nlpl directory on taito? We actually have this data in the work
> directory of one of our researchers on taito, so the total space usage on
> taito stays the same. 522GB. This is a useful dataset for parser training
> etc.
>
> - Filip
>
>