[NLPL Task Force (A)] CoNLL-2017 raw data on taito

Martin Matthiesen martin.matthiesen at csc.fi
Thu Nov 30 16:24:46 UTC 2017


Hi Filip, 

This sounds good to me. It raises some interesting infra questions (to me at least): 

Could we compute a grand total hash that ensures that the whole thing is correctly in place (e.g. [1])? 
Would we want that on a per-tar-file basis (to be able to use only a partial corpus)? 
And here I do not mean hashing the tar file itself, but verifying that the extracted contents of the tar are correctly in place. 
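
To make the idea concrete, here is a minimal sketch of such a tree hash, not a tested tool. The function name tree_digest, the choice of SHA-256, and the decision to skip symlinks and ignore empty directories are all assumptions for illustration, not anything we have agreed on. It hashes each file's contents, then folds the sorted relative paths and per-file hashes into one total.

```python
import hashlib
import os


def tree_digest(root, algo="sha256", chunk_size=1 << 20):
    """Hex digest over the relative paths and contents of regular files under root.

    Hypothetical helper for illustration. Symlinks are skipped and empty
    directories do not contribute to the digest.
    """
    total = hashlib.new(algo)
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort fixes the order os.walk descends in
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            # Hash the file contents in chunks so large files fit in memory.
            file_hash = hashlib.new(algo)
            with open(path, "rb") as fh:
                for chunk in iter(lambda: fh.read(chunk_size), b""):
                    file_hash.update(chunk)
            # Use "/" as separator so the digest does not depend on the OS.
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            total.update(rel.encode("utf-8", "surrogateescape"))
            total.update(b"\0")
            total.update(file_hash.hexdigest().encode("ascii"))
            total.update(b"\n")
    return total.hexdigest()
```

For the per-tar-file variant, the same function could be called once per extracted top-level directory and the resulting digests published next to the data, so that a partial copy can still be verified.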

I am curious: Why did you not compress the tar files? Too slow? 

Cheers, 
Martin 

[1] https://stackoverflow.com/questions/4830089/how-to-checksum-an-entire-folder-structure 

-- 
Martin Matthiesen 
CSC - Tieteen tietotekniikan keskus 
CSC - IT Center for Science 
PL 405, 02101 Espoo, Finland 
+358 9 457 2376, martin.matthiesen at csc.fi 
Public key : https://pgp.mit.edu/pks/lookup?op=get&search=0x74B12876FD890704 
Fingerprint: AA25 6F56 5C9A 8B42 009F BA70 74B1 2876 FD89 0704 

> From: "Filip Ginter" <ginter at cs.utu.fi>
> To: "infrastructure" <infrastructure at nlpl.eu>
> Sent: Thursday, 30 November, 2017 10:15:06
> Subject: [NLPL Task Force (A)] CoNLL-2017 raw data on taito

> Hi guys

> Is it okay for me to put this data [
> https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1989 ] in the nlpl
> directory on taito? We actually have this data in one of our researchers' work
> directories on taito, so the total space usage on taito stays the same: 522GB.
> This is a useful dataset for parser training etc.

> - Filip