[NLPL Task Force (A)] estimates for OPUS storage requirements
Tiedemann, Jörg
jorg.tiedemann at helsinki.fi
Mon Apr 16 06:09:40 UTC 2018
Hi,
I am working on changing the release structure of OPUS to use more zip archives instead of individual gzipped files. Here are some figures in case we want to mirror the release files to abel:
Total size: 4.8TB
Total number of files: 413,729
This includes different versions of some sub-corpora, and we can reduce the size considerably by syncing only the latest version. We could also leave out some derived data such as TMX or word alignments. Most importantly, OpenSubtitles takes by far the biggest share of the space:
5.6G OpenSubtitles/v1
116G OpenSubtitles/v2011
138G OpenSubtitles/v2012
32G OpenSubtitles/v2013
1.1T OpenSubtitles/v2016
2.3T OpenSubtitles/v2018
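To make the per-version figures easier to compare, here is a minimal sketch that totals the older OpenSubtitles versions against v2018. The sizes are copied from the listing above; the unit conversion (1T = 1024G, as in `du -h` output) and all variable names are assumptions made for illustration.

```python
# Sizes copied from the listing above. The suffixes are assumed to be binary
# units (1T = 1024G), as in `du -h` output; this is an assumption.
GIB_PER_UNIT = {"M": 1 / 1024, "G": 1.0, "T": 1024.0}


def to_gib(size: str) -> float:
    """Convert a string such as '1.1T' or '239M' to GiB."""
    return float(size[:-1]) * GIB_PER_UNIT[size[-1]]


opensubtitles_versions = {
    "v1": "5.6G",
    "v2011": "116G",
    "v2012": "138G",
    "v2013": "32G",
    "v2016": "1.1T",
    "v2018": "2.3T",
}

LATEST = "v2018"
sizes_gib = {name: to_gib(size) for name, size in opensubtitles_versions.items()}

# Everything except the latest version would stay out of the mirror.
older_gib = sum(size for name, size in sizes_gib.items() if name != LATEST)
latest_gib = sizes_gib[LATEST]

print(f"older versions: {older_gib:.0f}G, {LATEST}: {latest_gib:.0f}G")
```

Under that assumption, the older OpenSubtitles versions add up to roughly 1.4T that would not need to be mirrored.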
I would suggest synchronizing only the latest version (v2018). Here is the breakdown by release type and derived data for OpenSubtitles v2018:
2.7G OpenSubtitles/v2018/dic
239M OpenSubtitles/v2018/freq
62G OpenSubtitles/v2018/mono
264G OpenSubtitles/v2018/moses
312G OpenSubtitles/v2018/parsed
101G OpenSubtitles/v2018/raw
268G OpenSubtitles/v2018/tmx
645G OpenSubtitles/v2018/wordalign
636G OpenSubtitles/v2018/xml
I would suggest skipping tmx, parsed (currently produced with UDPipe, and quite bad) and possibly wordalign. This would reduce the size quite dramatically: tmx and parsed together account for 580G, and wordalign for another 645G. I would then keep older versions in long-term storage and on taito.
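As a rough check of the reduction, the sketch below sums the OpenSubtitles v2018 subdirectories with and without the skipped ones. The sizes come from the listing above, expressed in G with 239M taken as 239/1024 G (an assumption about the units); the function and variable names are hypothetical.

```python
# OpenSubtitles/v2018 subdirectory sizes in G, copied from the listing above.
# 239M is converted assuming binary units (1G = 1024M); this is an assumption.
v2018_sizes_g = {
    "dic": 2.7,
    "freq": 239 / 1024,
    "mono": 62.0,
    "moses": 264.0,
    "parsed": 312.0,
    "raw": 101.0,
    "tmx": 268.0,
    "wordalign": 645.0,
    "xml": 636.0,
}


def remaining_size(sizes: dict[str, float], skipped: set[str]) -> float:
    """Return the total size in G after leaving out the skipped subdirectories."""
    return sum(size for name, size in sizes.items() if name not in skipped)


full = remaining_size(v2018_sizes_g, set())
without_tmx_parsed = remaining_size(v2018_sizes_g, {"tmx", "parsed"})
without_wordalign_too = remaining_size(v2018_sizes_g, {"tmx", "parsed", "wordalign"})

print(f"everything:                   {full:.0f}G")
print(f"skip tmx, parsed:             {without_tmx_parsed:.0f}G")
print(f"skip tmx, parsed, wordalign:  {without_wordalign_too:.0f}G")
```

With these figures, skipping tmx and parsed leaves about 1711G, and also skipping wordalign leaves about 1066G.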
Would there be space on abel for such a reduced form of OPUS?
All the best,
Jörg
********************************************************************************************
Jörg Tiedemann
Language Technology
University of Helsinki