[NLPL Task Force (A)] estimates for OPUS storage requirements

Tiedemann, Jörg jorg.tiedemann at helsinki.fi
Mon Apr 16 06:09:40 UTC 2018


Hi,


I am working on restructuring the OPUS releases to use zip archives in place of many individually gzipped files. Here are some figures in case we want to mirror the release files to abel:

Total size: 4.8TB
Total number of files: 413,729

These figures include several versions of some sub-corpora, so we can save a lot of space by syncing only the latest version of each. We could also leave out some derived data such as TMX files or word alignments. Most importantly, OpenSubtitles takes by far the biggest share of the space:

5.6G    OpenSubtitles/v1
116G    OpenSubtitles/v2011
138G    OpenSubtitles/v2012
32G     OpenSubtitles/v2013
1.1T    OpenSubtitles/v2016
2.3T    OpenSubtitles/v2018
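
To put the listing above in proportion, here is a minimal Python sketch that adds up the OpenSubtitles versions and compares them with the 4.8TB total. It assumes the sizes use binary units (as `du -h` would print them) and that the rounded figures are accurate enough for a rough estimate; the helper name `to_gib` is only for illustration.

```python
# Sizes copied from the listing above.
# Assumption: binary units (1T = 1024G), as printed by `du -h`.
UNITS = {"M": 1 / 1024, "G": 1.0, "T": 1024.0}  # multipliers to GiB


def to_gib(size: str) -> float:
    """Convert a string such as '5.6G' or '1.1T' to GiB."""
    return float(size[:-1]) * UNITS[size[-1]]


versions = {
    "v1": "5.6G",
    "v2011": "116G",
    "v2012": "138G",
    "v2013": "32G",
    "v2016": "1.1T",
    "v2018": "2.3T",
}

total_opus_gib = to_gib("4.8T")
opensubs_gib = sum(to_gib(size) for size in versions.values())
older_gib = opensubs_gib - to_gib(versions["v2018"])

print(f"OpenSubtitles, all versions: {opensubs_gib / 1024:.2f}T")
print(f"Share of the OPUS total: {opensubs_gib / total_opus_gib:.0%}")
print(f"Older versions (not synced): {older_gib / 1024:.2f}T")
```

Since the inputs are rounded to two significant digits, the results should be read as rough estimates only.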

I suggest synchronizing only the latest version (v2018). Here is the breakdown by release type and derived data for OpenSubtitles v2018:

2.7G    OpenSubtitles/v2018/dic
239M    OpenSubtitles/v2018/freq
62G     OpenSubtitles/v2018/mono
264G    OpenSubtitles/v2018/moses
312G    OpenSubtitles/v2018/parsed
101G    OpenSubtitles/v2018/raw
268G    OpenSubtitles/v2018/tmx
645G    OpenSubtitles/v2018/wordalign
636G    OpenSubtitles/v2018/xml

I suggest skipping tmx, parsed (currently produced with UDPipe, and quite bad) and possibly wordalign. This would reduce the size dramatically. I would then keep the older versions in long-term storage and on taito.
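
As a rough check of how much the proposed exclusions save, the following Python sketch sums the v2018 subdirectories with and without the skipped ones. The sizes are copied from the listing above; treating them as binary units (GiB) is an assumption, and the helper name `remaining` is only for illustration.

```python
# Sizes in GiB for OpenSubtitles/v2018, copied from the listing above.
# Assumption: binary units, so 239M is converted as 239/1024.
v2018 = {
    "dic": 2.7,
    "freq": 239 / 1024,
    "mono": 62,
    "moses": 264,
    "parsed": 312,
    "raw": 101,
    "tmx": 268,
    "wordalign": 645,
    "xml": 636,
}


def remaining(sizes, skipped):
    """Total size of all subdirectories that are not skipped."""
    return sum(size for name, size in sizes.items() if name not in skipped)


full = remaining(v2018, set())
without_tmx_parsed = remaining(v2018, {"tmx", "parsed"})
without_wordalign_too = remaining(v2018, {"tmx", "parsed", "wordalign"})

print(f"Everything:                   {full:7.0f}G")
print(f"Skipping tmx and parsed:      {without_tmx_parsed:7.0f}G")
print(f"Also skipping wordalign:      {without_wordalign_too:7.0f}G")
```

With these rounded inputs, skipping all three subdirectories leaves a little over 1T of the roughly 2.3T in v2018.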

Would there be space on abel for such a reduced form of OPUS?

All the best,
Jörg

********************************************************************************************
Jörg Tiedemann
Language Technology
University of Helsinki
