<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">
Hi,
<div class=""><br class="">
</div>
<div class=""><br class="">
</div>
<div class="">I am working on changing the release structure of OPUS to use zip archives in place of many individually gzipped files. Here are some figures in case we want to mirror the release files to abel:</div>
<div class=""><br class="">
</div>
<div class="">Total size: 4.8TB</div>
<div class="">Total number of files: 413,729</div>
<div class=""><br class="">
</div>
<div class="">This includes several versions of some sub-corpora, so we can save quite a lot by syncing only the latest version of each. We could also leave out some derived data such as TMX or word alignments. Most importantly, OpenSubtitles takes by
 far the biggest share of the space:</div>
<div class=""><br class="">
</div>
<div class="">5.6G    OpenSubtitles/v1<br class="">
116G    OpenSubtitles/v2011<br class="">
138G    OpenSubtitles/v2012<br class="">
32G     OpenSubtitles/v2013<br class="">
1.1T    OpenSubtitles/v2016<br class="">
2.3T    OpenSubtitles/v2018</div>
<div class=""><br class="">
</div>
<div class="">I would suggest synchronizing only the latest version (v2018). Here is the breakdown by release type and derived data for OpenSubtitles v2018:</div>
<div class=""><br class="">
</div>
<div class="">2.7G    OpenSubtitles/v2018/dic<br class="">
239M    OpenSubtitles/v2018/freq<br class="">
62G     OpenSubtitles/v2018/mono<br class="">
264G    OpenSubtitles/v2018/moses<br class="">
312G    OpenSubtitles/v2018/parsed<br class="">
101G    OpenSubtitles/v2018/raw<br class="">
268G    OpenSubtitles/v2018/tmx<br class="">
645G    OpenSubtitles/v2018/wordalign<br class="">
636G    OpenSubtitles/v2018/xml</div>
<div class=""><br class="">
</div>
<div class="">I would suggest skipping tmx, parsed (currently produced with UDPipe, and the quality is quite poor) and possibly wordalign. Skipping tmx and parsed saves about 580G, and dropping wordalign as well would bring v2018 down from 2.3T to roughly 1T. I would then keep the older versions in long-term storage and on taito.</div>
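<div class=""><br class="">
</div>
<div class="">As a rough cross-check of the listing above, here is a minimal Python sketch that adds up what would remain. The dictionary and the helper name remaining_gib are only for illustration, and I am assuming the sizes are in binary units, so 239M is counted as 239/1024 G:</div>
<pre class="">
```python
# Sizes in G, copied from the OpenSubtitles/v2018 listing above.
# Assumption: binary units, so 239M is converted as 239/1024 G.
sizes_gib = {
    "dic": 2.7,
    "freq": 239 / 1024,
    "mono": 62,
    "moses": 264,
    "parsed": 312,
    "raw": 101,
    "tmx": 268,
    "wordalign": 645,
    "xml": 636,
}


def remaining_gib(sizes, skip):
    """Sum the sizes of all release types that are not in `skip`."""
    return sum(size for name, size in sizes.items() if name not in skip)


total = remaining_gib(sizes_gib, set())
without_tmx_parsed = remaining_gib(sizes_gib, {"tmx", "parsed"})
without_wordalign = remaining_gib(sizes_gib, {"tmx", "parsed", "wordalign"})

print(f"everything:              {total:7.1f} G")
print(f"skip tmx, parsed:        {without_tmx_parsed:7.1f} G")
print(f"also skipping wordalign: {without_wordalign:7.1f} G")
```
</pre>
<div class="">The per-directory figures are rounded, so the totals are only approximate.</div>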
<div class=""><br class="">
</div>
<div class="">Would there be space on abel for such a reduced form of OPUS?</div>
<div class="">
<div class="">
<div style="color: rgb(0, 0, 0); letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">
<div class="">All the best,</div>
<div class="">Jörg</div>
<div class=""><br class="">
</div>
<div class="">
<div apple-content-edited="true" class="">
<div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">
<div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">
<div class="" style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">
<span class="" style="orphans: 2; widows: 2;">********************************************************************************************</span><br class="" style="orphans: 2; widows: 2;">
<span class="" style="orphans: 2; widows: 2;">Jörg Tiedemann</span></div>
<div class="" style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">
<span class="" style="orphans: 2; widows: 2;">Language Technology</span></div>
<div class="" style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">
<span style="orphans: 2; widows: 2;" class="">University of Helsinki</span></div>
</div>
</div>
</div>
</div>
</div>
</div>
<br class="">
</div>
</body>
</html>