<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">
Hi,
<div class=""><br class="">
</div>
<div class=""><br class="">
</div>
<div class="">I am working on changing the release structure of OPUS to use zip archives instead of individual gzipped files. Here are some figures in case we want to mirror the release files to abel:</div>
<div class=""><br class="">
</div>
<div class="">Total size: 4.8TB</div>
<div class="">Total number of files: 413,729</div>
<div class=""><br class="">
</div>
<div class="">This includes several versions of some sub-corpora, so we can save a lot of space by syncing only the latest version of each. We could also leave out some derived data such as TMX or word alignments. Most importantly, OpenSubtitles takes by
far the biggest share of the space:</div>
<div class=""><br class="">
</div>
<div class="">5.6G OpenSubtitles/v1<br class="">
116G OpenSubtitles/v2011<br class="">
138G OpenSubtitles/v2012<br class="">
32G OpenSubtitles/v2013<br class="">
1.1T OpenSubtitles/v2016<br class="">
2.3T OpenSubtitles/v2018</div>
<div class=""><br class="">
</div>
<div class="">I suggest synchronizing only the latest version (v2018). Here is the breakdown by release type and derived data for OpenSubtitles v2018:</div>
<div class=""><br class="">
</div>
<div class="">2.7G OpenSubtitles/v2018/dic<br class="">
239M OpenSubtitles/v2018/freq<br class="">
62G OpenSubtitles/v2018/mono<br class="">
264G OpenSubtitles/v2018/moses<br class="">
312G OpenSubtitles/v2018/parsed<br class="">
101G OpenSubtitles/v2018/raw<br class="">
268G OpenSubtitles/v2018/tmx<br class="">
645G OpenSubtitles/v2018/wordalign<br class="">
636G OpenSubtitles/v2018/xml</div>
<div class=""><br class="">
</div>
<div class="">I suggest skipping tmx, parsed (currently parsed with UDPipe, quite bad) and possibly wordalign. Going by the figures above, leaving out tmx and parsed saves about 580G, and leaving out wordalign as well brings v2018 down from 2.3T to roughly 1.0T. I would then keep the older versions in long-term storage and on taito.</div>
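<div class=""><br class="">
</div>
<div class="">For anyone who wants to try other combinations, here is a minimal sketch of how the reduced size can be estimated from the v2018 listing. The numbers are copied from the listing; the names SIZES_GIB and remaining_gib are only illustrative, and I am assuming the M and G suffixes are powers of 1024. Since the listed sizes are already rounded, the result is an estimate only.</div>

```python
# Sizes of the OpenSubtitles/v2018 subdirectories, copied from the listing.
# Assumption: the suffixes are binary units, so 239M is 239/1024 G.
SIZES_GIB = {
    "dic": 2.7,
    "freq": 239 / 1024,
    "mono": 62,
    "moses": 264,
    "parsed": 312,
    "raw": 101,
    "tmx": 268,
    "wordalign": 645,
    "xml": 636,
}


def remaining_gib(skip):
    """Return the total size in G after leaving out the subdirectories in skip."""
    return sum(size for name, size in SIZES_GIB.items() if name not in skip)


# The three scenarios discussed in this mail.
full = remaining_gib(set())
without_tmx_parsed = remaining_gib({"tmx", "parsed"})
without_wordalign = remaining_gib({"tmx", "parsed", "wordalign"})

for label, size in [
    ("everything", full),
    ("skip tmx and parsed", without_tmx_parsed),
    ("skip tmx, parsed and wordalign", without_wordalign),
]:
    print(f"{label}: {size:.0f}G")
```

<div class="">Other combinations can be checked by changing the set passed to remaining_gib.</div>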
<div class=""><br class="">
</div>
<div class="">Would there be space on abel for such a reduced form of OPUS?</div>
<div class="">
<div class="">
<div style="color: rgb(0, 0, 0); letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">
<div class="">All the best,</div>
<div class="">Jörg</div>
<div class=""><br class="">
</div>
<div class="">
<div apple-content-edited="true" class="">
<div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">
<div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">
<div class="" style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">
<span class="" style="orphans: 2; widows: 2;">********************************************************************************************</span><br class="" style="orphans: 2; widows: 2;">
<span class="" style="orphans: 2; widows: 2;">Jörg Tiedemann</span></div>
<div class="" style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">
<span class="" style="orphans: 2; widows: 2;">Language Technology</span></div>
<div class="" style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">
<span style="orphans: 2; widows: 2;" class="">University of Helsinki</span></div>
</div>
</div>
</div>
</div>
</div>
</div>
<br class="">
</div>
</body>
</html>