<html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> </head> <body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""> <div class=""><br class=""> </div> <div class="">The statistics in /proj/nlpl/data/OPUS/statistics.md</div> <div class="">should still be quite accurate for now.</div> <div class=""><br class=""> </div> <div class="">The core is actually quite small. If we only take the latest version of the untokenized data, this amounts to only 128GB. This is almost shocking to me. Many things are derived from that, and there is duplication that might make sense (like plain-text versions of all bitexts, which would be easier to integrate into common MT training workflows). But we could probably start with the untokenized versions in XML. Note that this does not include the sentence alignments, which are part of the folder of tokenized XML files.</div> <div class=""><br class=""> </div> <div class="">Do you need more breakdowns of the statistics?</div> <div class=""><br class=""> </div> <div apple-content-edited="true" class=""> <div style="color: rgb(0, 0, 0); letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""> <div style="color: rgb(0, 0, 0); letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""> <div class="" style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"> <span style="orphans: 2; widows: 2;" class="">Jörg</span></div> <div class="" style="word-wrap: break-word; -webkit-nbsp-mode: space; 
-webkit-line-break: after-white-space;"> <span class="" style="orphans: 2; widows: 2;"><br class=""> </span></div> <div class="" style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"> <span class="" style="orphans: 2; widows: 2;">********************************************************************************************</span><br class="" style="orphans: 2; widows: 2;"> <span class="" style="orphans: 2; widows: 2;">Jörg Tiedemann<span class="Apple-tab-span" style="white-space: pre;"> </span></span><a href="https://blogs.helsinki.fi/tiedeman/" class="">https://blogs.helsinki.fi/tiedeman/</a></div> <div class="" style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"> <span class="" style="orphans: 2; widows: 2;">Language Technology<span class="Apple-tab-span" style="white-space: pre;"> </span></span><a href="https://blogs.helsinki.fi/language-technology/" class="">https://blogs.helsinki.fi/language-technology/</a></div> <div class=""><span style="orphans: 2; widows: 2;" class="">University of Helsinki</span></div> </div> </div> </div> <br class=""> <div> <blockquote type="cite" class=""> <div class="">On 25 Sep 2018, at 15:14, Stephan Oepen &lt;<a href="mailto:oe@ifi.uio.no" class="">oe@ifi.uio.no</a>&gt; wrote:</div> <br class="Apple-interchange-newline"> <div class=""> <blockquote type="cite" class="">Just to make sure that I know what is happening to OPUS data:<br class=""> - the entire /proj/nlpl/data/OPUS is now synced with abel?<br class=""> - all of it is now also backed up to NIRD (in 2 places?)<br class=""> </blockquote> <br class=""> only the latter is done to date: complete daily copies of the full<br class=""> project directories on Abel and Taito to NIRD.<br class=""> <br class=""> so far, there is no automated replication (of OPUS) from Taito to<br class=""> Abel. disk space used to hold us back, and probably still does. 
i<br class=""> managed to get one extra terabyte from USIT last week, which was<br class=""> necessary to replicate ‘data/corpora/’ and increase the vectors<br class=""> repository. they signaled we could ask for a little more later this<br class=""> fall, as people are moving away from Abel ...<br class=""> <br class=""> can you estimate how much space would be required for replicating OPUS<br class=""> on Abel? are there parts that could be omitted, because they are not<br class=""> so regularly needed? if we move towards more parallelism, including<br class=""> NMT installations on Abel, i suppose OPUS would be a natural part of<br class=""> that, right?<br class=""> <br class=""> oe<br class=""> </div> </blockquote> </div> <br class=""> </body> </html>