<div><div dir="auto">okay! could you run something like ‘chmod -R g+w /projects/nlpl/data/OPUS’ on Abel, so that i can selectively delete those files?</div></div><div dir="auto"><br></div><div dir="auto">oe</div><div dir="auto"><br></div><div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, 13 Feb 2019 at 18:04 Tiedemann, Jörg &lt;<a href="mailto:jorg.tiedemann@helsinki.fi">jorg.tiedemann@helsinki.fi</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div style="word-wrap:break-word">
Hi,
<div><br>
</div>
<div>I would leave xml and raw as the primary data files, and you can delete “moses” and “mono”, which are both derived plain-text versions of the data. That should save enough space, I guess, since the OpenSubtitles corpus in particular takes up most of the space and
the moses directory alone occupies 264G; mono is another 31G. Would that be sufficient?</div>
<div><br>
</div>
<div>The reason for keeping raw is that it is the non-tokenized XML, which is probably more important than xml, the tokenized version. In many cases, people would like to apply their own tokenization/preprocessing pipeline to be consistent
with any downstream task later on.</div>
<div><br>
</div>
<div>The problem is that xml contains the sentence alignment files that you need to keep. It’s a bit mixed and therefore not easy to separate in cronjobs without lots of specific rules for excluding files etc. Can one of the sysadmins change the ownership
of the files on Abel so that you can run your cronjobs?</div>
<div><br>
</div>
<div><br>
<div>
<div style="color:rgb(0,0,0);letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;word-wrap:break-word">
<div style="color:rgb(0,0,0);letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;word-wrap:break-word">
<div style="color:rgb(0,0,0);letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;word-wrap:break-word">
<div style="word-wrap:break-word">
<span>All the best,</span></div>
<div style="word-wrap:break-word">
Jörg</div>
<div style="word-wrap:break-word">
<span><br>
</span></div>
<div style="word-wrap:break-word">
<span>********************************************************************************************</span></div></div></div></div></div></div></div><div style="word-wrap:break-word"><div><div><div style="color:rgb(0,0,0);letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;word-wrap:break-word"><div style="color:rgb(0,0,0);letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;word-wrap:break-word"><div style="color:rgb(0,0,0);letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;word-wrap:break-word"><div style="word-wrap:break-word"><br>
<span>Jörg Tiedemann</span></div>
<div style="word-wrap:break-word">
<span>Language Technology<span class="m_9165083973474320842Apple-tab-span" style="white-space:pre-wrap">
</span></span><a href="https://blogs.helsinki.fi/language-technology/" target="_blank">https://blogs.helsinki.fi/language-technology/</a></div>
<div><span>University of Helsinki</span></div>
</div>
</div>
</div>
</div>
<br>
<div>
<blockquote type="cite">
<div>On 11 Feb 2019, at 20:08, Stephan Oepen &lt;<a href="mailto:oe@ifi.uio.no" target="_blank">oe@ifi.uio.no</a>&gt; wrote:</div>
<br class="m_9165083973474320842Apple-interchange-newline">
<div>hi joerg,<br>
<br>
our NLPL partition on Abel has hit the disk quota limit (two<br>
terabytes), which means we cannot install software updates. i am<br>
afraid i would like to propose that we further restrict the OPUS<br>
mirror on Abel, as it accounts for by far the biggest ‘chunk’ of NLPL<br>
data (715 gigabytes currently on Abel). would it make sense to just<br>
keep the XML variants of the data (i am guessing the fairly bulky<br>
‘moses’ and ‘raw’ variants are derived)?<br>
<br>
more generally, i was planning to suggest that we move to automated<br>
mirroring of the most important parts of OPUS from Taito to Abel, as<br>
we do for most of the other data sub-directories now. could you (a)<br>
suggest an rsync(1) command to selectively copy from Taito to Abel and<br>
(b) temporarily ‘\rm -rf /projects/nlpl/data/OPUS’ on Abel?<br>
<br>
i could then include the rsync(1) in my nightly cron(5) job on Taito,<br>
such that for the selected parts at least the two copies would remain<br>
synchronized (because the cron(5) job runs in my account, i will have<br>
to be the owner of the rsync(1) target directory on Abel).<br>
<br>
best wishes, oe<br>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</blockquote></div></div>