[NLPL Task Force (A)] OPUS copy on Abel

Tiedemann, Jörg jorg.tiedemann at helsinki.fi
Thu Feb 14 19:41:51 UTC 2019


ok - done
Jörg

********************************************************************************************
Jörg Tiedemann
Language Technology https://blogs.helsinki.fi/language-technology/
University of Helsinki

On 13 Feb 2019, at 19:29, Stephan Oepen <oe at ifi.uio.no> wrote:

okay!  could you run something like ‘chmod -R g+w /projects/nlpl/data/OPUS’ on Abel, so i get to selectively delete those files?

oe


On Wed, 13 Feb 2019 at 18:04 Tiedemann, Jörg <jorg.tiedemann at helsinki.fi> wrote:
Hi,

I would leave xml and raw as the primary data files and you can delete “moses” and “mono”, which are both derived plain text data versions. That should save enough space, I guess, since the OpenSubtitles corpus takes up most of the space and the moses directory alone already occupies 264G; mono is another 31G. Would that be sufficient?
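
As a rough sketch of that cleanup, assuming the derived variants sit in directories literally named moses and mono somewhere below the OPUS root (the exact layout is an assumption, as is the helper name list_derived), one could list the candidates and their sizes before deleting anything:

```shell
#!/bin/sh
# Assumption: derived data lives in directories named "moses" or "mono"
# at some depth below the OPUS root; adjust the names if the layout differs.
OPUS_ROOT="${OPUS_ROOT:-/projects/nlpl/data/OPUS}"

# Print every directory named moses or mono, without descending into it.
list_derived() {
    find "$1" -type d \( -name moses -o -name mono \) -prune -print
}

# Dry run: show what would be removed and how much space each entry takes.
if [ -d "$OPUS_ROOT" ]; then
    list_derived "$OPUS_ROOT" | while IFS= read -r dir; do
        du -sh "$dir"
    done
fi

# Only after checking the list above would something like this delete them
# (left commented out on purpose):
# list_derived "$OPUS_ROOT" | while IFS= read -r dir; do rm -rf -- "$dir"; done
```

Nothing is removed by this sketch; the deletion is left as a commented-out line to run only once the listing looks right.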

The reason for keeping raw is that it is the non-tokenized XML, which is probably more important than xml, the tokenized version. In many cases, people would like to apply their own tokenization/preprocessing pipeline to be consistent with any downstream task later on.

The problem is that xml contains the sentence alignment files that you need to keep. It’s a bit mixed and therefore not easy to separate in cronjobs without lots of specific rules for excluding files etc. Can one of the sysadmins change the ownership of the files on Abel so that you can run your cronjobs?


All the best,
Jörg

********************************************************************************************

Jörg Tiedemann
Language Technology https://blogs.helsinki.fi/language-technology/
University of Helsinki

On 11 Feb 2019, at 20:08, Stephan Oepen <oe at ifi.uio.no> wrote:

hi joerg,

our NLPL partition on Abel has hit the disk quota limit (two
terabytes), which means we cannot install software updates.  i am
afraid i would like to propose that we further restrict the OPUS
mirror on Abel, as it accounts for by far the biggest ‘chunk’ of NLPL
data (715 gigabytes currently on Abel).  would it make sense to just
keep the XML variants of the data (i am guessing the fairly bulky
‘moses’ and ‘raw’ variants are derived)?

more generally, i was planning to suggest that we move to automated
mirroring of the most important parts of OPUS from Taito to Abel, as
we do for most of the other data sub-directories now.  could you (a)
suggest an rsync(1) command to selectively copy from Taito to Abel and
(b) temporarily ‘\rm -rf /projects/nlpl/data/OPUS’ on Abel?

i could then include the rsync(1) in my nightly cron(5) job on Taito,
such that for the selected parts at least the two copies would remain
synchronized (because the cron(5) job runs in my account, i will have
to be the owner of the rsync(1) target directory on Abel).
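
A minimal sketch of such a selective copy, assuming the choice from the reply above (skip directories named moses and mono, keep xml and raw). The Taito source path and the Abel login host are not given in this thread, so the ones in the usage comment are placeholders, and mirror_opus is a hypothetical helper name:

```shell
#!/bin/sh
# mirror_opus SRC DEST [extra rsync options...]
# Copies SRC to DEST in archive mode, skipping every directory named
# "moses" or "mono" at any depth (an assumption about the OPUS layout).
mirror_opus() {
    src=$1
    dest=$2
    shift 2
    # The trailing slash on the source copies its contents, not the
    # directory itself; a pattern like 'moses/' matches directories only.
    rsync -a --exclude='moses/' --exclude='mono/' "$@" "${src%/}/" "$dest"
}

# Example with placeholder paths and host; -n makes it a dry run:
# mirror_opus /path/on/taito/OPUS user@abel-host:/projects/nlpl/data/OPUS/ -n -v
```

Running it with -n first shows what would be transferred without touching the target, which seems prudent before putting the call into a nightly cron(5) job.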

best wishes, oe

