[NLPL Task Force (A)] OPUS copy on Abel
Stephan Oepen
oe at ifi.uio.no
Wed Feb 13 17:29:45 UTC 2019
okay! could you run something like ‘chmod -R g+w /projects/nlpl/data/OPUS’
on Abel, so i get to selectively delete those files?
oe
On Wed, 13 Feb 2019 at 18:04 Tiedemann, Jörg <jorg.tiedemann at helsinki.fi>
wrote:
> Hi,
>
> I would leave xml and raw as the primary data files and you can delete
> “moses” and “mono”, which are both derived plain text data versions. That
> should save enough space I guess as especially the OpenSubtitles corpus
> takes most of the space and the moses directory already occupies 264G. mono
> is another 31G. Would that be sufficient?
>
> The reason for keeping raw is because this is the non-tokenized XML, which
> is probably more important than xml, which is the tokenized version. In
> many cases, people would like to apply their own tokenization/preprocessing
> pipeline to be consistent with any downstream task later on.
>
> The problem is that xml contains the sentence alignment files that you
> need to keep. It’s a bit mixed and therefore not easy to separate in
> cronjobs without lots of specific rules for excluding files etc. Can some
> of the sysadmins change the ownership of the files on Abel so that you can run
> your cronjobs?
>
>
> All the best,
> Jörg
>
>
> ********************************************************************************************
>
> Jörg Tiedemann
> Language Technology https://blogs.helsinki.fi/language-technology/
> University of Helsinki
>
> On 11 Feb 2019, at 20:08, Stephan Oepen <oe at ifi.uio.no> wrote:
>
> hi joerg,
>
> our NLPL partition on Abel has hit the disk quota limit (two
> terabytes), which means we cannot install software updates. i am
> afraid i would like to propose that we further restrict the OPUS
> mirror on Abel, as it accounts for by far the biggest ‘chunk’ of NLPL
> data (715 gigabytes currently on Abel). would it make sense to just
> keep the XML variants of the data (i am guessing the fairly bulky
> ‘moses’ and ‘raw’ variants are derived)?
>
> more generally, i was planning to suggest that we move to automated
> mirroring of the most important parts of OPUS from Taito to Abel, as
> we do for most of the other data sub-directories now. could you (a)
> suggest an rsync(1) command to selectively copy from Taito to Abel and
> (b) temporarily ‘\rm -rf /projects/nlpl/data/OPUS’ on Abel?
>
> i could then include the rsync(1) in my nightly cron(5) job on Taito,
> such that for the selected parts at least the two copies would remain
> synchronized (because the cron(5) job runs in my account, i will have
> to be the owner of the rsync(1) target directory on Abel).
>
> best wishes, oe
>
>
>
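A selective rsync(1) along the lines discussed above (keep xml and raw, skip the derived moses and mono directories) might look roughly like the following. This is only a sketch: the source path on Taito and the Abel host name are assumptions that do not appear in the thread, and the exclude patterns assume that the derived data sits in directories literally named moses and mono.

```shell
#!/bin/sh
# Assumed values: neither the Taito path nor the Abel host name appears in the thread.
SRC=${SRC:-/path/on/taito/OPUS/}
DEST=${DEST:-abel.uio.no:/projects/nlpl/data/OPUS/}

# Skip the derived plain-text variants; keep xml (with alignments) and raw.
# A trailing slash makes each pattern match directories only, at any depth.
opus_sync() {
    # "$@" lets the caller prepend a command such as echo for inspection.
    "$@" rsync -a --exclude=moses/ --exclude=mono/ "$SRC" "$DEST"
}

# Print the command rather than run it; call opus_sync with no argument to transfer.
opus_sync echo
```

Printing the command first makes it easy to check the paths before anything is copied. Excluding by directory name avoids per-file rules, so the alignment files mixed into xml are left alone.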