[NLPL Task Force (A)] partial mirror of OPUS to abel
Tiedemann, Jörg
jorg.tiedemann at helsinki.fi
Wed Dec 19 20:04:16 UTC 2018
I have now started a sync to Abel of the raw XML files and sentence alignments only, with no other derived or pre-processed data formats. This should fit within the space constraints, and it constitutes the core of the OPUS data. For use in MT it may be easier to start with the plain-text (Moses-style) formats. I could also sync those to Abel if we want them available for immediate use in MT training; it's up to you in Oslo. Those files can, of course, also be generated from the XML, but that takes some time (especially for the big corpora).
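As a rough illustration of the XML-to-plain-text conversion mentioned above, here is a minimal Python sketch. It assumes the commonly seen OPUS layout: documents with `<s id="...">` sentences containing `<w>` tokens, and an alignment file whose `<link xtargets="SRC_IDS;TRG_IDS">` elements pair sentence ids. The function names and the tiny inline sample data are hypothetical and only for illustration. Real corpora are large and usually compressed, and each `linkGrp` points to its own pair of documents, so this is not a replacement for the existing OPUS tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature documents in the assumed OPUS-style layout.
SRC_XML = """<text>
<s id="1"><w>Hello</w><w>world</w><w>.</w></s>
<s id="2"><w>Bye</w><w>.</w></s>
<s id="3"><w>Extra</w></s>
<s id="4"><w>Unaligned</w></s>
</text>"""

TRG_XML = """<text>
<s id="1"><w>Hei</w><w>maailma</w><w>.</w></s>
<s id="2"><w>Heippa</w><w>.</w></s>
</text>"""

ALIGN_XML = """<cesAlign version="1.0">
<linkGrp targType="s" fromDoc="en/doc.xml" toDoc="fi/doc.xml">
<link id="SL1" xtargets="1;1" />
<link id="SL2" xtargets="2 3;2" />
<link id="SL3" xtargets="4;" />
</linkGrp>
</cesAlign>"""


def sentences_by_id(doc_xml):
    """Map sentence id -> space-joined sentence text for one document."""
    root = ET.fromstring(doc_xml)
    sents = {}
    for s in root.iter("s"):
        words = [w.text for w in s.iter("w") if w.text]
        if not words:
            # Untokenised sentence: fall back to its whitespace-normalised text.
            words = "".join(s.itertext()).split()
        sents[s.get("id")] = " ".join(words)
    return sents


def aligned_pairs(align_xml, src_xml, trg_xml):
    """Return (source, target) text pairs, skipping empty (1:0 or 0:1) links."""
    src = sentences_by_id(src_xml)
    trg = sentences_by_id(trg_xml)
    pairs = []
    for link in ET.fromstring(align_xml).iter("link"):
        src_ids, _, trg_ids = link.get("xtargets", "").partition(";")
        src_text = " ".join(src[i] for i in src_ids.split())
        trg_text = " ".join(trg[i] for i in trg_ids.split())
        if src_text and trg_text:
            pairs.append((src_text, trg_text))
    return pairs


def to_moses_lines(pairs):
    """Split pairs into two parallel line lists (one sentence unit per line)."""
    src_lines = [s for s, _ in pairs]
    trg_lines = [t for _, t in pairs]
    return src_lines, trg_lines


if __name__ == "__main__":
    for src_line, trg_line in aligned_pairs(ALIGN_XML, SRC_XML, TRG_XML):
        print(src_line, "|||", trg_line)
```

Writing the two line lists to a pair of files with matching line numbers gives the parallel plain-text form that MT toolkits typically expect as training input.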
I will monitor the current process and report on disk usage later tomorrow or so.
Jörg
********************************************************************************************
Jörg Tiedemann
Language Technology https://blogs.helsinki.fi/language-technology/
University of Helsinki
On 19 Dec 2018, at 15:43, Stephan Oepen <oe at ifi.uio.no> wrote:
hi joerg,
our storage on Abel was extended to two terabytes this fall, and
currently we have some 800 gigabytes available.
i feel i (still) know too little about OPUS to say whether a partial
replica on Abel would be beneficial to NLPL users? could you suggest
a sub-set (below 800 gigabytes) to mirror from Taito, and sketch a
typical use case? could we sketch the recipe for a user to train
their OpenNMT-py system (more or less) straight from the OPUS
directory?
cheers, oe
On Wed, Dec 19, 2018 at 2:37 PM Tiedemann, Jörg
<jorg.tiedemann at helsinki.fi> wrote:
This is especially for Stephan: One of the deliverables for this year in the OPUS activity is to create a partial mirror of OPUS data on Abel. So far, I still don’t really know what we would like to make available and how much space we have for that on Abel. In some sense, it could be enough to have the data available via the NIRD storage that you are already filling with OPUS data, right? That also counts as long-term storage, I guess. I also have the data in IDA here at CSC.
This is activity G1.4, and I wonder whether I have to do something about it:
http://wiki.nlpl.eu/index.php/Infrastructure/home
All the best,
Jörg