[NLPL Task Force (A)] /proj/nlpl/modules/ on Taito
Tiedemann, Jörg
jorg.tiedemann at helsinki.fi
Tue Sep 25 12:30:53 UTC 2018
The statistics in /proj/nlpl/data/OPUS/statistics.md
should still be quite accurate for now.
The core is actually quite small. If we only take the latest version of the untokenized data, it amounts to just 128GB. This is almost shocking to me …. Many things are derived from that, and there is duplication that might make sense (like plain text versions of all bitexts, which would be easier to integrate into common MT training workflows). But we could probably start with the untokenized versions in XML. Note that this does not include the sentence alignments, which are part of the folder of tokenized XML files.
Do you need more break-downs of statistics?
Jörg
********************************************************************************************
Jörg Tiedemann https://blogs.helsinki.fi/tiedeman/
Language Technology https://blogs.helsinki.fi/language-technology/
University of Helsinki
On 25 Sep 2018, at 15:14, Stephan Oepen <oe at ifi.uio.no> wrote:
Just to make sure that I know what is happening to OPUS data:
- the entire /proj/nlpl/data/OPUS is now synced with abel?
- all of it is now also backed up to NIRD (in 2 places?)
only the latter is done to date: complete daily copies of the full
project directories on Abel and Taito to NIRD.
so far, there is no automated replication (of OPUS) from Taito to
Abel. disk space used to hold us back, and probably still does. i
managed to get one extra terabyte from USIT last week, which was
necessary to replicate ‘data/corpora/’ and increase the vectors
repository. they signaled we could ask for a little more later this
fall, as people are moving away from Abel ...
can you estimate how much space would be required for replicating OPUS
on Abel? are there parts that could be omitted, because they are not
so regularly needed? if we move towards more parallelism, including
NMT installations on Abel, i suppose OPUS would be a natural part of
that, right?
oe