[NLPL Task Force (A)] mirror OPUS to Saga
Tiedemann, Jörg
jorg.tiedemann at helsinki.fi
Tue Sep 17 13:21:25 UTC 2019
> i would like to start to rsync(1) the OPUS data from Taito to Saga,
> but i am not sure whether i should use the complete '/proj/nlpl/OPUS/'
> directory from Taito? joerg, are there sensible exception patterns
> that i should use, say to initially limit the replication to file
> formats that we expect will actually be put to use on Saga? i believe
> we currently have available a total of eight terabytes in our
> norwegian community directory, though truth be told i am not quite
> sure (i believe we can increase our quota). is there a sensible
> sub-set of OPUS that we could accommodate in, say, four to five
> terabytes on Saga?
I would leave out
**/info/*
**/parsed/*
**/smt/*
**/tmx/*
Some of the corpora also come in several releases, and you probably only need the latest one. The release version is always one level below the sub-corpus name, and 'latest' is a symbolic link (I know, not a perfect solution) to the most recent release. Leaving out older releases could be important for big corpora like OpenSubtitles and ParaCrawl. To reduce the size even further, you could also leave out
**/moses/*
(those can be generated from the XML), but they are quite handy for getting started quickly with MT.
Also quite big and not really necessary right now are
OpenSubtitles/v2018/xml/*alt*
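
As a rough illustration, the patterns above could be handed to rsync(1) as --exclude options. The sketch below only assembles and prints the command line, with --dry-run switched on by default. The destination 'saga:/path/to/nlpl/OPUS/' and the helper name build_rsync_command are hypothetical placeholders; only the Taito path comes from this thread. It does not attempt to skip older releases, which would need per-corpus patterns.

```python
import shlex

# Patterns listed in the message above as safe to leave out.
EXCLUDES = [
    "**/info/*",
    "**/parsed/*",
    "**/smt/*",
    "**/tmx/*",
    "OpenSubtitles/v2018/xml/*alt*",
]

# Optional: Moses files can be regenerated from the XML,
# but are handy for getting started with MT.
OPTIONAL_EXCLUDES = ["**/moses/*"]


def build_rsync_command(source, destination, skip_moses=False, dry_run=True):
    """Return an rsync argument list that applies the exclusions above.

    Hypothetical helper, not part of any NLPL tooling.
    """
    command = ["rsync", "--archive", "--verbose"]
    if dry_run:
        # Only report what would be transferred; copy nothing.
        command.append("--dry-run")
    patterns = EXCLUDES + (OPTIONAL_EXCLUDES if skip_moses else [])
    for pattern in patterns:
        command.append("--exclude=" + pattern)
    command += [source, destination]
    return command


if __name__ == "__main__":
    # Assumed endpoints: the destination is a placeholder, not a real path.
    cmd = build_rsync_command("/proj/nlpl/OPUS/", "saga:/path/to/nlpl/OPUS/")
    print(shlex.join(cmd))
```

Checking the --dry-run output against the available quota before the real transfer seems prudent. Note that --archive copies symbolic links as links, so the 'latest' links would arrive as links rather than as duplicated data.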
>
> i have also started to update our software installation guide. with
> the separation of software and data on Puhti (and generally increasing
> diversity in path prefixes across systems), i would like to put more
> emphasis onto establishing environment variables $NLPLCODE and
> $NLPLDATA (and maybe others, as need be). also, i would like to leave
> open the possibility of other means of installing software than the
> modules system, down the road, hence suggest to add a new top-level
> entry '.../software/modules/', for those packages that are provisioned
> as modules, with module definition files now in
> '.../software/modules/etc/' (for backwards compatibility on Abel and
> Taito, i have created corresponding soft links).
>
> could you please see whether the moderately updated installation guide
> looks okay?
>
> http://wiki.nlpl.eu/index.php/Infrastructure/installation/guide
>
> more soon! oe