[NLPL Task Force (A)] mirror OPUS to Saga

Tiedemann, Jörg jorg.tiedemann at helsinki.fi
Tue Sep 17 13:21:25 UTC 2019



> i would like to start to rsync(1) the OPUS data from Taito to Saga,
> but i am not sure whether i should use the complete '/proj/nlpl/OPUS/'
> directory from Taito?  joerg, are there sensible exception patterns
> that i should use, say to initially limit the replication to file
> formats that we expect will actually be put to use on Saga?  i believe
> we currently have available a total of eight terabytes in our
> norwegian community directory, though truth be told i am not quite
> sure (i believe we can increase our quota).  is there a sensible
> sub-set of OPUS that we could accommodate in, say, four to five
> terabytes on Saga?

I would leave out:

**/info/*
**/parsed/*
**/smt/*
**/tmx/*

Some of the corpora also come in several releases, and you probably only need the latest one. The release directory is always one level below the sub-corpus name, and 'latest' is a symbolic link to the most recent release (I know, not a perfect solution). Leaving out older releases could matter for big corpora like OpenSubtitles and ParaCrawl. You could reduce the size even further by leaving out

**/moses/*

These files can be generated from the XML, but they are quite handy for getting started quickly with MT.
Also quite big and not really necessary right now are

OpenSubtitles/v2018/xml/*alt*
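
As a rough illustration, the patterns above could be turned into an rsync(1) call along the following lines. This is only a sketch under assumptions: the destination "saga:/path/to/OPUS/" is a placeholder, the helper names are made up for this example, and the assumption that each corpus directory holds its releases plus a 'latest' link follows the description above but has not been checked against the actual tree. The command is built with --dry-run by default so that nothing is copied until the file list has been inspected.

```python
import os
import shlex

# Exclusion patterns suggested in this message.
BASE_EXCLUDES = ["**/info/*", "**/parsed/*", "**/smt/*", "**/tmx/*"]
# Optional further reductions mentioned above.
OPTIONAL_EXCLUDES = ["**/moses/*", "OpenSubtitles/v2018/xml/*alt*"]


def old_release_excludes(opus_root):
    """Return anchored exclude patterns for releases that 'latest' does not point to.

    Assumes the layout <opus_root>/<corpus>/<release>/ with an optional
    'latest' symbolic link inside each corpus directory.
    """
    patterns = []
    for corpus in sorted(os.listdir(opus_root)):
        corpus_dir = os.path.join(opus_root, corpus)
        latest = os.path.join(corpus_dir, "latest")
        if not os.path.islink(latest):
            continue  # no 'latest' link: keep everything for this corpus
        target = os.path.basename(os.path.realpath(latest))
        for release in sorted(os.listdir(corpus_dir)):
            path = os.path.join(corpus_dir, release)
            if release in ("latest", target) or not os.path.isdir(path):
                continue
            # A leading slash anchors the pattern at the root of the transfer.
            patterns.append(f"/{corpus}/{release}/")
    return patterns


def build_rsync_command(source, dest, excludes, dry_run=True):
    """Assemble the rsync argument list; the caller decides whether to run it."""
    cmd = ["rsync", "--archive", "--verbose"]
    if dry_run:
        cmd.append("--dry-run")
    cmd += [f"--exclude={pattern}" for pattern in excludes]
    cmd += [source, dest]
    return cmd


if __name__ == "__main__":
    # Hypothetical destination; replace with the real path on Saga.
    command = build_rsync_command(
        "/proj/nlpl/OPUS/", "saga:/path/to/OPUS/", BASE_EXCLUDES
    )
    print(shlex.join(command))
```

Printing the command instead of executing it keeps the sketch harmless. Adding OPTIONAL_EXCLUDES, or the output of old_release_excludes() when run on the source side, to the exclude list would shrink the transfer further.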


> 
> i have also started to update our software installation guide.  with
> the separation of software and data on Puhti (and generally increasing
> diversity in path prefixes across systems), i would like to put more
> emphasis onto establishing environment variables $NLPLCODE and
> $NLPLDATA (and maybe others, as need be).  also, i would like to leave
> open the possibility of other means of installing software than the
> modules system, down the road, hence suggest to add a new top-level
> entry '.../software/modules/', for those packages that are provisioned
> as modules, with module definition files now in
> '.../software/modules/etc/' (for backwards compatibility on Abel and
> Taito, i have created corresponding soft links).
> 
> could you please see whether the moderately updated installation guide
> looks okay?
> 
> http://wiki.nlpl.eu/index.php/Infrastructure/installation/guide
> 
> more soon!  oe




