[NLPL Task Force (A)] Fwd: [abel-users] Planned downtime on Abel and BeeGFS filesystem 4-5 July - and future upgrade plans

Tue Jun 19 16:22:24 UTC 2018

just fyi: a full upgrade of the operating system on Abel is planned
this fall ... what excitement!

oe

---------- Forwarded message ----------
From: Jon Kerr Nilsen <j.k.nilsen at usit.uio.no>
Date: Tue, Jun 19, 2018 at 2:15 PM
Subject: [abel-users] Planned downtime on Abel and BeeGFS filesystem
4-5 July - and future upgrade plans
To: "abel-users at usit.uio.no" <abel-users at usit.uio.no>

Hi all,

Abel and its BeeGFS filesystem are coming of age, but we still see room
for improvements:

1) Currently the BeeGFS is running in two instances, one for home
areas and software modules (/cluster) and one for scratch workspace
and fast project storage (/work). Both of these see high peaks in
metadata load, but the peaks are not correlated in time. The problem
with peaks in metadata load is that the whole file system gets very
slow if the metadata servers become saturated. However, if we merge
the two instances into one larger instance with more metadata servers to
spread the load, we believe that we can significantly increase the
overall performance by reducing the time spent waiting for file
operations. Additionally, having the input and output data from jobs
in one filesystem and namespace might simplify and speed up some jobs.
2) The operating system on Abel, CentOS 6, is as old as Abel, and we
see more and more issues when building/upgrading software modules. To
continue to offer the software modules you need, we need to upgrade to
CentOS 7.

Now, while we of course want to make these improvements as painless as
possible for you users, some downtime will be needed. As a first step
we need to upgrade BeeGFS to the newest major version, BeeGFS 7.0, to
take advantage of some new migration features. This also means that
all app nodes mounting BeeGFS using the native BeeGFS client will need
a short downtime to upgrade the client. This brings us to the first
planned downtime, starting 8:30 on 4 July, lasting until 16:00 on 5
July, to coincide with electrical work in the machine room.

When this is in place, we will start the data migration needed to
merge /work and /cluster into one file system. To do this as smoothly
as possible, we kindly ask for your help:

- The less data we need to move around, the less downtime we'll need
to merge the data. Hence, if you all clean up as much as possible of
the data you no longer need on /cluster/home ($HOME), /work/users
($USERWORK) and /work/projects, we will be much obliged.
- Please always rely on environment variables (such as $HOME and
$USERWORK) rather than hard-coded paths, as we may switch the target
location at any time and on short notice.
- Please let us know beforehand if you need to write large amounts of
data (more than 10TB) to /work, as we will experience periods with
reduced capacity on /work.
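
As an illustration of the environment-variable advice above, here is a
minimal Python sketch of how a job script might build its output path from
$USERWORK instead of a hard-coded /work/users/... path. The function names
(storage_root, job_output_dir) and the job name "run1" are made up for this
example and are not part of any Abel tooling; only the variable name
USERWORK comes from the message.

```python
import os
from pathlib import Path


def storage_root(var, environ=None):
    """Return the directory named by the environment variable `var`.

    Raises KeyError if the variable is unset or empty, so that a job
    fails early instead of silently writing to a stale location.
    `environ` defaults to os.environ; passing a dict is useful in tests.
    """
    env = os.environ if environ is None else environ
    value = env.get(var)
    if not value:
        raise KeyError(f"{var} is not set")
    return Path(value)


def job_output_dir(job_name, environ=None):
    """Build an output directory under $USERWORK (hypothetical layout)."""
    return storage_root("USERWORK", environ) / job_name


if __name__ == "__main__":
    # Prints the resolved path; the actual value depends on the site setup.
    print(job_output_dir("run1"))
```

Because the path is looked up each time the script runs, the same script
should keep working if the directory behind $USERWORK is moved.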

As we speak, we are also working hard on building the most used
software modules for CentOS 7 to prepare for upgrading Abel. We will
come back to you with more details, but we are planning a major
downtime for Abel at the beginning of October, during which we will (1) merge
/cluster and /work, (2) upgrade to CentOS 7 and switch to xCAT for
provisioning and (3) upgrade to software modules built for CentOS 7.

Best regards,
Jon
on behalf of the Abel team

-- 
Jon Kerr Nilsen, PhD, Head of Group,
Research Infrastructure Services Group, USIT,
University of Oslo, Norway
email: j.k.nilsen at usit.uio.no
mob: +47 40 20 36 59
office: +47 22 84 09 69