Fwd: Status update on filestore and RCS communications
Please circulate to anybody who might find this useful...

-------- Forwarded Message --------
Subject: Status update on filestore and RCS communications
Date: Tue, 27 Apr 2021 19:23:49 +0000
From: Clifford, Simon J <s.clifford@imperial.ac.uk>
Reply-To: Mclean, Andrew <andrew.mclean@imperial.ac.uk>
To: Mclean, Andrew <andrew.mclean@imperial.ac.uk>

Hello all,

This email has two parts. The first is about the ongoing issues with the Research Data Service (RDS); the second is about our email communications.

The RDS provides around 11 petabytes of data storage space to the HPC cluster's compute and login nodes. It is also accessible as a shared drive. For over a year now it has been suffering an assortment of problems, which I will try to explain.

In 2019 we implemented a new parallel storage filesystem, GPFS, designed to scale to our workloads. It immediately exposed problems with some of the infrastructure switches of the cx2 nodes, which caused system instability. These switches should have worked, but did not, and it is not possible to replace, repair, or work around them. As a temporary measure a simpler filesystem, NFS, was put in place across the entire cluster. NFS does not perform well at the scale of our systems, but it is in some ways less sensitive to network instability. Most of the issues currently being experienced on the cluster are due to NFS struggling to keep up.

The best solution is to enable GPFS on the new cx3 nodes while leaving the older nodes on NFS; the reduced load on NFS should make its lack of scalability irrelevant. Work to implement this is ongoing. A major blocker is that the new infrastructure was added to our network using only IPv6 addresses, which conflicts with our existing mixed IPv4/IPv6 kit. A "big bang" implementation of the network redesign would require an outage of several weeks, so we have instead taken an incremental approach to keep the system available as much as possible. This has introduced some unforeseen issues, which you will have noticed over the last few weeks. ICT Networks, IBM, and others are assisting in resolving them. We anticipate that the migration of cx3 to GPFS will be complete before the end of May.

Please be reassured that none of these issues will affect the safety of your data. And while they are very frustrating, for most running jobs the 'hangs' are not terminal: the job will simply wait until the storage is available again.

Regarding the RCS's communications: it has become apparent, through feedback, that we are not communicating enough with our users. We apologise for this. We are still quite understaffed, and when a crisis appears our instinct is to fix it as soon as possible; time spent notifying users beyond a brief line on the status page (https://api.rcs.imperial.ac.uk/service-status) feels like time not spent addressing the problem. However, this mailing list is almost unused apart from reminders of service outages. We intend to use it more for matters strictly relevant to the cluster and the RDS, covering service issues and software installs as well as maintenance. We will be guided by your feedback on this.
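[Editor's note: for anyone who prefers to check the status page mentioned above programmatically rather than in a browser, here is a minimal Python sketch. It assumes only that the endpoint is reachable over HTTPS; the email does not document the response format, so the script simply prints the raw body.]

    # Minimal sketch: fetch the RCS service-status page and print its raw body.
    # Assumption (not stated in the email): the endpoint answers plain HTTPS GET
    # requests; no particular response schema is assumed.
    import urllib.request

    STATUS_URL = "https://api.rcs.imperial.ac.uk/service-status"

    def fetch_service_status(url: str = STATUS_URL, timeout: float = 10.0) -> str:
        """Return the raw response body from the status endpoint."""
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")

    if __name__ == "__main__":
        print(fetch_service_status())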