Fwd: Status update on filestore and RCS communications
Please circulate to anybody who might find this useful...

-------- Forwarded Message --------
Subject: Status update on filestore and RCS communications
Date: Tue, 27 Apr 2021 19:23:49 +0000
From: Clifford, Simon J <s.clifford@imperial.ac.uk>
Reply-To: Mclean, Andrew <andrew.mclean@imperial.ac.uk>
To: Mclean, Andrew <andrew.mclean@imperial.ac.uk>

Hello all,

This email has two parts. The first is about the ongoing issues with the Research Data Service (RDS); the second is about our email communications.

The RDS provides around 11 petabytes of data storage space to the HPC cluster's compute and login nodes. It is also accessible as a shared drive. For over a year now it has been suffering an assortment of problems, which I will try to explain.

In 2019 we implemented a new parallel storage filesystem, GPFS, designed to scale to our workloads. It immediately exposed problems with some of the infrastructure switches of the cx2 nodes, which caused system instability. These switches should have worked, but did not, and it is not possible to replace, repair, or work around them. As a temporary measure a simpler filesystem, NFS, was put in place across the entire cluster. NFS does not perform well at the scale of our systems, but it is in some ways less sensitive to network instability. Most of the issues currently being experienced on the cluster are due to NFS struggling to keep up.

The best solution is to enable GPFS on the new cx3 nodes while leaving the older nodes on NFS; the reduced load on NFS should make its lack of scalability irrelevant. Work to implement this is ongoing. A major blocker is that the new infrastructure was added to our network using only IPv6 addresses, which conflicts with our existing mixed IPv4/IPv6 kit. A "big bang" implementation of the network redesign would require an outage of several weeks, so we have instead taken an incremental approach to keep the system available as much as possible. This has introduced some unforeseen issues, which you will have noticed over the last few weeks. ICT Networks, IBM, and others are assisting in resolving them. We anticipate that the migration of cx3 to GPFS will be complete before the end of May.

Please be reassured that none of these issues will affect the safety of your data. And while they are very frustrating, for most running jobs the 'hangs' are not terminal: the job will simply wait until the storage is available again.

Regarding the RCS's communications: it has become apparent, through feedback, that we are not communicating enough with our users. We apologise for this. We are still quite understaffed, and when a crisis appears our instinct is to fix it as soon as possible; time spent notifying users beyond a brief line on the status page (https://api.rcs.imperial.ac.uk/service-status) feels like time not spent addressing the problem. However, this mailing list is almost unused apart from reminders of service outages. We intend to use it more for matters strictly relevant to the cluster and the RDS, covering service issues and software installs as well as maintenance. We will be guided by your feedback on this.
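[Editor's note: for anyone who prefers to check the status page mentioned above programmatically rather than in a browser, here is a minimal Python sketch. It assumes only that the endpoint is reachable over HTTPS; the email does not document the response format, so the script simply prints the raw body.]

    # Minimal sketch: fetch the RCS service-status page and print its raw body.
    # Assumption (not stated in the email): the endpoint answers plain HTTPS GET
    # requests; no particular response schema is assumed.
    import urllib.request

    STATUS_URL = "https://api.rcs.imperial.ac.uk/service-status"

    def fetch_service_status(url: str = STATUS_URL, timeout: float = 10.0) -> str:
        """Return the raw response body from the status endpoint."""
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")

    if __name__ == "__main__":
        print(fetch_service_status())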