- CSD3 operational
- Wilkes2 operational
- Darwin operational
- Research Data Store (RDS) operational
- Research File Share (RFS) operational
- Research Cold Store (RCS) operational
- Research Cloud Platform operational
Storage platform outage
RFS is currently unavailable and we are investigating the issue. We will update you shortly on the status and expected recovery time. We apologise for the inconvenience this is causing.
Update 2018-04-17 14:11 UTC
The service has been restored. Please note that the service is currently running in degraded mode, with offsite snapshot replication disabled. Initial investigation has shown that the crash was caused by a failure of the replication service. The investigation also uncovered that the snapshot and replication services stopped working on 10 April and that this was not reported by the storage platform monitoring service. This means that we have no snapshot coverage for the last 7 days. We have escalated this issue with the storage vendor with the aim of restoring replication as soon as possible and fixing the root causes. In the meantime we have enabled the snapshot service without offsite replication. We apologise for the interruption to the service and thank you for your patience.
An issue with the storage platform has caused a short outage of RDS storage.
Storage platform outage
An issue with the storage platform has caused an outage of a number of services. We are investigating the root cause and will update you as soon as we have more information.
Update 2018-03-24 23:11 UTC
Access to the services has been restored. We have identified a problem with one of the storage servers which requires further investigation with the vendor support team. We are currently running with no failover capability, so please consider the services to be "at risk". Please let us know if you notice any issues. Thank you for your patience; we apologise for any inconvenience caused by the service disruption.
Power work at Soulsby server room
Work will take place in the Soulsby Building to replace the current single UPS unit with a new pair of UPS units. This work will simultaneously remove a single point of failure, replace out-of-date technology and enable routine maintenance to be carried out on the UPS system.
- This will require two outages:
first, on Sunday 18 March, when the existing UPS will be disconnected.
second, on Sunday 25 March, when the new UPS system will be connected.
- Affected services:
Research File Share - offsite replication outage
Research Cold Store - offsite replication outage
Research Computing Cloud - Soulsby AZ outage
- Each outage will start on the Saturday evening preceding Sunday 18 and Sunday 25 March respectively.
Storage Services Outage
There is currently an issue with connectivity into the service; we are aware of it and are working urgently to resolve it. The issue is network-based and the underlying storage is unaffected.
All connections will currently hang while trying to re-establish communication with the service.
We apologise for the inconvenience and are working to restore connectivity as soon as possible. We will provide more information as we progress.
Update 2018-02-28 22:40 UTC
Access to storage services has been restored.
The issue did not affect the underlying storage in any way; it was instead an issue with the virtualised routers and SSH/SFTP gateway nodes that provide access to the platform. At around 16:06 today, a software update to portions of the virtualisation platform on which these gateway nodes were running failed in an unanticipated way and disrupted connectivity to these services.
We apologise again for the ongoing disruption. We will be reviewing the incident to ensure steps are taken to prevent this type of failure happening again.
WBIC HPHI Platform network interruption
We're aware of a networking interruption to the cluster and to other OpenStack-hosted resources, and are working to restore access.
Update 2018-03-01 00:40 UTC
This interruption to networking on the OpenStack cloud should now be resolved, and WBIC-HPHI should be reachable again. Apologies for the unexpected disruption to open sessions and the inconvenience that this may have caused.
The cause of the interruption was tracked down to interference on compute host software-defined switches by an automated upgrade director, which was performing an upgrade of the OpenStack control plane software at the time.
RCP software upgrade Tuesday 27th February 10:00
Please note there will be an upgrade to the control plane software in use on the OpenStack-based Research Cloud Platform, commencing 10:00 Tuesday 27th February.
Running instances shouldn't be affected during the upgrade process, but inter-instance and instance-external network connectivity may be affected while the upgrade work progresses. The service should therefore be considered at risk during this time.
CSD3 maintenance Tuesday 20th February 10:00
Please note that there will be maintenance on Tuesday 20th next week, commencing at 10:00.
At 10:00 all CSD3 login nodes will reboot, so please make sure that you have saved all files, cleanly quit all applications (including remote desktops) and logged off before 10:00 on Tuesday. Darwin/Biocloud login nodes and other non-migrated private login nodes will not be affected (unless a severe security issue affecting RHEL6 becomes known between now and Tuesday).
The login nodes should return quickly, but job scheduling on both Darwin and CSD3 will be suspended for a few hours to allow us to investigate the "Remote I/O error" problems that have recently been appearing from time to time. Some updates to the Lustre servers will also be applied. Running jobs should continue to run, but may experience brief pauses (these pauses will also be visible on Darwin and CSD3 login nodes). During maintenance the service should be considered at risk, since some of the work may have unexpected effects.
Update 2018-02-20 11:51 UTC
Access to the CSD3 login nodes is now available again, with the exception of login-e-5, which has a disk problem. This means there is a chance that attempts to connect to login-gpu will hang; if this happens to you, please retry by connecting to login-e-1.hpc.cam.ac.uk explicitly.
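For users who script their connections, the login-gpu workaround could be automated along the following lines. This is only a minimal sketch, not an official tool: the helper `first_reachable` is hypothetical, and the fully qualified name `login-gpu.hpc.cam.ac.uk` is an assumption (only `login-e-1.hpc.cam.ac.uk` is given in this notice). It checks TCP reachability on port 22 only, so a node that accepts the connection but then hangs during login would not be detected.

```python
import socket

# login-e-1.hpc.cam.ac.uk is named in the notice above; the fully
# qualified login-gpu name is an assumption made for illustration.
CANDIDATE_HOSTS = ["login-gpu.hpc.cam.ac.uk", "login-e-1.hpc.cam.ac.uk"]


def first_reachable(hosts, port=22, timeout=5.0,
                    connect=socket.create_connection):
    """Return the first host that accepts a TCP connection, or None.

    `connect` is injectable so the logic can be exercised without
    touching the network.
    """
    for host in hosts:
        try:
            conn = connect((host, port), timeout=timeout)
        except OSError:
            # Covers refused connections, timeouts and DNS failures.
            continue
        conn.close()
        return host
    return None
```

Calling `first_reachable(CANDIDATE_HOSTS)` returns the name of the first host that answered within the timeout, which can then be passed to `ssh`; if it returns `None`, no listed login node was reachable.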
Please note that maintenance is ongoing and the job queues on both CSD3 and Darwin are currently suspended. Some brief pauses in responsiveness may be observed as work on the filesystem servers continues.
RCS tape system issue
The tape robot in the primary library has failed. Restoring data stored on the tape system will not be possible until the robot mechanism is repaired.
Update 2018-02-03 12:00 UTC
The tape system has been repaired. Research Cold Store is now fully operational.
RFS replication issue
The RFS replication service between data centres has failed, and we are investigating the root cause. RFS snapshots for the period 28-Jan until 1-Feb will be unavailable.
Update 2018-02-01 22:00 UTC
The replication service has been restored. Please note that the snapshots between 28-Jan and 1-Feb are not available. We apologise for any inconvenience that this may have caused.