All systems operational.

Systems

Incidents

2018-04-17 11:56 UTC resolved Research File Share
Storage platform outage

RFS is currently unavailable and we are investigating the issue. We will update you shortly on the status and expected recovery time. We apologise for the inconvenience that this is causing.

Update 2018-04-17 14:11 UTC

The service has been restored. Please note that it is currently running in degraded mode, with offsite snapshot replication disabled.

Initial investigation has shown that the crash was caused by a failure of the replication service. The investigation also uncovered that the snapshot and replication services stopped working on 10 April, and that this was not reported by the storage platform monitoring service; as a result, we have no snapshot coverage for the last seven days. We have escalated this issue to the storage vendor with the aim of restoring replication as soon as possible and fixing the root causes. In the meantime we have enabled the snapshot service without offsite replication. We apologise for the interruption to the service and thank you for your patience.

2018-04-17 11:00 UTC resolved Research Data Store
RDS outage

An issue with the storage platform caused a short outage of RDS storage.

2018-03-24 11:00 UTC resolved Research File Share Research Cold Store Research Computing Cloud
Storage platform outage

An issue with the storage platform has caused an outage of a number of services. We are investigating the root cause and will update you as soon as we know more.

Update 2018-03-24 23:11 UTC

Access to the services has been restored. We have identified a problem with one of the storage servers which requires further investigation with the vendor support team. We are currently running with no failover capability, so please consider the services "at risk". Please let us know if you notice any issues. Thank you for your patience, and we apologise for any inconvenience caused by the service disruption.

2018-03-16 16:15 UTC resolved Research File Share Research Cold Store Research Computing Cloud
Power work at Soulsby server room

Work is taking place in the Soulsby Building to replace the current single UPS unit with a new pair of UPS units. This work will simultaneously remove a single point of failure, replace out-of-date technology and enable routine maintenance to be completed on the UPS system.

  1. This will require two outages:

     - first, on Sunday 18 March, when the existing UPS will be disconnected;

     - second, on Sunday 25 March, when the new UPS system will be connected.

  2. Affected services:

     - Research File Share - offsite replication outage

     - Research Cold Store - offsite replication outage

     - Research Computing Cloud - Soulsby AZ outage

  3. Each outage will start on the Saturday evening preceding the relevant Sunday (18 and 25 March).

2018-02-28 16:15 UTC resolved Research File Share Research Data Store Research Cold Store
Storage Services Outage

There is currently an issue with connectivity into the service, which we are aware of and urgently working to resolve. The issue is network-related and the underlying storage is unaffected.

Currently, all connections will hang while trying to re-establish communication with the service.

We apologise for the inconvenience and are working to restore connectivity as soon as possible. We will post updates with more information as we progress.

Update 2018-02-28 22:40 UTC

Access to the storage services has been restored.

The issue did not affect the underlying storage in any way; rather, it was an issue with our virtualised routers and SSH/SFTP gateway nodes that provide access to the platform. Around 16:06 today, a software update to portions of the virtualisation platform on which these gateway nodes were running failed in an unanticipated way and disrupted connectivity to these services.

We apologise again for the disruption. We will be reviewing the incident to ensure that steps are taken to prevent this type of failure from happening again.

2018-02-28 16:15 UTC resolved Research Cloud Platform
WBIC HPHI Platform network interruption

We are aware of a networking interruption affecting the cluster and other OpenStack-hosted resources, and are working on restoring access.

Update 2018-03-01 00:40 UTC

This interruption to networking on the OpenStack cloud should now be resolved, and WBIC-HPHI should be reachable again. Apologies for the unexpected disruption to open sessions and the inconvenience that this may have caused.

The cause of the interruption was tracked down to interference with the software-defined switches on the compute hosts by an automated upgrade director, which was performing an upgrade of the OpenStack control plane software at the time.

2018-02-21 10:00 UTC resolved Research Cloud Platform
RCP software upgrade Tuesday 27th February 10:00

Please note that there will be an upgrade to the control plane software in use on the OpenStack-based Research Cloud Platform, commencing at 10:00 on Tuesday 27th February.

Running instances should not be affected during the upgrade process, but inter-instance and instance-to-external network connectivity may be affected while the upgrade work progresses. As such, the service should be considered at risk during this time.

2018-02-20 10:00 UTC resolved CSD3
CSD3 maintenance Tuesday 20th February 10:00

Please note that there will be maintenance next week, on Tuesday 20th, commencing at 10:00.

At 10:00 all CSD3 login nodes will reboot, so please make sure you have saved all files, quit all applications (including remote desktops) cleanly, and logged off before 10:00 on Tuesday. Darwin/Biocloud login nodes and other non-migrated private login nodes will not be affected (unless a severe security issue affecting RHEL6 becomes known between now and Tuesday).

The login nodes should return quickly, but job scheduling on both Darwin and CSD3 will be suspended for a few hours to allow us to investigate the recent "Remote I/O error" problems that have been appearing from time to time. Some updates to the Lustre servers will also be applied. Running jobs should continue to run, but may experience brief pauses (these pauses will also be visible on the Darwin and CSD3 login nodes). During maintenance the service should be considered at risk, since some of the work may have unexpected effects.

Update 2018-02-20 11:51 UTC

Access to the CSD3 login nodes is now available again, with the exception of login-e-5, which has a disk problem. This means there is a chance that attempts to connect to login-gpu will hang; if this happens to you, please retry by connecting to login-e-1.hpc.cam.ac.uk explicitly.

Please note that maintenance is ongoing and the job queues on both CSD3 and Darwin are currently suspended. Some brief pauses in responsiveness may be observed as work on the filesystem servers continues.

2018-02-01 10:00 UTC resolved Research Cold Store
RCS tape system issue

The tape robot in the primary library has failed. Restoring data stored on the tape system will not be possible until the robot mechanism is repaired.

Update 2018-02-03 12:00 UTC

The tape system has been repaired. Research Cold Store is now fully operational.

2018-02-01 10:00 UTC resolved Research File Share
RFS replication issue

The RFS replication service between data centres has failed, and we are investigating the root cause. RFS snapshots for the period 28-Jan to 1-Feb will be unavailable.

Update 2018-02-01 22:00 UTC

The replication service has been restored. Please note that snapshots between 28-Jan and 1-Feb are not available. We apologise for any inconvenience this may have caused.