Cluster Outage September 2025

Unplanned Outage - September 8, 2025

On Monday, September 8, we took a planned full day of downtime for major changes to the cluster software. This included an upgrade from Slurm 24.05 to Slurm 25.05, which must be carried out in several steps. Upgrading Slurm's database component was time-consuming, and the remaining maintenance steps, node reboots and driver updates, were not completed until late in the evening.

On Tuesday morning, we discovered that the upgraded Slurm packages had not been compiled with a feature we require for correct GPU configuration. We chose what we believed to be the most expedient option to restore full service: a downgrade to the previous known-working version.

We encountered two major difficulties during the downgrade, which revealed flaws in our deployment and testing processes. To reach a known state, we needed to reinstall the operating system on every compute node. This operation exceeded the capacity of our deployment service, and as a result the majority of our systems were left in an intermediate state requiring individual attention. At the same time, due to an oversight in the upgrade plan, a simple downgrade of the controller and database was not sufficient, and we were unable to restore from backup. To restore service, we re-deployed the cluster from scratch on a clean database.
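One lesson from the reinstall overload is to throttle mass operations to the deployment service's capacity rather than submitting every node at once. The sketch below illustrates the idea only; the node names, the capacity limit, and the helper function are hypothetical and not part of our actual tooling:

```python
def reinstall_batches(nodes, capacity):
    """Split a node list into batches no larger than the deployment
    service's concurrent-reinstall capacity (a hypothetical limit)."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    return [nodes[i:i + capacity] for i in range(0, len(nodes), capacity)]

# Example: five nodes, a service that can handle two reinstalls at a time.
nodes = [f"node{n:02d}" for n in range(1, 6)]
for batch in reinstall_batches(nodes, 2):
    # In real tooling, this batch would be submitted and allowed to
    # finish before the next batch is started.
    print(batch)
```

Serializing batches this way trades elapsed time for predictability: no node is left half-deployed because the service was saturated.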

By Wednesday morning, the scheduler had resumed normal operation.

We take seriously the impact this outage has had on your projects, and we are improving our testing and deployment processes based on the lessons learned. We expect future maintenance to be more reliable and more efficient. Our goal is to provide you with a stable, predictable compute environment, and we are committed to learning from this incident to prevent similar issues in the future.