Yesterday at 15:40 our monitoring identified degraded performance on one of our hypervisors, which hosts a small subset of MSD instances.
Within the first 20 minutes we had identified the root cause and begun notifying affected customers, while working in parallel to resolve the incident.
By 16:30 we had fully recovered and all instances were back to normal.
As a precaution we have since moved everything off the offending hypervisor, so that we can verify its stability before it is used again.
The root cause of the problem was a disk failure in the RAID array. On its own this should not have been an issue: we lose a disk in one of our many hypervisors on average once every two months, and the array simply fails over to the hot standby and carries on. In fact we lost a disk in this very hypervisor at the start of this week, and customers were unaware of any problem.
Unfortunately the replacement disk appears to have had a manufacturing defect and became unusable very quickly, and because the array had not yet finished rebuilding from the previous disk failure, no hot standby was available.
Going forward, we have implemented a number of checks to ensure that a disk is error free and burned in before it is added to the array.
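For readers curious what such a burn-in gate can look like, below is a minimal sketch. It assumes the smartmontools `smartctl` utility (whose exit status is a documented bitmask, see smartctl(8)); the device path, thresholds, and verdict strings are purely illustrative and not a description of our exact tooling.

```python
# Hedged sketch of a pre-deployment disk check: decode smartctl's exit
# bitmask into a go/no-go decision before the disk joins the array.
# Bit values are from smartctl(8): bit 3 (0x08) means SMART reports the
# disk failing; bits 6-7 (0xC0) mean the error or self-test logs
# contain entries.

SMART_DISK_FAILING = 0x08   # bit 3: SMART status "DISK FAILING"
SMART_ERROR_LOGS = 0xC0     # bits 6-7: error / self-test log entries

def burn_in_verdict(smartctl_exit_status: int) -> str:
    """Decode the smartctl exit bitmask into a burn-in gate decision."""
    if smartctl_exit_status & SMART_DISK_FAILING:
        return "FAIL: SMART reports disk failing"
    if smartctl_exit_status & SMART_ERROR_LOGS:
        return "WARN: errors logged; inspect before adding to array"
    return "OK: no SMART failure bits set"

# In practice the status would come from something like
#   subprocess.run(["smartctl", "-H", "/dev/sdX"]).returncode
# after a long self-test (smartctl -t long) and a destructive write
# pass (e.g. badblocks -wsv) have completed on the new disk.
print(burn_in_verdict(0))   # → OK: no SMART failure bits set
print(burn_in_verdict(8))   # → FAIL: SMART reports disk failing
```

The point of gating on both the health bit and the log bits is that a disk can pass an instantaneous health check while still accumulating errors during the burn-in, which is exactly the failure mode seen here.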