COVE Performance Issues
For the week ended Friday 10/26, PBS suffered a series of operational events on the MySQL database cluster that serves the COVE API. Over the course of a 36-hour period we saw multiple failures of the master database, which corrupted the data image and thereby prevented the promotion of RDS slaves. This sequence of events recurred several times during the 36-hour period and resulted in a total downtime of 7 hours and 31 minutes and a total uptime of 94.93%.
During these periods of downtime, multiple applications may have been partially degraded, depending on the duration of the individual outage. The affected applications include mobileweb, the iOS apps, and the COVE portal. During these outages, all publishing systems (Merlin) remained operational, although newly published assets may have been delayed in appearing in the COVE API.
During the outage we worked closely with Amazon RDS support and were able to identify the root cause as a bug in the underlying MySQL engine that we were using. This bug was exposed during several memory-intensive operations (e.g. large temp table joins). To address the issue we upgraded to a newer version of the MySQL engine and refactored several of the MySQL queries. All of these fixes were in place by approximately noon on Friday 10/26.
Since that time we have not seen a recurrence of the behavior and uptime has been exactly 100%.
During the month of September the total uptime was 99.99% as a result of 3 outages that lasted under 1 minute.
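For readers who want to check the uptime percentages, the arithmetic can be sketched as follows. This is an illustrative calculation only: the helper name uptime_percent is hypothetical, and it assumes September is measured over a full 30 days with the three outages bounded at 1 minute each. The measurement window behind the 94.93% figure is not stated in this report, so that figure is not reproduced here.

```python
from datetime import timedelta


def uptime_percent(downtime: timedelta, window: timedelta) -> float:
    """Return uptime as a percentage of the measurement window (hypothetical helper)."""
    if window <= timedelta(0):
        raise ValueError("window must be positive")
    if downtime < timedelta(0) or downtime > window:
        raise ValueError("downtime must be between zero and the window length")
    # Dividing one timedelta by another yields a plain float ratio.
    return 100.0 * (1.0 - downtime / window)


# Assumption: September = 30 days; 3 outages of under 1 minute each,
# so 3 minutes is an upper bound on total downtime.
september = uptime_percent(timedelta(minutes=3), timedelta(days=30))
print(f"{september:.2f}%")
```

Because each outage lasted under a minute, 3 minutes is a worst case, and the result is consistent with the 99.99% reported for September.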