Yes, just like Once Upon A Database, it happens in real life ...
Incident Report for <redacted>
On Tuesday, September 5, 2017, the <redacted> platform was impacted from 15:56 to 16:53 Eastern Time (ET). During this time, customers and their visitors could not access websites or any other <redacted> service. We recognize the severity of this incident and we sincerely apologize to our customers and their visitors. We now fully understand the root cause of the issue and have resolved it.
TIMELINE AND ROOT CAUSE
When our systems were impacted at 15:56 ET on Tuesday, September 5, 2017, our monitoring system alerted our engineers, who responded immediately.
Our initial investigation revealed that the system running our main <redacted> database was under extremely high load, following a planned and routine operation on the database.
Initial attempts to fail over the database systems and restart the services did not fix the issue.
The failure of these remedies and techniques, which we’ve used in the past, was unprecedented. Our team then began a more thorough analysis to identify and explain the problem.
At 16:19 ET, we concluded that the planned removal of an index had a disastrous effect on the database’s performance.
Our investigation showed that another index, assumed to have already been in existence, was missing.
A few minutes later, at 16:23, our engineers executed the commands to build the missing index, but because of the high load on our database systems, the build proceeded very slowly.
At 16:34, we took further action by disabling all the services across our systems fleet that were demanding database resources.
This had the desired effect on the database system load and accelerated the index creation. The index finished building and we started seeing recovery at 16:46. Finally, at 16:53, our platform and the services it provides were fully recovered and operating normally.
The engineering team has reviewed this incident and is implementing a series of changes to our tests and procedures for index changes and removals to prevent a recurrence.
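The report does not describe what those procedural changes look like. One plausible safeguard, sketched here as an assumption and not as the team's actual procedure, is to verify that the index expected to take over really exists before dropping the old one. The example again uses SQLite as a stand-in, and index_exists, drop_index_safely, and all index names are hypothetical.

```python
import sqlite3


def index_exists(conn, index_name):
    """Return True if an index with this name exists in the SQLite schema."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'index' AND name = ?",
        (index_name,),
    ).fetchone()
    return row is not None


def drop_index_safely(conn, index_to_drop, required_replacement):
    """Drop an index only if the index meant to replace it is present."""
    if not index_exists(conn, required_replacement):
        raise RuntimeError(
            f"refusing to drop {index_to_drop!r}: "
            f"replacement index {required_replacement!r} does not exist"
        )
    # Identifiers cannot be bound as parameters, so quote the name by hand.
    quoted = '"' + index_to_drop.replace('"', '""') + '"'
    conn.execute("DROP INDEX " + quoted)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, site_id INTEGER)")
    conn.execute("CREATE INDEX idx_old ON pages (site_id)")
    try:
        # idx_new was assumed to exist but was never created.
        drop_index_safely(conn, "idx_old", "idx_new")
    except RuntimeError as err:
        print(err)
```

A check like this only confirms that the replacement exists by name. A fuller procedure would also compare query plans for the affected queries in a staging environment before the change reaches production.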
A FINAL WORD
We take all outages very seriously and we are always working to improve our service’s reliability and uptime. Like many of our customers, we rely on <redacted> for our own business and livelihoods. This event was an especially difficult one for us, as we fell short of the high standards we set for ourselves and our service.
We hope that by being transparent about the causes, conclusions, and lessons from incidents such as this one, we can continue to build trust with our customers and offer reassurance that the reliability of all <redacted> products remains our number one priority.
<Redacted> | SVP Engineering | <Redacted>
Posted Sep 07, 2017 - 10:33 EDT
This incident has been resolved and we have confirmed that all systems are operational.
Posted Sep 05, 2017 - 17:14 EDT
A resolution has been implemented and we are monitoring closely.
Posted Sep 05, 2017 - 16:51 EDT
We have identified an issue with our database layer and we are implementing a resolution.
Posted Sep 05, 2017 - 16:13 EDT
We are investigating connectivity issues related to most <redacted> sites. We will provide more information as soon as possible.
Posted Sep 05, 2017 - 16:02 EDT