The Value in Service Failure

During my time at Amazon, my team deployed some bad code that caused a multi-hour customer-impacting service outage. I ended up being responsible for interviewing team members, establishing facts, recording lessons learned, and summarizing the findings in a report. I learned a little about engineering, and a lot about people — enough for me to share some research and dispense some concrete advice for the future.

While it is instinctive to downplay failure, smart leaders and organizations are prepared to squeeze value out of a negative outcome. What value, you ask?

Indeed, the opportunity cost of fumbling a failure is high.

Amazon’s process for learning from failure requires the team responsible to write a Correction of Errors document according to a template. The artifacts of the process include the COE document, Corrective Actions, and Lessons Learned. COE documents are then analyzed in aggregate; if a pattern is found, learnings are disseminated company-wide and new policies are created to break the cycle. No matter what, COE documents are always accessible by every employee of the company (unless they divulge company-secret information). This concept was easily my favourite part of Amazon’s culture, but I can tell you from experience it’s harder to execute than it sounds.

My investigation for Amazon highlighted some important truths:

Blame is natural, contagious, and destructive. Social psychologists call it self-serving attribution bias: when our ego is threatened, we tend to attribute good outcomes to ourselves and bad outcomes to others. This natural tendency helps us preserve our self-esteem in the face of failure, but hinders our ability to learn from our mistakes. In a group setting, blame ricochets through the team, each person instinctively deflecting to the next until all trust is lost. Morale and team performance suffer.

Self-assessment is constructive. Self-assessment is defined academically as assessing and evaluating the quality of one’s own work. In our scenario, this means asking the question, “What could I have done differently to improve the outcome?” (Note that the question is not “What did I do wrong?”) The practice of self-assessment is beneficial to our learning and development at any time, but I think it is especially important in the wake of a failure. On one hand, those not directly involved in the event learn how to better support their colleagues and improve the performance of the team. On the other hand, those who believe that they are directly responsible for the event feel supported, rather than threatened. Self-serving attribution bias is a threat response, thus the psychological safety induced by colleagues’ support means that we are less likely to pass the buck, and more likely to assess our own actions and share the lessons we learn.

Human error is not a root cause; it’s a law of nature. Attributing a failure event to a human is about as productive as blaming gravity when you trip and fall, because it is not possible to eliminate human error. Nor can most organizations completely remove human influence from their mission-critical systems. That’s why the goal of any post-mortem investigation must be to find opportunities to reduce the likelihood and impact of human error. This is much more difficult than laying blame, but it is the only way to extract the full value of a failure.

It is clear that some elements of the human condition can be stumbling blocks in the wake of a negative event. What, then, is the right post-failure strategy?


The Right Post-Failure Strategy, Probably

What follows is a post-failure analysis framework. Inspired by Amazon’s COE process, shaped by my personal experience, and generalized to be broadly applicable, it aims to account for the realities of bias, emotion, and fallacy in order to extract the maximum value from a service failure. It is my attempt to organize the above information into actionable advice, and I intend to use and improve it as my career progresses. I hope that you, the reader, can also find some use for it.

How to Use It

Consider first converting this framework into a template specific to your organization or domain. It should be adjusted to fit your needs, but be careful to preserve the core structure:

Phases. Complete each of the four phases one at a time, in series. The individual tasks that make up each phase may be done in parallel, but do not start the next phase until the team agrees on the content of the previous one.

Rules. Follow them to avoid the negative consequences of bias and the human tendency to forget things. The first rule is always: do not blame people, things, or the environment. Note that this does not mean you can’t mention human error or other environmental context, only that you cannot declare them to be at fault for a negative outcome.

Reviews. The analysis should be reviewed at the end of each phase so that the collective sum of the team’s knowledge is leveraged and so that each member of the team (including the management team) feels that they had an equal opportunity to correct the record. Four all-hands review meetings may seem painful, but I promise you the alternative is worse: you will be subject to months of change requests from various parties, each contradicting the other in a vicious cycle until despair sets in and team morale is but a faint memory. Yes, I did experience this, and no, it was not effective.

Corrective Actions and Lessons Learned. Corrective Actions are concrete tasks that can be executed to prevent a recurrence of the failure. Lessons Learned are items of knowledge that can be applied in the future to improve the performance of the team. These are the principal artifacts of this process, so it is important that they are clear and concise. Ensure, also, that you follow up by executing the actions, disseminating the lessons, and updating your organization’s best practices in order to avoid similar outcomes in the future.

Phase 1: Factual Information

Rules:

Capture Data. Logs, support tickets, metrics, photos, code reviews, designs, documentation, etc. Whatever data you have that is related to the failure should be collected. Digital data should be backed up – often things like logs are only held for a short time.

Create a timeline. The timeline should account for all events, decisions, and actions taken related to the impact, detection, mitigation, and primary cause. Make sure that there are no gaps in the timeline. The maximum acceptable gap depends on the domain, so it’s up to you. I would start with 10 minutes.
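A gap check like the one described above is easy to automate. The following is a minimal sketch, assuming the timeline is kept as a list of (timestamp, description) pairs; the function name `find_gaps`, the 10-minute default, and the sample entries are all invented for illustration.

```python
from datetime import datetime, timedelta


def find_gaps(events, max_gap=timedelta(minutes=10)):
    """Return (earlier, later, gap) for each pair of consecutive
    events that are separated by more than max_gap."""
    ordered = sorted(events, key=lambda e: e[0])
    gaps = []
    for (t1, d1), (t2, d2) in zip(ordered, ordered[1:]):
        if t2 - t1 > max_gap:
            gaps.append((d1, d2, t2 - t1))
    return gaps


# Hypothetical timeline entries, not taken from any real incident.
timeline = [
    (datetime(2020, 1, 15, 22, 4), "First failed call"),
    (datetime(2020, 1, 15, 22, 12), "First customer complaint received"),
    (datetime(2020, 1, 15, 22, 40), "On-call engineer paged"),
    (datetime(2020, 1, 15, 22, 47), "Rollback started"),
]

for earlier, later, gap in find_gaps(timeline):
    print(f"Gap of {gap} between '{earlier}' and '{later}'")
```

Each reported gap is a prompt to go back to the data and ask what happened in that window, not a failure in itself.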

Visualize metrics. Create graphs to exhibit data relevant to your failure.

Graph Rules:

Quantify the impact and recovery. Use concrete numbers and precise times. Write about how the service’s end-consumer would have experienced the impact, not how your team experienced it.

Bad:

Our logs showed that all calls failed for an hour and a half in the evening. We received 61 complaints during that time.

Good:

3,452 customers tried and failed to make 3,975 calls between 22:04 and 23:32 ET. When a customer’s call failed, they heard, “Sorry, there was a problem. Please try your call again later.” 48 customers called the help line and 13 used the online contact form to register complaints relating to this outage between 22:12 and 23:35 ET.
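Numbers like those in the good example usually come straight from the captured data. Here is a rough sketch of how they might be derived, assuming call records are available as (timestamp, customer ID, succeeded) tuples; the record layout, the function name `summarize_impact`, and the sample data are hypothetical.

```python
from datetime import datetime

# Hypothetical call records: (timestamp, customer_id, succeeded).
records = [
    (datetime(2020, 1, 15, 22, 4), "c1", False),
    (datetime(2020, 1, 15, 22, 5), "c1", False),
    (datetime(2020, 1, 15, 22, 30), "c2", False),
    (datetime(2020, 1, 15, 23, 32), "c3", False),
    (datetime(2020, 1, 15, 23, 40), "c1", True),
]


def summarize_impact(records):
    """Count failed calls and distinct affected customers, and find
    the first and last failure times. Returns None if nothing failed."""
    failures = [r for r in records if not r[2]]
    if not failures:
        return None
    return {
        "failed_calls": len(failures),
        "affected_customers": len({r[1] for r in failures}),
        "first_failure": min(r[0] for r in failures),
        "last_failure": max(r[0] for r in failures),
    }


summary = summarize_impact(records)
print(
    f"{summary['affected_customers']} customers tried and failed to make "
    f"{summary['failed_calls']} calls between "
    f"{summary['first_failure']:%H:%M} and {summary['last_failure']:%H:%M}."
)
```

Counting distinct customers separately from failed calls matters because one customer retrying many times would otherwise inflate the apparent reach of the outage.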

Quantify the response. Use concrete numbers and precise times. Document:

Note that the direct cause is distinctly not the root cause.

Direct cause
The most direct cause of the failure. This might be human error.
Root cause
The underlying cause(s) of the failure. This is never human error. Refer to Phase 3.

Review. Gather the team and review the facts of the event. Everyone involved should have the opportunity to inspect and correct the record before assessment begins. Each fact must be corroborated by data. Reviewers should, in particular, ensure that the rules have been followed and that checklists have been completed. If there is any data missing, cut a Corrective Action to ensure that a system is in place to gather that data in the future.

Phase 2: Self-Assessment

Rules:

Detection. What could we have done differently to detect this failure sooner? Could we have detected it before it had any impact?

Mitigation. What could we have done differently to mitigate the impact of this failure faster?

Diagnosis. What could we have done differently to diagnose the direct cause of the failure faster?

Prevention. What could we have done differently to prevent the direct cause of the failure? Could we have prevented it before it had any impact?

Blast Radius. What could we have done to reduce the number of people affected by the impact of this failure? In this section, be extra careful not to discuss things that belong under another heading. Focus only on strategies that could have reduced the number of people affected given that the impact, detection, and mitigation played out the way they did.

Review. Ideally, the whole team was already involved in brainstorming this content. Even so, gather everyone when it is finished and have them certify that it is accurate and complete. Reviewers should ensure that the rules have been followed, and that Corrective Actions and Lessons Learned have been recorded.

Phase 3: Root Cause Analysis

Rules:

The purpose of Root Cause Analysis (RCA) is to analyze the how-and-when data from Phase 1 along with the assessment from Phase 2 in order to find the sometimes-unintuitive fundamental reasons why the failure occurred. A dangerous misconception about RCA is that its goal is to find a single root cause. In fact, the goal is to find all of the root causes so that the maximum amount of value may be extracted from the failure being investigated.

Tip: If you don’t already have an RCA strategy, start with the 5-whys, but make sure you are aware of its common pitfalls. One tactic is to identify all of the reasons why something happened, rather than just one. Then, ask why again for each of those reasons, again finding all of the reasons for each, and so on five or more times. This produces a “why tree”, with each branch representing a series of contributing causes, one or more of which is a root cause. For an even more robust analysis, have multiple small groups try it separately and combine their results at the end.
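The “why tree” described above maps naturally onto a small recursive data structure. This is one possible sketch, not a prescribed format: the `Why` class, the `branches` helper, and the example reasons are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Why:
    """One node in a why tree: a reason, plus all of the reasons for it."""
    reason: str
    causes: List["Why"] = field(default_factory=list)


def branches(node, path=()):
    """Yield every chain of reasons from the top of the tree to a leaf."""
    path = path + (node.reason,)
    if not node.causes:
        yield path
    else:
        for cause in node.causes:
            yield from branches(cause, path)


# Hypothetical analysis, invented for this example.
tree = Why("Bad code reached production", [
    Why("The bug was not caught in code review", [
        Why("Reviewers had no checklist for configuration changes"),
    ]),
    Why("Automated tests did not cover the failing path", [
        Why("No integration test environment existed"),
        Why("Test coverage was not tracked"),
    ]),
])

for chain in branches(tree):
    print(" -> ".join(chain))
```

Printing each branch as a chain makes it easier to review the tree as a group and to merge the trees produced by separate small groups.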

Another tip: It may be that there were multiple direct causes for the failure, or that there were secondary failures with separate direct causes, such as a failure to mitigate the problem in a reasonable amount of time. If this is the case, you will benefit from doing a separate RCA for each direct cause, rather than trying to bundle it all into one complex analysis.

The term root cause has many definitions, but the best one I’ve found is this:

Root causes are underlying, are reasonably identifiable, can be controlled by management, and allow for generation of recommendations.

— Rooney and Vanden Heuvel in Root Cause Analysis For Beginners, 2004

Root causes are underlying, in that they directly or indirectly caused the failure; reasonably identifiable, in that they don’t cost too much time or money to uncover; controllable, in that things like human error, the weather, or the economy are not valid root causes; and allow for recommendations, in that each root cause can be easily converted into a recommendation for a concrete corrective action.

The benefit of this definition is that it precludes attribution bias and forces us to self-assess and consider only the factors that we can influence. In this case the self-assessment is applied at the group level, but the benefit is the same. Once each root cause is identified, create a concrete Corrective Action that would prevent a recurrence of the problem. If you find this hard, it may be because what you found was not a root cause — reconsider the above definition and adjust your analysis.
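The four criteria in the definition can be treated as a checklist to run each candidate through. The sketch below assumes the team answers each criterion by hand and records the answers; the `Candidate` class, the `failed_criteria` function, and both example candidates are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A proposed root cause, with the team's answer for each criterion."""
    description: str
    underlying: bool
    reasonably_identifiable: bool
    controllable: bool
    corrective_action: str  # empty if no concrete action could be written


def failed_criteria(candidate):
    """Return the names of the criteria this candidate does not meet."""
    checks = {
        "underlying": candidate.underlying,
        "reasonably identifiable": candidate.reasonably_identifiable,
        "controllable by management": candidate.controllable,
        "allows a recommendation": bool(candidate.corrective_action.strip()),
    }
    return [name for name, ok in checks.items() if not ok]


# Hypothetical candidates, invented for illustration.
typo = Candidate("An engineer made a typo", True, True, False, "")
staging = Candidate(
    "Deployments skip the staging environment", True, True, True,
    "Require a staging deployment before every production deployment",
)

for c in (typo, staging):
    problems = failed_criteria(c)
    verdict = "root cause" if not problems else "not a root cause: " + ", ".join(problems)
    print(f"{c.description}: {verdict}")
```

A candidate that fails any check is a signal to keep asking why, rather than something to discard outright.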

Review. Gather the team once again, and ensure that each member agrees that all root causes have been found. Reviewers should ensure that the rules have been followed, and that Corrective Actions and Lessons Learned have been recorded.

Phase 4: Report

Rules:

In this phase, summarize your findings in a report that can be disseminated throughout the organization. This phase is important because the report is the vehicle by which the knowledge is shared, compounding the value you’ve extracted from the negative outcome.

Ensure that the document is written to maintain a single cohesive narrative. If the report is to be written by more than one person, assign one person to be responsible for this quality. A cohesive narrative means using consistent tone, tense, and terminology in order to reduce the cognitive burden on the reader.

Organize the report into approximately the following sections:

Executive Summary. Summarize the entire document in a way that stands on its own. Nothing in the rest of the document should be a surprise if the reader has read this section. Include at least a paragraph for the following:

It helps to write this section last.

Background. Include as much technical background as needed to allow a reader with no prior knowledge of the service to read and understand the document. It is appropriate to include service descriptions, architectural diagrams, designs, code listings, etc.

Incident. A full, factual description of the event, including the timeline and data visualizations from Phase 1. Include references to Corrective Actions only if they relate to missing data and how better to capture it in the future.

Detection. Information from Phases 1 and 2 about how the failure was detected and what could have been done to detect the problem earlier. Include references to Corrective Actions and Lessons Learned.

Mitigation. Information from Phases 1 and 2 about how the impact to the end-consumer was mitigated, what could have been done to mitigate the problem faster, and what could have been done to reduce the blast radius. Include references to Corrective Actions and Lessons Learned.

Direct Causes. Information from Phases 1 and 2 about what the direct causes were and what could have been done to prevent them. It is appropriate to attribute the failure to human error in this section, as long as you don’t declare that person to be responsible. Include references to Corrective Actions and Lessons Learned.

Root Causes. Information from Phase 3 about the underlying root causes of the failure. Include references to Corrective Actions and Lessons Learned.

Corrective Actions and Lessons Learned. Enumerate all of the Corrective Actions and Lessons Learned in one place. It may help to do this first.

Appendices and Attachments. This section should contain for posterity all information that would otherwise disrupt the narrative of the document if it were included in-line. Include data supporting the facts from Phase 1, such that patterns of failure over time can be analyzed later.


Thanks for sticking around until the end of what became a rather long-winded article. I hope that you have learned as much in reading it as I did in writing it. Good luck!