Blameless postmortems

Organizations that practice DevOps aim to treat mistakes and errors as opportunities to learn. Holding blameless postmortems on outages and accidents is part of that goal.

Having a just culture means making an effort to balance safety and accountability. When an organization investigates mistakes in a way that focuses on the situational aspects of a failure and on the decision-making process of the individuals closest to it, it can come out safer than it would if it had simply punished the people involved.

A blameless postmortem means that engineers whose actions contributed to an accident can give a detailed account of:

  • What actions they took at what time.
  • What effects they observed.
  • What expectations they had.
  • What assumptions they made.
  • Their understanding of the timeline of events as they occurred.

It's important that they can give this detailed account without fear of punishment or retribution.

An engineer who thinks they're going to be reprimanded has no incentive to give a realistic, accurate account of the problem. Not understanding how an accident occurred all but guarantees that it will happen again, if not with the original engineer, then with someone else.

"We must strive to understand that accidents don't happen because people gamble and lose. Accidents happen because the person believes that:

...what is about to happen is not possible,
...what is about to happen has no connection to what they are doing,
...or that the possibility of getting the intended outcome is well worth whatever risk there is."

Erik Hollnagel

Allow engineers to own their own stories

A funny thing happens when engineers make mistakes and feel safe giving details about them: they're not only willing to be held accountable, they're also enthusiastic about helping the rest of the company avoid the same error in the future. They are, after all, the ones with the most expertise when it comes to that error, so they ought to be heavily involved in coming up with the remediation.

How do I enable a "just culture"?

  • Encourage learning by having blameless postmortems on outages and accidents.
  • Remind yourself that the goal is to understand how an accident could have happened, so you're better equipped to prevent it from happening in the future.
  • Gather details from multiple perspectives on failures and don't punish people for making mistakes.
  • Instead of punishing engineers, give them the requisite authority to improve safety by allowing them to give detailed accounts of their contributions to failures.
  • Enable and encourage people who do make mistakes to be the experts who educate the rest of the organization on how not to make them in the future.
  • Accept that there's always a discretionary space where humans can decide to act or not act, and that the assessment of those decisions lies in hindsight.
  • Accept that hindsight bias can cloud our assessment of past events, so work hard to eliminate it.
  • Accept that the fundamental attribution error is also difficult to escape, so focus on the environment and circumstances people are working in when investigating accidents.
  • Strive to make sure that the blunt end of the organization (for example, boards or senior leadership) understands how work at the sharp end (for example, engineers and technology) is actually getting done, as opposed to how they imagine it's getting done through Gantt charts and procedures.
  • The sharp end must inform the organization where the line is between appropriate and inappropriate behavior. This isn't something that the blunt end can come up with on its own.

Failure happens. In order to understand how failures happen, we first have to understand our reactions to failure.