Fail Loudly, Or Don't Fail At All

Summary

  • Accept that the products and services we build will inevitably fail despite our attempts to avoid and manage for failure.
  • The cost of failing, especially failing poorly, isn't necessarily intuitive to product and engineering organizations because it isn't apparent until a failure occurs, while the cost of a missing feature is apparent every time a salesperson loses a sale.
    • There is a definite cost to failing silently, which can be calculated, especially at scale.
  • When products do fail, as they inevitably will, we should ensure that they recover gracefully, and if that is not possible, that they alert either the user or the provider so the failure can be resolved as quickly as possible.
  • Failing loudly, and minimizing the time between failure and alert, is critical because it limits the potential damage to both user and provider.
    • By contrast, failing silently is expensive!

Detailed Discussion

  • When building a product or service, it is important to think not only about whether a given system scales appropriately with enough provisioned resources, but also about what happens when events outside of our control disrupt the availability of the products and services we provide.
  • Thinking about graceful failure isn't always intuitive for product and engineering teams, because in most cases, those teams are oriented towards building new features, capabilities and solutions. Graceful failure planning can take a backseat, just as security often can, because the cost of failure or a security breach isn't apparent until it occurs, while the cost of not having a given feature or solution is readily and consistently apparent in every sales call or product demo.
  • Designing for graceful failure is even less intuitive, as it requires not only understanding where and how your systems can fail, but also spending time on additional features or processes to handle those failures gracefully.
  • If your team operates at a large enough scale, with enough errors occurring for whatever reason, it becomes possible to argue for resources to build graceful failure processes, because you can readily see the cost of graceless failure.
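
To make that argument concrete, the cost of graceless failure can be roughed out with simple arithmetic. The sketch below is only an illustration: the class name, the function name, and every figure in it are hypothetical placeholders, not measurements from any real organization.

```python
from dataclasses import dataclass


@dataclass
class SilentFailureCost:
    """Per-incident cost assumptions. All values are hypothetical placeholders."""
    support_cost: float    # cost to service the resulting support contact
    compensation: float    # fees refunded plus any goodwill credit
    indirect_cost: float   # estimated reputation or churn cost per incident

    def per_incident(self) -> float:
        # Total cost of one silent failure for one customer.
        return self.support_cost + self.compensation + self.indirect_cost


def annual_cost(cost: SilentFailureCost,
                customers: int,
                incidents_per_customer_per_year: float) -> float:
    """Scale the per-incident cost by customer count and incident rate."""
    return cost.per_incident() * customers * incidents_per_customer_per_year


if __name__ == "__main__":
    # Assumed inputs for illustration only.
    example = SilentFailureCost(support_cost=25.0,
                                compensation=100.0,
                                indirect_cost=200.0)
    yearly = annual_cost(example,
                         customers=1_000_000,
                         incidents_per_customer_per_year=0.01)
    print(f"Per incident: ${example.per_incident():,.2f}")
    print(f"Per year:     ${yearly:,.2f}")
```

Even with a low assumed incident rate, the yearly figure is the number to compare against the engineering cost of alerting the customer in the first place.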

Examples

Let's run through a couple of examples of graceless failure, both of which happened to me recently.

  • Bank Bill Pay
    • Issue - My bank's bill pay feature is extremely convenient when it works correctly. I had it scheduled to pay a recurring bill by sending a check to an individual each month. In this instance, all of my outstanding bills/checks were canceled because a piece of mail from my bank was returned to them as undeliverable. As a consequence, I had to pay a late fee and risked reputation damage as a late payer.
    • Failing Quietly - I was never notified that a piece of mail was undeliverable, except by mail. Notice any issue there? A notice mailed to a theoretically undeliverable address will never reach me. My bank also failed to call, secure message, or email me to say that this piece of mail had been returned or that this would affect the BillPay application and my outstanding checks. I only found out from the recipient, when my check bounced.
    • Consequences - Because I was never notified, I had to call my bank (a cost to them to service the call). Under their guarantee, they compensated me for the late fees incurred plus additional grace compensation ($100 USD). They also suffered reputation damage.
      • Direct cost of this one incident for one customer is likely $125-$150. Indirect cost is likely between $200 and $500.
    • Ideal Scenario - If a piece of mail was undeliverable, I should have been notified on all channels that this event A occurred, that it would affect B products, and that to resolve this issue, I should call C contact. If that had occurred, I likely could have resolved it with a single phone call and my bank wouldn't have incurred a direct cost or reputation cost. Given my bank has millions of customers, if even one of these errors occurs for each customer once per year, the bank is looking at millions in costs that could have gone straight to their bottom line as profit.
  • Credit Card Fraud
    • Issue - Credit cards are great financial tools as they limit your exposure to fraud, which occurs constantly. In this instance, one of my credit cards was flagged for fraud and canceled with a replacement sent out.
    • Failing Quietly - I was never notified via any channel of the cancellation or that a replacement had been sent out. I only found out when I went to use my card and the charges failed to go through. I was also never told where the new credit card was sent or given a tracking number to ensure I received it. I ultimately never received it and was never able to find out where it was sent.
    • Consequences - My charges failed to go through when I tried to use the card, and I had to take time to fix the issue with the CC issuer.
      • Direct Costs - In addition to the direct costs of sending out multiple new cards and handling my customer support calls (~$50), this CC issuer also suffered unnecessary reputation damage for poorly handling the card replacement procedure.
    • Ideal Scenario - The credit card company identifies that a card needs to be canceled, then calls, direct messages, or emails the card owner to say that the card has been canceled, including the tracking number or estimated delivery date for the new card and a number to call with any questions or concerns. This entire scenario could be automated at relatively little cost to the company.
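
Both ideal scenarios follow the same pattern: say what happened, what it affects, and whom to contact, on every available channel, and raise an alarm if no channel works. Here is a minimal sketch of that pattern. All names in it (`build_message`, `notify_all`, `NotificationFailed`, the channel functions) are hypothetical, and a real system would plug in actual email, SMS, or phone integrations.

```python
from typing import Callable, Dict, List

# A channel is any callable that takes a message and raises on failure.
Channel = Callable[[str], None]


class NotificationFailed(Exception):
    """Raised when no channel could deliver the message."""


def build_message(event: str, affected: List[str], contact: str) -> str:
    # State the event (A), the affected products (B), and the contact (C).
    products = ", ".join(affected)
    return f"{event}. This affects: {products}. To resolve it, contact {contact}."


def notify_all(channels: Dict[str, Channel], message: str) -> List[str]:
    """Try every channel. Return the names that succeeded; raise if none did."""
    delivered: List[str] = []
    failures: Dict[str, str] = {}
    for name, send in channels.items():
        try:
            send(message)
            delivered.append(name)
        except Exception as exc:  # one failed channel must not stop the others
            failures[name] = str(exc)
    if not delivered:
        # Fail loudly: the provider needs to know the customer was not reached.
        raise NotificationFailed(f"all channels failed: {failures}")
    return delivered


if __name__ == "__main__":
    def postal_mail(message: str) -> None:
        # Simulates the bank example: the address on file is undeliverable.
        raise RuntimeError("address undeliverable")

    def email(message: str) -> None:
        print(f"[email] {message}")

    msg = build_message(
        "A piece of mail we sent you was returned as undeliverable",
        ["bill pay", "outstanding checks"],
        "customer support",
    )
    print("Delivered via:", notify_all({"mail": postal_mail, "email": email}, msg))
```

The design choice that matters is the last branch: a notification that cannot be delivered on any channel is itself a failure, and it should surface to the provider instead of disappearing.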

Primary Takeaway - Failing silently is expensive!