Design Failure Mode and Effect Analysis (DFMEA) is a software engineering technique that can help validate design decisions or improve upon them. It takes your existing design and puts each component and link under a magnifying glass, running it through a what-if scenario. In this post, I will walk through a DFMEA of a fictional website and on-line store for a fictional florist. If you read my other blog, Applied Paranoia, you may already be familiar with that application.

The Crassula application uses a static website generated with Jekyll, and an Angular app embedded in that website for the purchasing workflow. The static front-end and Angular app are served out of an S3 bucket behind a CloudFront proxy; the back-end for the store uses some serverless functions (lambdas), an S3 bucket to download invoices from, and a NoSQL database. The whole thing is tied together using AWS’ Simple Queue Service and deployed using CloudFormation.

The S3 bucket that contains the front-end (static site and Angular app) is the first point of entry for any customer. The Angular app calls out to the API and implements the workflows, including downloading invoices from the second S3 bucket using pre-signed URIs, but unlike the static site it can handle temporary failures of the API or the S3 bucket with a modicum of grace.

To do a DFMEA on this application we go through three high-level steps, the last of which has sub-steps. Before we start doing that, though, we should set a goal. For purposes of this analysis, we’ll set a goal of 99.5% uptime. I’ve taken this number more or less arbitrarily, and getting to such a number is outside the scope of this post, but it will be needed later on.
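To make that goal concrete, it helps to translate the percentage into a downtime budget. A quick sketch (plain Node.js, nothing AWS-specific):

```javascript
// Rough downtime budget implied by an uptime goal, assuming a 30-day month.
// The 99.5% figure is this post's (arbitrary) goal.
function downtimeBudgetMinutes(uptimeFraction, days = 30) {
  const minutesInPeriod = days * 24 * 60;
  return minutesInPeriod * (1 - uptimeFraction);
}

console.log(downtimeBudgetMinutes(0.995)); // ≈ 216 minutes per 30-day month
```

At 99.5%, the budget is roughly 216 minutes (about 3.6 hours) of downtime per 30-day month; every component on a critical path eats into that budget.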

The steps are:

  1. list all the components in the solution
  2. create a diagram of the components and how they relate to each other
  3. for each component and communications link, determine what failure looks like:
    1. determine how likely it is to fail, and exclude anything that is too unlikely to be worth our time
    2. determine which remaining services are on the critical path for any service that has availability requirements, and exclude anything that isn’t
    3. determine what the failure mode is
    4. determine what the user-visible effect or business impact of failure is
    5. determine how that failure can be detected
    6. determine how the failure can be mitigated, within the design and from outside of the system
    7. determine how the failure can be remediated
    8. determine how failure can be recovered from

The intent is to answer three questions:

  • How does it fail?
  • How do I know it failed?
  • What do I do when I know it failed?

There is no question that things will fail: failure is inevitable. Success is not the absence of failure, but good management of failure conditions. We want to be methodical in this approach, but we also want to quickly exclude anything we can, to avoid wasting our time. Anything we deem not worth our time will be excluded from the analysis as soon as we can do so. We can always come back to it later.

Listing the components

The first step is to list all the components. In the case of this particular application, there are actually quite a few:

  • AWS CloudFront is used as a content delivery network and caching reverse proxy. It has a cache, so most hits on the website will not hit the underlying S3 bucket.
  • An S3 bucket is used to store the Jekyll-generated static site and the Angular-generated one-page store front-end as well as the associated static resources (such as images, style sheets, etc.).
  • AWS’ DNS service is used for the domain and sub-domains.
  • AWS’ PKI is used to secure the site. This is implemented using AWS Certificate Manager.
  • An API Gateway is used to serve the store application’s API.
  • A second S3 bucket is used to share confidential files (invoices etc.) with the user.
  • Several Lambda functions, all written in Node.js, are used. One of these is an identity-aware proxy, the others are business logic micro-services, and interface micro-services to third-party services.
  • The Amazon Simple Queue Service is used to allow the micro-services to communicate with each other.
  • Two services use a DocumentDB to store information about transactions etc.
  • AWS Cognito is used for identity management.
  • A third-party payment service is used to process payments.

The DevOps components are out of scope for this analysis because if they fail, it prevents the site from being updated for the duration of the failure, but the site is not operationally affected.

Having now identified the components to include in the analysis, we can go on to the next step: creating a diagram of the architecture.

Creating an architecture diagram

We start this step by just placing all the components we just identified on a diagram. It’s usually a good idea to use some type of light-weight software for this, such as draw.io or Lucid Chart. Things may need to move around a bit, so don’t worry about putting things “in the right place” just yet.

Next, we’ll add communications links to the diagram. To complete that picture, we need to add the client application as well as functional dependencies. Because this application is built using a micro-service architecture in which individual micro-services are de-coupled from each other using a message bus (SQS in this case), you’ll see there are actually very few direct dependencies and even fewer direct communications lines. For example, any API call from the client (shown in blue in the diagram) goes into the API gateway, which calls into the identity-aware proxy. That proxy, assuming authentication passes muster, will send a message onto the SQS bus for any POST or PUT request. For GET requests, it will fetch data from the associated DocumentDB, kept up-to-date by the aggregator service. If an entry is not found in the database but is expected to exist, the proxy will send a request onto the SQS bus and return a 503 error (in which case the front-end will retry, giving the services some time to respond and the aggregator to update the database).
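The proxy’s read path described above can be sketched as follows. The `db` and `bus` objects and the handler shape are illustrative stand-ins, not real AWS SDK calls:

```javascript
// Sketch of the identity-aware proxy's GET path: serve from the
// DocumentDB-backed cache; on an expected-but-missing entry, nudge the
// services over the bus and ask the front-end to retry via a 503.
async function handleGet(resourceId, db, bus) {
  const entry = await db.get(resourceId);
  if (entry) {
    return { status: 200, body: entry };
  }
  // Not in the cache yet: ask the owning service to (re)publish it,
  // then tell the client to retry once the aggregator has caught up.
  await bus.send({ type: 'refresh-request', resourceId });
  return { status: 503, retryAfterSeconds: 2 };
}
```

The 503 plus retry is what buys the front-end its "modicum of grace": a temporarily missing entry looks like a short delay, not an error page.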

As shown in the diagram, all lambda services in the application depend on the SQS service, which all services directly communicate with. Because of this, and because of the de-coupled nature of the application, we don’t really need to look at other types of dependencies for the lambda functions. They do warrant mentioning though (the application isn’t just lambda functions talking to each other over SQS after all) and there is at least one example of each in the diagram.

The first is data flow: from the user’s perspective, the four components they need to exchange data with are the DNS, CloudFront, Cognito, and the API Gateway. DNS and Cognito don’t really depend on anything outside of themselves, but CloudFront depends on the S3 buckets to get its data, and on the certificate management system to implement PKI. For those dependencies, the data flow is actually in the other direction, but I prefer highlighting functional dependencies over highlighting data flow in such cases.

The other flow we need to highlight, and the only dependency that gets red double-headed arrows in the diagram, is a control flow dependency: the identity-aware proxy depends on the Cognito identity management system and, at least for some workflows, calls into that service are functionally synchronous (that is: a call into the proxy will fail if a call into Cognito times out or fails). In the type of architecture we’re looking at here, such tight coupling is rare – as it should be. That is why those dependencies are highlighted with extra arrows between the identity-aware proxy and Cognito, where the proxy uses Cognito to validate and parse the claims in the bearer token.
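Because that coupling is synchronous, the proxy should bound the call with a timeout so a hung identity check fails fast instead of stalling the whole request. A sketch, with `validateToken` as a hypothetical stand-in for the Cognito call:

```javascript
// Bound a synchronous dependency with a timeout: if the identity service
// hangs, the proxy call fails promptly rather than waiting indefinitely.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('dependency timed out')), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

async function authenticate(token, validateToken, timeoutMs = 2000) {
  try {
    const claims = await withTimeout(validateToken(token), timeoutMs);
    return { ok: true, claims };
  } catch (err) {
    // The proxy call fails if the identity service fails or times out.
    return { ok: false, reason: err.message };
  }
}
```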

Now that we have all the dependencies in our diagram, we can proceed to the next step.

Exclude things too unlikely to fail

We will look at each component in the system design, see how likely it is to fail, and determine whether we should take a closer look at the failure modes. As we’re building the service on AWS, this mostly means looking at SLAs. We’ll exclude anything from our analysis that has an availability that is significantly higher than our 99.5% objective, or for which a failure would be effectively invisible to our users. Once we’re done with this step, we will update our diagram to match. Let’s treat the components in order, starting with CloudFront.

CloudFront
According to the Amazon CloudFront Service Level Agreement, AWS guarantees 99.9% average monthly uptime. CloudFront has more than 400 edge locations that, together, constitute a caching content delivery network. Because of the way CloudFront is set up, and because we essentially only use it to deliver static content (we're not using the edge lambda feature nor any of the other more advanced features of CloudFront), CloudFront can be excluded from further analysis.
Amazon S3
According to the Amazon S3 Service Level Agreement, the uptime guarantee offered by AWS depends on the service tier used. Following the dependency we highlighted in the previous step, failure of the S3 service can manifest in the following ways:
  • A user trying to download the front-end application and hitting a cache miss in CloudFront also hits a failure in the S3 bucket, resulting in a partial or complete failure to load. This is very unlikely, as it requires both a cache miss and a failure, but the effect is visible.
  • A user trying to download an invoice gets an error and has to retry. This is the most likely visible effect, as these files will not be cached by CloudFront.
  • The invoicing lambda fails to store a generated invoice PDF in the bucket. This is high-impact and would possibly require human intervention to manually retry the PDF generation and upload.
S3 has a 99.9% SLA, which is significantly higher than our target, but with these dependencies, the impact of failure is sufficiently high for us to keep S3 in the analysis.
Route53 (DNS)
AWS' Route53 has a 99.99% SLA. Additionally, DNS is a globally-distributed, caching database with redundant masters, so its failure is excluded from further analysis because it is just too unlikely.
AWS Certificate Manager (PKI)
CloudFront only needs access to the us-east-1 instance of the AWS Certificate Manager at configuration, so a deployment of the stack will fail if the certificate manager is not available, but there is no operational impact in that case. As we're not concerned with deployment failures for this analysis, we can therefore exclude the certificate manager.
AWS API Gateway
The AWS API Gateway has an SLA of 99.95%. If it does fail, the front-end (website and application) is still available and can therefore retry calls into the API if such calls time out or result in an HTTP 5xx error, indicating a failure in the back-end. That does require some foresight on our part, so we will need to include it in the analysis, even if the expected availability is very high.
Lambda functions
The lambda functions are much more likely to fail due to our own code misbehaving than they are due to some failure on Amazon's part, so regardless of the SLA Amazon offers, they need to be included in our analysis.
Simple Queue Service
The SQS bus is the glue that binds all the functions together. It is essential to the functioning of the application, so any failure of the service needs to be mitigated despite its 99.9% SLA.
DocumentDB
The application stores all of its data in DocumentDB. The service has a 99.9% SLA for availability, and backup capabilities with restore-from-snapshot functions. It maintains six copies of the data and allows you to control creating more copies and backups. As we use it to store all the data, we will include it in our analysis. As we're not talking about disaster recovery right now, we'll leave the backup capability out of scope, though, and concentrate only on availability issues.
Cognito
Cognito is used for Identity and Access Management for the application and is used whenever a user logs in. It has a 99.9% SLA. As it is on the critical path for anything that requires authentication (which is basically everything), we will include it in our analysis.
Third-party payment service
In this hypothetical application, let's assume we use a third-party payment service that provides a 99.5% SLA. We will include it in our DFMEA because payment is critical to the business.
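Before moving on, it is worth sanity-checking our goal against these SLAs. If the services on a critical path were strictly serial and failed independently, the best availability we could expect is the product of their individual SLAs. The sketch below uses the payment path as an illustration; the choice of which SLAs to compose is my own simplification:

```javascript
// Naive upper bound for a serial chain of independently-failing services:
// the product of their availabilities. Retries and caching can do better
// than this bound suggests; correlated failures can do worse.
function serialAvailability(slas) {
  return slas.reduce((acc, a) => acc * a, 1);
}

// Illustrative payment path: API Gateway, SQS, DocumentDB, Cognito,
// and the third-party payment service.
const paymentPath = [0.9995, 0.999, 0.999, 0.999, 0.995];
console.log(serialAvailability(paymentPath)); // ≈ 0.9915
```

The naive bound comes out around 99.15%, below our 99.5% goal, which is exactly why the mitigation strategies later in this analysis matter: the design has to do better than a blind serial chain would.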

Update the diagram – concluding the first step

With this in mind, we can now update the diagram and indicate what is, and what isn’t, included in our analysis. The nice thing about this type of diagram is that you can use the same diagram, and much of the same approach, for things like threat modeling and architecture review as well. We obviously won’t do that right now, but it’s good to keep in mind that the diagram we just created will be an asset for that type of analysis.

What we’ve just done is set the service level expectations for each of the services we depend on. We can also represent this as a table (usually a spreadsheet) and show what we intend to exclude from further analysis.

| Component | Group | Type | Description | SLE | Exclude |
|---|---|---|---|---|---|
| CloudFront | infra | CDN | CDN used by the application front-end, and for pre-signed URIs | 99.9% | x |
| AWS API Gateway | infra | API Gateway | API Gateway used for API management | 99.95% | |
| Identity-aware proxy | app | Lambda | Authenticating entry-point and policy enforcement point | 99.5% | |
| Aggregator | app | Lambda | Micro-service for front-end optimization | 99.5% | |
| Invoicing | app | Lambda | Micro-service for invoice generation | 99.5% | |
| Inventory | app | Lambda | Micro-service for inventory management | 99.5% | |
| Profiles | app | Lambda | Micro-service for customer management | 99.5% | |
| Payment | app | Lambda | Micro-service front-end to third-party payment service | 99.5% | |
| Order | app | Lambda | Micro-service for order management | 99.5% | |
| Simple Queue Service | infra | Bus | Bus used by micro-services to communicate with each other | 99.9% | |
| Data cache | data | DocumentDB | Cache used by aggregator and front-end for optimization | 99.9% | |
| Database | data | DocumentDB | Database used to store all business data | 99.9% | |
| Cognito | infra | IAM | Identity management service | 99.9% | |
| Payment service | N/A | Service | Third-party service used to process credit card payments | 99.5% | |
| Front-end bucket | data | S3 | Data store used to host the front-end | 99.9% | |
| Invoice bucket | data | S3 | Data store used to host invoice PDFs | 99.9% | |
| Route53 | infra | DNS | DNS service | 99.99% | x |
| Certificate Manager | infra | PKI | Certificate management service | 99.9% | x |

Validate critical paths

Our next step in the analysis is to determine which of the remaining components are on the critical path of a service that has high availability requirements. In our example, there are four services that we want high availability for: purchasing flowers, invoicing those flowers, updating the inventory, and processing payments for flower purchases.

At this point, we should look at what our PO has documented as user stories for these four services. Let’s take a quick look at a partial list:

  • “As a customer, I want to purchase a bouquet of flowers”
    • “As a customer, I want to browse the gallery of different types of flowers so I can choose which ones I want”
    • “As a customer, I want to choose which flowers I want in my bouquet”
    • “As a customer, I want to buy flowers for a specific event such as a wedding”
  • “As a customer, I want to update my profile”
    • “As a customer, I want to store my credit card information so I don’t have to enter it every time I buy flowers”
    • “As a customer, I want to change my E-mail address”
  • “As a wedding planner, I want to see my previous orders so I can update my bookkeeping”
  • “As a customer, I want to order the same bouquet I did a month ago”
  • “As a business, I want to make sure we get paid for the flowers we shipped”

As we run through the various user stories, we encounter our services:

AWS API Gateway

The API gateway is used, from a user's perspective, to check inventory, to filter what's shown, to request a purchase, and to authorize a payment. In each case, the user uses the front-end, which calls the API to request the purchase and to authorize the payment. The user is not directly involved in inventory management (which will restrict the flowers available for purchase and update the inventory when a purchase is requested, the latter being a side-effect of the request) or invoicing (which is a side-effect of the payment request, but does not have direct user interaction through the API).

In any case, the API Gateway is on a critical path and therefore has to remain in-scope for the analysis.

Identity-aware proxy
The identity-aware proxy gives the user access to everything behind the APIs, and is therefore on the same critical paths as the API gateway itself.
Aggregator
The aggregator is only used as an optimization to prepare the presentation shown by the front-end application. It never interacts directly with the user, nor is it used in any of our scenarios. If it fails, one of two things can happen: a call into the data cache DocumentDB may miss, resulting in a 503 error from the proxy and a "kick" over the SQS bus to update the cache, or a hit on the cache may give stale data. The latter case is manageable by adding a time-out to how long data stays in the cache, which may result in more 503 errors but also more consistency. This essentially becomes a note for the development team but is not an issue for this DFMEA. The aggregator can therefore be excluded from further analysis.
Invoicing
The invoicing service receives a purchase request and an authorization, both of which contain enough information to generate an invoice PDF and send a message back onto SQS. This is essentially what our customers want from the invoicing service, which places this micro-service on the critical path for that service.
Inventory
The inventory service is not directly addressed by the front-end for customers: it will publish updates on the bus picked up by the Aggregator to be put in the data cache. This way of working removes it from the critical path in every case except updating the inventory, which is still enough to keep it in-scope for our analysis.
Profiles
Whenever something is purchased and whenever someone is invoiced, the database containing user profiles is accessed, but the profile micro-service, which owns the data and is the only one to write to it, is not required for read-only access and therefore has no role in those transactions. It is therefore not on the critical path for our critical functions, from which profile updates are conspicuously missing. So, we can exclude it from further analysis, at least until our PO asks us to include profile updates in the critical scenarios.
Payment
The payment micro-service is a front-end to a third party service that does the actual payment processing. When it receives an order from the bus it will request payment and post the receipt back on the bus. As such, it is critical for payments to be processed, but it is not critical in any user flow (that is: payment has to be processed before the flowers are shipped, but the order will be accepted pending payment). Regardless of this caveat, it remains in-scope for the analysis.
Order
When a customer decides to order something, a message is sent over SQS, which is received by the order micro-service. This micro-service has read-only access to the inventory and profile databases, and triggers a number of transactions, all of which must eventually succeed for the right flowers to be delivered to the right place, at the right time: it enters the order in its own orders database, it sends a message over the SQS bus for the invoice to be generated, for payment to be processed, and for inventory to be reserved. This may all be the same message, announcing the new order to the world. Each of the services concerned with that event will know what to do with it. In any case, this puts the order micro-service squarely on the critical path.
Simple Queue Service
SQS is the glue that holds the application together: aside from a GET request that can be serviced directly from the data cache, everything goes through SQS one way or another.
Data cache
The data cache is an optimization mechanism, but it is also on the path for any user interactions. While any good optimization mechanism should be optional to the application and any failure of such a mechanism should be no more than a temporary nuisance, we need to include it in the analysis to make sure this is the case.
Database
The database we're referring to here contains three tables: profiles, orders, and inventory. The orders table is written to for purchasing, the inventory table is written to for any change in inventory, and the profiles table, while never written to in any of our critical scenarios, is still accessed in each one of them.
Cognito
Cognito is the AWS service we use for authentication and authorization, leveraging Cognito's User Pools and potentially third-party identity providers such as Google, Facebook, etc. It is used whenever a user logs in, and to authenticate the user-provided bearer token on every API call. Because of this, it is on the critical path for everything a user is involved in (purchasing and payment authorization), but not invoice generation. The pre-signed URI used for invoice delivery is generated by S3, not using Cognito, so in that particular workflow Cognito is also "off the hook".
Payment service
The external payment service is clearly critical because that is how we make money. It also has the added complexity of being an external service with only a front-end micro-service internal to our application.
Front-end bucket
In the majority of cases, CloudFront will not reach out to the front-end S3 bucket to service the static site or front-end application: it will almost always serve from its own cache. That means that the front-end bucket may be temporarily down without ever being noticed. Nothing in the application ever writes to the bucket, so as long as CloudFront can access it "occasionally" we can ignore this bucket for the remainder of the analysis.
Invoice bucket
The invoicing micro-service writes to this bucket to deposit invoices. The proxy service can then generate pre-signed URIs for those invoices when requested, with which the front-end can download those invoices. These pre-signed URIs are not directly made visible to users: they are redirected to from the APIs. This puts the bucket on the critical path for one of the four scenarios.

This leads us to the table below:

| Component | On critical path for purchase | On critical path for invoicing | On critical path for inventory | On critical path for payment | Critical |
|---|---|---|---|---|---|
| AWS API Gateway | x | | | x | TRUE |
| Identity-aware proxy | x | | | x | TRUE |
| Aggregator | | | | | FALSE |
| Invoicing | | x | | | TRUE |
| Inventory | | | x | | TRUE |
| Profiles | x | x | | x | TRUE |
| Payment | | | | x | TRUE |
| Simple Queue Service | x | x | x | x | TRUE |
| Data cache | x | | | | TRUE |
| Database | x | x | x | x | TRUE |
| Cognito | x | | x | x | TRUE |
| Payment service | | | | x | TRUE |
| Front-end bucket | | | | | FALSE |
| Invoice bucket | | x | | | TRUE |

There is no hard-and-fast rule on whether to exclude a service from further analysis: some of these are debatable. In this case, I’ve mostly looked at whether a user interacts with the service (directly or indirectly) when interacting with the application in one of the critical scenarios, but which scenarios are considered critical is more or less arbitrary (e.g. profile editing is excluded) and failure may well affect the application in these scenarios even when the user doesn’t directly interact with them. In this case, however, we’re considering a context in which we’ve never done a DFMEA for this particular application before, and we need to do one in a reasonable amount of time. We’re effectively waterlining our analysis and including only things that are “above the waterline”, knowing full well that the things we exclude for now may bob their heads above the water sooner or later.

Determine the failure mode

The next step in the analysis is to determine the failure mode of each identified component. For SaaS, PaaS, and IaaS components, this requires some analysis of the component’s documentation which will tell you, for example, that storage failure may result in storage becoming temporarily read-only, or becoming significantly slower than normal. I will not go into the details of each service the Crassula application uses, because that is not the focus of this post. What I will point out, though, is that those documented failure modes should inform your code’s design.

For example, storage temporarily becoming read-only may result in write operations to that storage failing. When that happens, depending on which part of the application you’re in and the code for that particular micro-service, that could result in any number of things: either the operation fails, failure is reported into some logging mechanism, and human intervention is needed to retry the operation; the operation fails but is kept alive and retried until it succeeds; the operation is canceled, the message the micro-service was acting on is either never consumed or put back on the queue, and it will eventually be tried again; or some other combination of retries, notifications, ignored errors, etc.
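One of those options, sketched in code: retry the write a few times, log each failure, and leave the message un-consumed (by rethrowing) so the queue redelivers it later. The `store` and `log` objects and the message shape are illustrative stand-ins:

```javascript
// A micro-service handling a message whose side-effect is a storage write.
// On repeated failure it rethrows, so the message is not consumed and the
// queue will eventually redeliver it.
async function handleMessage(message, store, log, attempts = 3) {
  for (let i = 1; i <= attempts; i++) {
    try {
      await store.write(message.orderId, message.payload);
      return 'consumed';
    } catch (err) {
      log({ level: 'warn', attempt: i, orderId: message.orderId, error: err.message });
    }
  }
  // Give up for now: surface the failure so the message stays on the queue.
  throw new Error(`write failed after ${attempts} attempts`);
}
```

Note that this only works cleanly if handling the message twice is acceptable, which is where the idempotency discussion later in this post comes in.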

Depending on the application and the use-case, any one of these options may be acceptable, or human intervention may never be acceptable. It really depends on what the business impact of failure is – which will be our next question.

The same goes for failures of the API Gateway, any of the Lambda functions, the Simple Queue Service, the DocumentDB, Cognito, and the third-party payment service: their documentation will tell you how they can fail, and how to detect such failures. Your architecture and your code will tell you how failures are handled, and whether there’s a trace of failure in your logs (which allows you to monitor the health of the system), whether a human needs to be made aware of the failure, etc.

As you’re reviewing the documentation of each of these services, you’ll likely want to capture some testable functional requirements around error handling and you may find you want to use some libraries to wrap the service APIs to encapsulate that error handling. This is one of the reasons why doing this analysis relatively early in the development process, if possible, is beneficial.

Determine what the user-visible effect or business impact of failure is

If at all possible, when a user has submitted their request to buy a beautiful flower and the website has accepted that request, the user should be confident that the beautiful flower they chose to buy will be delivered to them (or their sweetheart) in due time. They will be charged for that service, but their job is done at that point.

Of course, we know that once every 2000-or-so orders, something may go wrong: the invoice may not have been generated even though the order went through, was paid for, and the flowers were delivered, there may have been a behind-the-scenes intervention by a human to get the payment settled, etc. This may not really be a problem: perhaps most of our customers never look at their invoices anyway. It may also be a huge problem if that error happens when a wedding planner can’t get their invoice for the thousands of flowers they bought for the royal wedding, and because of that can’t get paid for their own services or fail an audit.

This part of the analysis, then, is to determine two things: for each of the failure modes identified, how would a user, a paying customer, the person we don’t want to piss off, be impacted; and what is the business impact on our company? Essentially, this tells us the cost of failure.

Again, I won’t go through the entire application step-by-step for this post, but we’ve already seen the partial list of use-cases from our PO. This analysis requires a bit of a deeper dive: it requires you to step through the workflows for each of those use-cases, determine whether the failure modes you’ve identified affect those use-cases or that workflow, and decide how bad that would be for your user, and for the company.

We’ll also find, through these scenarios, that our application’s observability may become very important: failures that require human intervention may require a human being notified, getting an SMS or a push notification.

Detecting failure

Any error or failure that is not handled in the application in such a way that there is no impact to the end-user (e.g. by successful retries) should be logged with sufficient information for the effects to be mitigated or, if there is an underlying bug, for that bug to be found and fixed. Those logs should contain enough information for customer complaints to be traced back to original failure, and should be machine-readable as well as human-readable.

This is both harder and easier than it seems: it is not that hard to catch every error and log it, it is also not that hard to assign a correlation ID, or a tracking ID, to every request and include it in every log pertaining to that request. Technically, this is all feasible and fairly straight-forward. Where it becomes more complicated is when you’re not in control of all of the software you’re using in your application – and you almost never are. Software tends to hide failures, and hide pertinent information about failures. Some errors are ignored by default, especially if they don’t result in exceptions, and many errors are explicitly ignored by a “catch all” construct that will simply pretend nothing happened.

Failure detection is also hard to test, because most failures are unexpected. If you haven’t done the failure mode analysis up-front and are doing it after the fact, you are likely to have missed failure modes in your unit tests, code reviews, etc.

Many IaaS, PaaS, and SaaS services generate their own logs as you use them, so outside of your application code there may be a treasure trove of logs with detectable errors that you can tie back to your own failure logs with those same, system-generated, correlation IDs. There are also log analysis tools like CloudWatch, as well as third-party tools, that can be part of a monitoring solution.

Aside from logging, there are other tools you can employ to make sure your application is still running. You could, for example, set up availability probes to make sure your micro-services are still all healthy by having them all report the version of the running software and the health of the underlying resources. You could also implement regular “synthetic transactions”: real transactions that just won’t end up with a flower being delivered because the delivery address is the shop’s own address and your employees know what to do with those particular orders. Depending on how “deep” you go with synthetic transactions, how much work is actually done for each of them, there may be a cost that may make it prohibitive for automation. For example, if your synthetic transaction really charges $100 to your company credit card and you therefore still have the third-party payment service’s service charge to pay, you may only want to do it if you have a doubt about some part of the system working correctly.
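An availability probe along those lines might look like the sketch below; the dependency checks are hypothetical async functions, one per underlying resource:

```javascript
// Each micro-service reports its version and the health of the resources it
// depends on; a check resolving false means degraded, a throw means
// unreachable.
async function healthProbe(version, checks) {
  const results = {};
  for (const [name, check] of Object.entries(checks)) {
    try {
      results[name] = (await check()) ? 'healthy' : 'degraded';
    } catch (err) {
      results[name] = 'unreachable';
    }
  }
  const healthy = Object.values(results).every((s) => s === 'healthy');
  return { version, healthy, dependencies: results };
}
```

Reporting the version alongside health is cheap and pays off during deployments: a half-rolled-out fleet is immediately visible.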

Regardless of how failures are detected (application logs, resource logs, synthetic transactions, availability probes, etc.), there are two things you will want to monitor: you’ll want to be sure that you are notified if (and only if) human intervention is needed, and you’ll want to be able to see, at a glance, whether your system looks healthy. If you see a trend of certain types of failures that may lead to your online shop going down, you’ll want to know about it before any significant events (Valentine’s day, Christmas, etc.) occur.

Mitigation, remediation, restoration

Once you’ve figured out how things can fail and how you know they failed, you need to decide what to do when you know they failed. There are three categories of things you can do: you can limit the fall-out, accepting that things can and will fail and actively limiting the impact of such failures; you can try to make sure it never happens again, completely eliminating the threat of that particular failure mode; or you can accept that the failures will happen with the impact they have, and fix whatever impact that is when it happens. These three categories are mitigation, remediation, and restoration.

When you’re looking for mitigation strategies, the low-hanging fruit is usually bunched up inside the application: you’re looking to reduce the severity or impact of a failure. The most obvious approach to this is to retry. This is easiest to do if the action is idempotent: if doing something twice does not cause the effect twice, you can retry as often as you like.

Idempotent actions are actions that have no additional effect when performed a second time. For example, emptying a coffee cup into a sink is idempotent, because if you do it a second (or third) time, the cup is already empty so no coffee is actually moved into the sink, but the operation is still successful. Taking a sip of coffee, on the other hand, is not idempotent because at some point you will run out of coffee and the operation will fail. Retrying it in that case will result in the same failure, again and again (so failure is idempotent whereas success is not).

In software things are sometimes a bit more complicated: whether or not an operation is idempotent often depends more on which effects you care about than on the actual effects. Touching a file, for example, will create it if it didn’t exist and update its last-modified timestamp if it did. If you only care whether the file exists, touching it is idempotent. If you care about the timestamp, it is not.
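This distinction can be demonstrated directly. A small, self-contained sketch using Python’s `pathlib` (the file name is arbitrary):

```python
from pathlib import Path
import os
import tempfile

path = Path(tempfile.mkdtemp()) / "invoice.txt"

path.touch()   # creates the file
path.touch()   # idempotent w.r.t. existence: the file is simply still there
assert path.exists()

os.utime(path, (0, 0))                # pretend it was last modified in 1970
before = path.stat().st_mtime
path.touch()                          # updates last-modified to "now"
assert path.stat().st_mtime > before  # not idempotent w.r.t. the timestamp
```

Both asserts pass: the same operation is idempotent under one definition of “effect” and not under another.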

Still, idempotent or not, the side-effects of retrying a partially-successful action are often less undesirable than the effects of leaving it partially failed. When that is the case, retry.

There are, of course, mitigation strategies other than retrying (and retrying too often while expecting a different result is the popular definition of insanity). In the Crassula application, however, we don’t have any services that need a “plan B” as mitigation: if an order fails to be put in the database, don’t consume the message from SQS. If the proxy fails to send a message, retry or return the condition to the caller. Similar mitigation tactics throughout the application add up to a single mitigation strategy for the application as a whole: retry, then gracefully fail and report failure.
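The “retry, then gracefully fail” strategy can be sketched as a small helper. This is illustrative only: `TransientError`, the bounded-attempt policy, and the backoff values are assumptions, not part of the Crassula code.

```python
import time

class TransientError(Exception):
    """A failure worth retrying (hypothetical; stands in for e.g. a timeout)."""

def with_retries(action, attempts=3, base_delay=0.01):
    """Retry a bounded number of times, then fail gracefully by re-raising
    so the caller can report the failure (or, for an SQS consumer, simply
    not delete the message so it becomes visible again)."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except TransientError:
            if attempt == attempts:
                raise  # give up: surface the failure to the caller
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff

# A flaky action that fails twice, then succeeds:
calls = {"n": 0}
def store_order():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("database temporarily unavailable")
    return "order stored"

print(with_retries(store_order))  # → order stored
```

Note that the exponential backoff matters: hammering a struggling dependency with immediate retries tends to prolong the very outage you are trying to ride out.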

Remediation is different from mitigation in two respects: the first is its objective, the second is how it is implemented. The objective of remediation is to undo the effects of faults that could not be mitigated, and to try to make sure they don’t happen again. As such, within the application there is only so much we can do: if the fault is a failure to meet a performance requirement, it may be remediated by deploying additional resources, adding scalability and elasticity to the application, but that has so far been outside the scope of this analysis and will, for this post, remain so. For other types of failures, the application is generally limited to producing an accurate and complete account of the error it detected: what it was doing, why it was doing it, when the incident occurred, and how it failed. Once all of that has been accounted for (logged), a human can pick up the pieces, determine whether the failure was due to a bug and, if so, repair it, and manually perform whatever action failed.
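Such an account is easiest for a human (or an alerting rule) to act on when it is structured. A minimal sketch, assuming a hypothetical `log_failure` helper; the logger name and field names are my own, not from the post:

```python
import datetime
import json
import logging

logging.basicConfig(level=logging.ERROR)
log = logging.getLogger("crassula.orders")  # hypothetical logger name

def log_failure(action, reason, error):
    """Record what we were doing, why, when it happened, and how it failed,
    so a human can later pick up the pieces (hypothetical helper)."""
    record = {
        "action": action,  # what we were doing
        "reason": reason,  # why we were doing it
        "when": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "error": repr(error),  # how it failed
    }
    log.error(json.dumps(record))
    return record

record = log_failure("store order", "customer completed checkout",
                     ValueError("connection refused"))
print(sorted(record))  # → ['action', 'error', 'reason', 'when']
```

Emitting the record as JSON rather than free-form text means a log-aggregation query can later group failures by action or error type, which is exactly the trend analysis restoration (below) depends on.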

This last bit is close to, but still different from, restoration. Restoration is the process of returning a system to a previous functional state. Effective restoration requires the DevOps team to take a step back, see that the application is crumbling, determine why that is, and act. This means they need to look not only at the current errors, but at the trend, which requires a level of observability that many applications lack. Is there an unusually high load because Valentine’s Day is just around the corner? Is there a recent change that introduced a new bug? Would rolling back yesterday’s Friday afternoon deploy break more than it fixes?2 Did those idiots really deploy on a Friday afternoon ahead of a long weekend without telling me? I had plans, dammit!


A design failure mode and effect analysis allows us to answer the question “How does it fail?” and highlights the importance of the question “How do I know it failed?” It emphasizes observability, which is needed whenever remediation or restoration is called for (that is, whenever mitigation fails or is inadequate). It also highlights mitigation, which in turn is something you can look for in code reviews and test for in unit and integration tests. DFMEAs do take time and effort, require insight into the application architecture and, to an extent, its code, and may lead to code changes, process changes, and so on. That is not a reason not to do one, but it is a reason to plan for it.

  1. The number defines a monthly “downtime budget”. A 99.5% uptime objective sits between “two nines” (99%) and “three nines” (99.9%). As a rule of thumb, you can assume that every nine you add (99.9% is three nines, 99.99% is four nines, etc.) multiplies the cost of your solution by ten. A 99.5% uptime objective gives you a downtime budget of 3 hours and 36 minutes per 30-day month. Alternatively, you can see it as permission to fail once in every 200 queries. There is no such thing as 100% uptime. 

  2. I shamelessly stole these questions from Google’s excellent resources on SRE.