Get Error Handling Right

A peculiar side-effect of being human is that there’s some built-in optimism in us — when coding, the first tendency is always to code for the most generic, most positive outcome. Error handling often becomes an afterthought, something that has to be added or dealt with, because once you release your code to production, reality strikes.

The same pattern can be observed with UI designers, especially junior ones. When designing a screen with a list of items, a designer will most likely show a screen full of items, and stop there. What the same screen looks like when there are no items is an important part of the user experience, but because of the built-in human optimism, it just gets forgotten. And if your designer ever provides you with not just empty-list visuals but error-scenario visuals as well, value them: designers with this level of diligence are rare.

But it’s not just designers and programmers. All humans are tuned to think in terms of the optimistic case, and this makes implementing error handling correctly a lonely exercise in self-restraint and diligence. Your product manager, peers, and superiors will all see the optimistic case working, pat you on the back, and say “Great work, the product is almost ready!” If you now need two weeks to “deal with error handling that we left to the end”, you’ll be stuck working on something that most people are likely to view as low-value work that doesn’t affect the resulting product much.

The corollary of this is that you should get error handling right from the start. It should be a first-class citizen in your code, never an afterthought. That way, by the time the optimistic case is working and ready to be shown to colleagues, the error handling case is also implemented.

Let’s look at some concrete aspects of error handling and ideas for doing it right.

Why is it so difficult?

Programmers who haven’t figured out error handling tend to fall into two categories.

Category 1. Oblivious. These programmers don’t think error handling is difficult. Their approach to error handling typically amounts to “just do something to make the compiler shut up”. Concrete techniques include:

“just return null, let others deal with it”

function doSomething(item) {
    if (item === null) return null;
    // actually do something
    ...
}

“just rethrow, let others deal with it”

void doSomething(Item item) throws SomeCheckedException {
    try {
        doSomethingDangerous(item);
    }
    catch (SomeCheckedException e) {
        // rethrown unchanged: the catch block adds nothing
        throw e;
    }
}

“just throw, somebody will deal with it”

function doSomething(item) {
    if (somethingWrong(item)) throw "something is wrong"
    // actually do something
    ...
}

“ignore modelling by assuming the positive case”

let doSomething (maybeItem: Option<Item>) : unit =
    doSomethingWithItem (Option.get maybeItem)

This haphazard approach to error handling is dangerous because it’s easy to misattribute blame. When bugs manifest, the bug report gets assigned to the developer. The developer may eventually reproduce the bug and trace it through the code base, spending hours on it because of the convoluted, unpredictable nature of the error handling code. This time is essentially wasted, but it’s easy to justify it to oneself, and even feel good about the experience. “Oh, this one was a tricky bug! And I figured it out! I’m a ninja!” It’s easy to have a complete, settled world view where “bugs are tricky and take time to debug, it’s part of the job”. With such a world view, it is difficult to go beyond and learn to program better.

Category 2. Aware. These programmers know that it’s possible to do error handling well, and elegantly. They’ve seen it done. But they haven’t personally done it enough to really internalize the tools and primitives. So whenever an error scenario that needs handling crops up, it causes much anxiety. They know that there’s a right way to deal with the error, but figuring out what that right way is takes considerable effort. And when they are low on energy, a simple throw “deal with error handling” is just a few keystrokes away, making all the problems go away for the moment. The guilt, though, remains.

I think one reason doing error handling well is difficult is that we are never taught it explicitly. The “interested in the optimistic case” human nature comes up during education as well: it’s unlikely you can keep students focused and excited when the subject is not “building an actionable feature”, but “handling error cases”. So we’re never taught a systematic approach to error handling, which is a great shame.

Another reason error handling is difficult is that few go through the full process of figuring out what “error handling” actually entails. One needs to understand what types of errors may come up in a system, what options for handling them exist, and what goals each option serves.

Types of errors and handling options

There’s not a lot one can do when an error happens. The choices are:

do nothing and let the error propagate, causing whatever may happen
if the interaction that resulted in the error started with the user, show them an error message
automatically retry in the hope that the error was intermittent
recover from the error in some way
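
Of these options, automatic retry is the one that most directly translates into a reusable helper. The following is a minimal sketch, not a definitive implementation: the `retry` function, its `maxAttempts` parameter, and the choice to retry immediately without any delay are all assumptions made for illustration.

```typescript
// Hypothetical helper: run a synchronous operation up to maxAttempts times.
function retry<T>(operation: () => T, maxAttempts: number): T {
    let lastError: unknown = new Error("maxAttempts must be at least 1");
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return operation();
        } catch (e) {
            // Remember the error; once attempts run out it propagates to the caller.
            lastError = e;
        }
    }
    throw lastError;
}

// Usage: an operation that fails twice, then succeeds on the third call.
let calls = 0;
const flaky = (): string => {
    calls += 1;
    if (calls < 3) throw new Error(`intermittent failure #${calls}`);
    return "ok";
};
const retryResult = retry(flaky, 5);
```

Note that when the retries are exhausted the helper falls back to the first option: the last error simply propagates.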

“This should never happen” error and closing the feedback loop

We’ve all seen this sort of code before:

if (someCondition) {
    return someValue;
}
else {
    throw "Should never reach here";
}

The programmer has some knowledge about the system they are building that cannot be (or could have been, but was not) encoded in the model using types, so they are forced to throw this awkward-looking error.

How should we handle this error? Should we handle it at all?

In a large and complex system, if there’s an error scenario declared somewhere, it eventually tends to manifest as an actual error. What do we want when that happens? For this type of error, a reasonable response is “we want the programmer to be notified that their assumption failed”.

How can we do that? By “closing the feedback loop”.
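
One possible way to close the loop in code is to route every “should never happen” branch through a single helper that reports the failed assumption before throwing. This is only a sketch under stated assumptions: `unreachable`, `reportToDevelopers`, and the in-memory `reports` array are hypothetical names standing in for whatever error reporting system you actually use.

```typescript
// Hypothetical stand-in for a real error reporting or alerting system.
interface InvariantReport {
    message: string;
    context: Record<string, unknown>;
}
const reports: InvariantReport[] = [];

function reportToDevelopers(report: InvariantReport): void {
    // A real system would notify the responsible programmer here.
    reports.push(report);
}

// Report the failed assumption together with its context, then fail loudly.
function unreachable(message: string, context: Record<string, unknown>): never {
    reportToDevelopers({ message, context });
    throw new Error(`Should never happen: ${message}`);
}

function describeStatus(status: string): string {
    if (status === "active") return "Active";
    if (status === "disabled") return "Disabled";
    return unreachable("unknown status", { status });
}
```

The error is still thrown, but the programmer who made the assumption now finds out that it failed, along with the value that broke it.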

Design error logging to be useful

Consider this code:

fileNames.forEach(fileName => {
    try {
        const content = readContent(fileName)
        doSomething(content)
    }
    catch (e) {
        logger.error("Failed to read file")
    }
})

The error message is informative in that it tells us what happened. But when we get a notification from our error logging system saying that “Failed to read file” occurred, what do we do? The first question we’re probably going to have is “what file?” And at this point, there is no easy way to get that information. In the general case, the best we’ll be able to do is change the message to

logger.error(`Failed to read file ${fileName}`)

redeploy to production, and wait for the error to happen again. Knowing the file name would give us a foothold to start debugging, but we actually have more useful information at the error site that could help us diagnose and resolve the issue: the error itself. We should log it too.

logger.error(`Failed to read file ${fileName}, error was ${e}`)

This gives us all the context we have about the error occurrence, but there are some more considerations to think about when designing for a smooth error handling experience.

In a large system, errors will happen. Dozens, maybe hundreds, maybe thousands per day. Time will be limited as we’re working on new features. So we can only deal with the most critical errors. That means we need to somehow know how critical a given error is. One simple metric is “angry users emailed us about it”. It’s a decent enough metric, but it’s also risky. Out of ten users who repeatedly get an annoying error, nine will probably ditch our product for a competitor’s, and only one will actually email us about the error.

So we need an internal metric for criticality of errors. A reasonable way to go about calculating such a metric is by multiplying a severity score by the number of occurrences during a fixed time period, adjusted for the number of users experiencing the error (since one user experiencing an error 100 times is probably less critical than 50 users experiencing the same error 2 times). The severity score needs to be a manual input to the system: only a human engineer can assess the impact of an error, by examining the code and figuring out how annoying its effects are for the end user. The number of occurrences, though, should be calculated by the error aggregating system.
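
As a rough sketch of such a metric, the function below multiplies the three inputs together. The text above does not prescribe how to adjust for the number of users, so weighting by the plain count of distinct affected users is an assumption, as are the `ErrorStats` and `criticality` names and the 1 to 5 severity scale.

```typescript
interface ErrorStats {
    severity: number;      // assigned manually by an engineer, e.g. 1 (cosmetic) to 5 (blocking)
    occurrences: number;   // counted by the aggregator over a fixed time window
    affectedUsers: number; // distinct users who hit the error in that window
}

// Assumed weighting: scale occurrences by the number of distinct affected users.
function criticality(stats: ErrorStats): number {
    return stats.severity * stats.occurrences * stats.affectedUsers;
}

// One user hitting the error 100 times vs. 50 users hitting it 2 times each.
const oneUnluckyUser = criticality({ severity: 3, occurrences: 100, affectedUsers: 1 });
const manyUsers = criticality({ severity: 3, occurrences: 100, affectedUsers: 50 });
// manyUsers (15000) ranks above oneUnluckyUser (300).
```

Any weighting that ranks the second case above the first would serve the same purpose; the point is that the ranking is computed rather than guessed.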

Aggregators differ in their features. The most simplistic just group errors by the error message. In such a system,

logger.error(`Failed to read file ${fileName}`)

would generate one unique instance per file, and only group occurrences of the same file together. That’s not what we want. We want all “failed to read file” errors to be grouped together, regardless of the file name, to get an accurate occurrence count.

Most aggregators will allow us to separate the message and the contextual data, so we’ll be able to do something like this:

logger.error(
    "Failed to read file",
    {
        fileName: fileName,
        error:    e
    }
)

This way, we will only get one top-level error for “Failed to read file”, and all individual occurrences will be counted under it. We’ll be able to examine each occurrence to see the file names, and if our aggregator allows it, even query and aggregate by file name.
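
To make the grouping behaviour concrete, here is a toy in-memory aggregator. It is a sketch for illustration only: the `Aggregator` class and its `error`, `count`, and `countBy` methods are hypothetical and do not correspond to the API of any particular logging product.

```typescript
interface LogEntry {
    message: string;
    context: Record<string, unknown>;
}

// Hypothetical aggregator that groups occurrences by message only.
class Aggregator {
    private groups = new Map<string, LogEntry[]>();

    error(message: string, context: Record<string, unknown>): void {
        const group = this.groups.get(message) ?? [];
        group.push({ message, context });
        this.groups.set(message, group);
    }

    // Total occurrences recorded under one message.
    count(message: string): number {
        return this.groups.get(message)?.length ?? 0;
    }

    // Break occurrences of one message down by a field of the contextual data.
    countBy(message: string, field: string): Map<unknown, number> {
        const counts = new Map<unknown, number>();
        for (const entry of this.groups.get(message) ?? []) {
            const key = entry.context[field];
            counts.set(key, (counts.get(key) ?? 0) + 1);
        }
        return counts;
    }
}

const aggregator = new Aggregator();
aggregator.error("Failed to read file", { fileName: "a.txt" });
aggregator.error("Failed to read file", { fileName: "b.txt" });
aggregator.error("Failed to read file", { fileName: "a.txt" });
```

Because the file name lives in the context rather than in the message, all three occurrences land in one group, and `countBy("Failed to read file", "fileName")` can still tell the files apart.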

So, learn what features your log aggregation tools offer, and design your error logging to be useful, from the start, instead of later, when you realize that the error you logged doesn’t actually help you diagnose the problem. A good way to do this is to always ask yourself, when you’re logging something, “what information will the receiver of this log message need to debug this issue?”

Errors as first-class citizens

The best way to do clean error handling is to acknowledge when errors are expected to happen, and model for these cases using monadic types like Option and Result. For example, if you have a function like this:

let computeSomething (data: Data) : int =
    // code that in some cases is expected to fail

its return type should not be int, but should instead be Result<int, SomeError>. That way, you are being honest with your callers: you are declaring that if all goes well, you will return an int, but if something goes wrong, they will have SomeError to deal with. The caller is then forced to either handle the error case, or declare their own return type as a Result and let their own caller deal with it.
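
The same idea can be sketched in TypeScript, which has no built-in Result type, so a hand-rolled discriminated union stands in for it here. The names `Result`, `ParseError`, `parseCount`, and `describeCount` are hypothetical and exist only for this example.

```typescript
// A minimal hand-rolled Result type (an assumption: not part of the language).
type Result<T, E> =
    | { ok: true; value: T }
    | { ok: false; error: E };

type ParseError = { kind: "NotAnInteger"; input: string };

// The return type declares that parsing is expected to fail for some inputs.
function parseCount(input: string): Result<number, ParseError> {
    const n = Number(input);
    if (input.trim() === "" || !Number.isInteger(n)) {
        return { ok: false, error: { kind: "NotAnInteger", input } };
    }
    return { ok: true, value: n };
}

// The caller cannot reach `value` without first checking `ok`.
function describeCount(input: string): string {
    const result = parseCount(input);
    if (!result.ok) {
        return `invalid count: ${result.error.input}`;
    }
    return `count is ${result.value}`;
}
```

Accessing `result.value` before the `ok` check is a compile-time error, which is what forces the caller to deal with the failure case.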

Semantically, it’s always correct to model an “error may happen” situation with the Result type, but there are times when convenience may overrule semantics. Sometimes the error case is obvious, and it is sufficient to return an Option instead of a Result, so the None case represents the obvious error scenario. A contrived example:

let findItem (items: List<Item>) (index: int) : Option<Item>

Here the obvious implicit error is “index is out of bounds”. We could have modelled it as

type FindItemError =
| IndexOutOfBounds
let findItem (items: List<Item>) (index: int) : Result<Item, FindItemError>

but this risks adding more noise than value. In cases of single, obvious errors, an Option is usually a good choice.
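
In TypeScript, a union with `undefined` commonly plays the role of Option. The sketch below mirrors the `findItem` signature above; treating a non-integer index as out of bounds is an assumption of this example, and the `fruits` data is made up.

```typescript
// `T | undefined` stands in for Option<T>.
function findItem<T>(items: readonly T[], index: number): T | undefined {
    if (!Number.isInteger(index) || index < 0 || index >= items.length) {
        return undefined; // the single, obvious error: index is out of bounds
    }
    return items[index];
}

const fruits = ["apple", "pear"];
const found = findItem(fruits, 1);
const missing = findItem(fruits, 5);

// With strictNullChecks enabled, undefined must be handled before use.
const label = missing === undefined ? "no item" : missing.toUpperCase();
```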

The right place to handle errors

There are three rough categories that errors tend to fall into, and each category has a different strategy for handling them:

errors that we can meaningfully recover from
unexpected errors
errors happening as a result of processing a user’s action

close the loop with the programmer
beware of human nature, make it easy to do the right thing
handle the error cases in the right place (similar to how validation should be done as close to the source as possible)


Decent to Great
