There are two ways to handle abnormal conditions in a program: errors and assertions.
Errors are a controlled mechanism by which the program propagates details about a faulty condition up the call chain—be it with explicit error return statements or with exceptions. Errors must be used to validate all conditions that might be possible but aren’t valid given the context. Examples include: sanitizing any kind of input (as provided by the user or incoming from the network), and handling error codes from system calls or libraries.
Assertions are a fatal mechanism by which the program crashes immediately when it encounters an invalid state. Assertions are appropriate to validate program logic: in other words, to specify conditions about the state you expect the program will have at a given point. If the state is not what you thought it ought to be, there is a bug: either the assertion is wrong because your mental model about the program is incorrect, or the program’s logic is actually bogus. The program must terminate in either case: continuing with an invalid state can cause more serious problems down the road like data corruption or, worse, a privilege escalation.
Assertions are pretty tragic, aren’t they? They kill the program when they trigger. And they are also easy to misuse, for writing an assertion does not require changing any APIs with regard to error codes. When working on existing code, one can misplace an assertion in a code path where an actual error check should be used. The reasons are varied: expedited “totally-safe one-line fixes”, programmer inexperience, honest mistakes when assuming a piece of data is sane when it isn’t, etc.
To illustrate the problem assertions can cause, let’s talk about a specific query of death (QOD) problem.
The disaster
Let’s consider a distributed system composed of a single frontend controller and a set of backend servers, each responsible for a different part of a dataset. The frontend server receives user requests and redirects them to the backends. One such request is a textual search query that requires the frontend to contact all backends for processing.
Let’s also consider that the programmer has made a little mistake in the backend’s query serving path and has used an assertion to validate the contents of the request from the frontend. After all, we trust the frontend, so the contents of the request must be valid, right? But… the request contains the original query that the user provided so the assertion triggers when this query is malformed.
Now, what happens when a user sends an invalid query, be it malicious or not? The frontend happily redirects the query to all the backends and all the backends trip over the assertion… crashing in unison. Oops. We have a correlated failure against which no replication would have helped. What’s worse: if you have sufficiently smart traffic routing, your routing system might determine that the cluster has died and decide to redirect the query to a different cluster… triggering the same failure on a new set of servers. Rinse and repeat. Our innocent assertion has caused a global cascading failure, taking all of our service down.
Alright, that was bad. You go ahead and write a blameless postmortem in which you identify the assertion as the root cause. As a corrective measure, you fix that piece of code to handle the condition with a graceful error. And as a preventive measure, you reassess the validity of all assertions in your codebase and then decide to forbid assertions in the serving path, project-wide.
The fallacy
Wait, what? Blindly forbidding assertions is not going to fix the problem. This policy change will prevent a possible cause but will not make the service more reliable. After all, this new policy only eliminates a certain class of bugs. What if, next time, the invalid query triggers a correlated crash because of a null-pointer dereference? Or what if handling the query tickles a kernel bug? Or what if you trip over a processor bug? Or what if, instead of crashing, you cause a deadlock and your backends stay alive but stop responding to just a certain kind of query?
There is no way you can eliminate all classes of bugs, and if there is, nobody has discovered it yet. So what do you do? You reengineer the system so that the possibility of a correlated failure doesn’t exist. An idea: instead of sending the query to all servers at once, send the query first to a single backend server and, if it doesn’t reply within a short period of time, abort the query because it’s risky.
A word about Go
And that, I believe, is the rationale behind Go’s ban on assertions, which set me off on a little rant in my Rust vs. Go comparison. After all, Go originated at Google and we have had our fair share of query-of-death problems. In fact, the postmortem story described above is based on at least a couple of real-life situations I’ve previously encountered, though I cannot remember which projects these were about.
I suspect the Go language designers, in their pragmatism, wanted to eradicate this kind of issue… but as we have seen… that’s not of much help: you can still crash a Go program due to bad pointer references and you can deadlock it, so queries of death are still perfectly possible in a Go service. And mind you, they could be possible as well in a Rust service because bugs can be made in any language. So, design your systems for reliability.
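As a small illustration of the "bad pointer references" point, the following Go sketch has no assertion anywhere, yet a nil pointer still produces a runtime panic. The `Query` type and the `Length` and `Crashes` helpers are hypothetical. `recover` is used here only to observe the panic in a demo; in a real server, an unrecovered panic terminates the whole process.

```go
package main

import "fmt"

// Query is a hypothetical request type.
type Query struct{ Text string }

// Length dereferences q without checking for nil: a plain bug, not an assertion.
func Length(q *Query) int { return len(q.Text) }

// Crashes reports whether calling f panics.
func Crashes(f func()) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	f()
	return false
}

func main() {
	fmt.Println(Crashes(func() { Length(nil) }))                // true
	fmt.Println(Crashes(func() { Length(&Query{Text: "ok"}) })) // false
}
```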