Sunday, November 29, 2015

A Collection of Principles for Fail-Fast

Previously I blogged about how EdgeHTML has adopted a fail-fast model: identifying hard-to-recover-from situations and faulting, rather than trying to roll back or otherwise proceed. That post covered a lot of detail on the what and even the how. At the time I didn't establish principles, and since writing that article I've received a lot of questions from my own developers about when to use fail-fast. So here they are: the principles of fail-fast.

Principle #1 All memory allocations shall be checked, and the process shall fail-fast on failure.

Follow the KISS principle and just assume that every memory failure (including a stack overflow) leads to a situation in which, even if you recover, the first-, second- or third-party code will not run correctly.

Exceptions:

Exploratory allocations may be recoverable. Textures are a commonly used resource and are limited in availability, so some systems may have a recovery story for when they can't allocate one. However, even these systems likely have some required memory, such as the primary texture, and that allocation should be demanded (fail-fast if it cannot be satisfied).
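
As a minimal sketch of the split between demanded and exploratory allocations: the names FailFast, AllocOrFailFast and TryAlloc are hypothetical and are not EdgeHTML's actual allocator API. std::abort stands in for whatever fail-fast primitive your platform provides.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: report and terminate immediately, with no unwinding.
[[noreturn]] inline void FailFast(const char* reason) {
    std::fprintf(stderr, "fail-fast: %s\n", reason);
    std::abort();
}

// Required ("demanded") allocation: never returns nullptr.
inline void* AllocOrFailFast(std::size_t bytes) {
    // malloc(0) may legally return nullptr, so request at least one byte.
    void* block = std::malloc(bytes == 0 ? 1 : bytes);
    if (block == nullptr) {
        FailFast("required allocation failed");
    }
    return block;
}

// Exploratory allocation: nullptr is a known, recoverable outcome, and the
// caller is expected to have a fallback (for example, a smaller texture).
inline void* TryAlloc(std::size_t bytes) noexcept {
    return std::malloc(bytes == 0 ? 1 : bytes);
}
```

Callers of AllocOrFailFast need no null check at all, while every caller of TryAlloc is making an explicit promise that it has a recovery story.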

Principle #2 Flow control is ONLY for known conditions. Fail-fast on the unknown.

When writing new code favor fail-fast over continuing on unexpected conditions. You can always use failure telemetry to find common conditions and fix them. Telemetry will not tell you about logic bugs caused by continuing on the unexpected path.

A prime example of this is using enumerations in a switch. It's common practice to put in a non-functional default with an Assert. This is way too nice and doesn't do anything in retail code, where asserts are compiled out. Instead, fail-fast on all unexpected flow control situations. If the default case handles a set of conditions, then put in some code to validate that ONLY those conditions are being handled.
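
Here is a sketch of both switch patterns. The NodeKind enumeration and the FailFast helper are made up for illustration; std::abort stands in for a real fail-fast primitive.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: report and terminate immediately, with no unwinding.
[[noreturn]] inline void FailFast(const char* reason) {
    std::fprintf(stderr, "fail-fast: %s\n", reason);
    std::abort();
}

// Hypothetical enumeration used only for this example.
enum class NodeKind { Element, Text, Comment };

// Pattern 1: every known value has a case; anything else is fatal,
// in retail builds as well as debug builds.
inline const char* NodeKindName(NodeKind kind) {
    switch (kind) {
        case NodeKind::Element: return "element";
        case NodeKind::Text:    return "text";
        case NodeKind::Comment: return "comment";
    }
    // Reached only when kind holds a value outside the enumeration.
    FailFast("unexpected NodeKind");
}

// Pattern 2: the default case handles a set of values, so it validates
// that ONLY that set (Text and Comment) is flowing through it.
inline bool IsCharacterData(NodeKind kind) {
    switch (kind) {
        case NodeKind::Element:
            return false;
        default:
            if (kind != NodeKind::Text && kind != NodeKind::Comment) {
                FailFast("default case reached with an unvetted NodeKind");
            }
            return true;
    }
}
```

If someone later adds a new enumerator, pattern 2 crashes the first time the new value reaches the default instead of silently treating it as character data.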

Third-party code is not an excuse. It is even more important that you use fail-fast to help you establish contracts with your third-party code. An example is a COM component that returns E_OUTOFMEMORY. This is not a SUCCESS or S_OK condition. It's NOT expected. Using fail-fast on this boundary will provide the same value as using fail-fast in your own memory allocator.
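
A sketch of what a check on that boundary could look like. To keep the example portable, HResult, kS_OK and kE_OUTOFMEMORY are stand-ins for the Windows HRESULT type and constants (real code would take them from the Windows headers), and CreateDecoder is a hypothetical third-party call.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Stand-ins (assumptions) for HRESULT, S_OK and E_OUTOFMEMORY.
using HResult = std::int32_t;
constexpr HResult kS_OK = 0;
constexpr HResult kE_OUTOFMEMORY = static_cast<HResult>(0x8007000EU);

// Hypothetical helper: report the code and terminate immediately.
[[noreturn]] inline void FailFast(const char* reason, HResult hr) {
    std::fprintf(stderr, "fail-fast: %s (hr=0x%08lX)\n", reason,
                 static_cast<unsigned long>(static_cast<std::uint32_t>(hr)));
    std::abort();
}

// Same rule as the FAILED() macro: negative values are failures.
constexpr bool Failed(HResult hr) { return hr < 0; }

// Wrap every call that crosses the component boundary.
inline HResult FailFastIfFailed(HResult hr) {
    if (Failed(hr)) {
        FailFast("third-party call failed", hr);
    }
    return hr;
}

// Hypothetical third-party call, simulated so the example is runnable.
inline HResult CreateDecoder(bool simulateOutOfMemory) {
    return simulateOutOfMemory ? kE_OUTOFMEMORY : kS_OK;
}
```

A call site then reads FailFastIfFailed(CreateDecoder(false)); and an out-of-memory result from the component terminates the process at the boundary, exactly as a failed allocation in your own allocator would.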

Exceptions:

None. If there is a condition that should be recovered then it is a KNOWN condition and you should have a test case for it. For example, if you are writing code for the tree mutations caused by JavaScript operations on the Browser DOM, then there are known error recovery models that MUST be followed. No fail-fast there because the behavior is spec'ed and failure must leave the tree in a valid state. Maybe not an expected state for the developer using the API, but at least spec'ed and consistent.

Principle #3 Use fail-fast to enforce contracts and invariants consistently

Contracts are about your public/protected code. If you expect a non-null input, then enforce that with a fail-fast check (not much different from the allocation check). Or, as before with enumerations, if you expect a certain range, then fail-fast on the out-of-bounds conditions as well. When transitioning from your public to your private code you can use a more judicious approach, since oftentimes parameters have been fully vetted through your public interface. Still, obey the control flow principles.
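
A small sketch of enforcing the contract once, at the public boundary. The Label class, its alignment range and the FailFast helper are all hypothetical.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical helper: report and terminate immediately, with no unwinding.
[[noreturn]] inline void FailFast(const char* reason) {
    std::fprintf(stderr, "fail-fast: %s\n", reason);
    std::abort();
}

// Hypothetical valid range for the alignment parameter.
constexpr int kAlignFirst = 0;
constexpr int kAlignLast = 2;

class Label {
public:
    // Public entry point: the contract is enforced here.
    void Set(const char* text, int alignment) {
        if (text == nullptr) {
            FailFast("Label::Set: text must not be null");
        }
        if (alignment < kAlignFirst || alignment > kAlignLast) {
            FailFast("Label::Set: alignment out of range");
        }
        SetInternal(text, alignment);
    }

    const std::string& text() const { return text_; }
    int alignment() const { return alignment_; }

private:
    // Private code: inputs were already vetted by the public interface,
    // so the checks are not repeated here.
    void SetInternal(const char* text, int alignment) {
        text_ = text;
        alignment_ = alignment;
    }

    std::string text_;
    int alignment_ = kAlignFirst;
};
```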

For variable manipulation within your component, rely on checks for your invariants. For instance, if your component cannot store a value larger than a short, then ensure that downcasts aren't truncating, and fail if they are. This classically becomes a problem when moving between 32-bit and 64-bit code, when all of a sudden arbitrary code can manipulate values larger than originally designed for.
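
One way to check a downcast is a round-trip comparison, similar in spirit to the narrow helper in the C++ Guidelines Support Library. This is a sketch; CheckedNarrow and FailFast are hypothetical names.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <type_traits>

// Hypothetical helper: report and terminate immediately, with no unwinding.
[[noreturn]] inline void FailFast(const char* reason) {
    std::fprintf(stderr, "fail-fast: %s\n", reason);
    std::abort();
}

// Converts between integer types and fails fast if the value changes.
template <typename To, typename From>
To CheckedNarrow(From value) {
    static_assert(std::is_integral<To>::value && std::is_integral<From>::value,
                  "CheckedNarrow only supports integer types");
    const To narrowed = static_cast<To>(value);
    // The round trip catches truncation; the sign comparison catches
    // signed/unsigned conversions that flip the sign.
    const bool roundTrips = static_cast<From>(narrowed) == value;
    const bool sameSign = (narrowed < To{}) == (value < From{});
    if (!roundTrips || !sameSign) {
        FailFast("CheckedNarrow: value does not fit in the destination type");
    }
    return narrowed;
}
```

For example, CheckedNarrow<std::int16_t>(std::int64_t{1234}) returns 1234, while passing 70000 terminates the process instead of quietly storing a truncated value.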

While a sprinkling of fail-fast around your code will eventually catch even missed invariant checks, the more consistently you use them, the closer your telemetry will be able to point you to the sources of failure.

Exceptions:

None. Again, if you find a condition hits too often, then you'll be forced to understand and supply a fix for it. Most likely a localized fix that has little or no impact on propagating errors to other surrounding code. For instance, truncation or clamping can be a designed (and perfectly acceptable) part of the component depending on its use case.

Principle #4 If you are unsure whether or not to use fail-fast, use fail-fast

This is the back-stop principle. If you find yourself unable to determine how a component will behave or what it might return (this can happen with black box APIs or even well-documented but closed APIs), then resort to fail-fast until you get positive confirmation of the possibilities.

As an example, some COM APIs will return a plethora of COM error codes, and you should not arbitrarily try to recover from the various failures or figure out up front which codes can and can't be returned. By using fail-fast and your telemetry pipeline, you'll be able to find and resolve the sets of conditions that are important to your application, and you'll have confidence that your solutions fix real-world problems seen by your users.
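
A sketch of how that evolves in code: start with only the success code as a known condition, and promote an error code to real flow control once telemetry shows it matters. HResult and the two constants are portable stand-ins for Windows definitions, and FetchOutcome and Classify are hypothetical.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Stand-ins (assumptions) for HRESULT, S_OK and
// HRESULT_FROM_WIN32(ERROR_FILE_NOT_FOUND).
using HResult = std::int32_t;
constexpr HResult kS_OK = 0;
constexpr HResult kFileNotFound = static_cast<HResult>(0x80070002U);

// Hypothetical helper: report the code and terminate immediately.
[[noreturn]] inline void FailFast(const char* reason, HResult hr) {
    std::fprintf(stderr, "fail-fast: %s (hr=0x%08lX)\n", reason,
                 static_cast<unsigned long>(static_cast<std::uint32_t>(hr)));
    std::abort();
}

// The outcomes this caller actually knows how to handle.
enum class FetchOutcome { Loaded, NotFound };

inline FetchOutcome Classify(HResult hr) {
    switch (hr) {
        case kS_OK:
            return FetchOutcome::Loaded;
        case kFileNotFound:
            // Promoted to a known condition after it showed up in telemetry;
            // it should now have a test case of its own.
            return FetchOutcome::NotFound;
        default:
            // Every other code is still unknown: crash, collect, then decide.
            FailFast("unrecognized error code from API", hr);
    }
}
```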

Oddly, this is even more critical when working on pre-release operating systems, services or APIs. Often the introduction of a new error code, or an increase in a specific set of error codes, is indicative of an OS-level bug. By tightening the expectations of your application on a specific API surface area, you become one of the pinning tests for that API. While APIs do change, having unexpected or new behavior propagate through your application in unintended ways is a bug waiting to happen. Better to crash and fix than to proceed incorrectly.

Exceptions:

Yes, of the fail-fast variety please ;-)
