Sunday, June 14, 2015

Improving Reliability by Crashing

When we describe the reliability of a piece of software, we probably apply traits like: never crashes, never hangs, is always responsive, and doesn't use a lot of memory. About the only pieces of software that meet all of these criteria are simple utilities with years of development poured into fixing bugs and little or no improvement to the feature set. Maybe Notepad or Calc comes to mind.

We don't tend to spend much time in that software, though. Software has to have a lot of functionality and features before we allow ourselves to spend enough time in it to really stress it out. Instead, we spend our time in large, complex programs. In fact, you may be one of the people who spends around 70% of your time in a web browser, whether it be Chrome, Safari, Firefox, or Internet Explorer. Likely nobody considers these programs to be very reliable. Yet we conduct business in them, file our taxes, pay our bills, and connect to our banks. We complain about how often they crash and how much memory they consume, but fail to recognize how many complex behaviors they accomplish for us before going down in a ball of flames.

So this article is about web browsers, and more specifically Internet Explorer. It's about a past of expectations and assumptions. And it's about a future where crashing more means crashing less. Hopefully you find that as intriguing as I do ;-)

The Past - Reliability == 100% Up-Time

The components that make up Internet Explorer run in some pretty extreme environments. For instance, WinInet and Urlmon, our networking stack, run in hundreds of thousands of third-party applications, powering a large portion of the world's networking traffic. And due to their history, they are heavily extensible.

With so many consumers, you'd imagine that every crash we fix and every bit of error recovery we put in would lead to code which is highly robust to failure. That this robustness would mean all of those hundreds of thousands of applications would be just as robust as we are. That they listen to every error code we return and take immediate, responsible action to provide a crashless and infallible user experience. And here I interject that you should vote on the <sarcasm> tag for HTML 5.1 so we can properly wrap comments like those I just made.

The reality is that they themselves are not robust. No error checking, no null checking, continuing through errors, catching exceptions and continuing: so many bad things that we can't even detail them all, though gurus like Raymond Chen have tried. But at least we didn't crash, and this made the world a better place. We provide the unused ability to be robust, at great expense to our own code.

To build a robust component, you can either start with principles that enable robustness, such as two-phase commit, or go through increasingly expensive iterations of code hardening. Let's talk about each of these.

Two-Phase Commit

To keep the definition simple here: you first acquire all the resources you'll need for the operation. If that succeeds, you commit the transaction. If it fails, you roll back the operation. You can imagine a lot of software and algorithms aren't really built with this idealistic view in mind. However, the model is proven and used extensively in databases, financial software, distributed systems, and networking.

It is a huge tax, though. And it only works if you can reasonably implement a commit request phase. In this phase, you'll ask all sub-systems to allocate all of the memory, stack, or network resources they might need to complete the task. If any of them fails, you don't perform the operation. Further, you have to implement a roll-back mechanism to give back any resources that were successfully acquired.
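
As a rough illustration, here is a minimal sketch of that shape in C++; the Transaction type and its acquire/roll-back machinery are hypothetical stand-ins, not anything from a real browser:

    #include <functional>
    #include <vector>

    // Hypothetical two-phase commit sketch: record an undo action for each
    // successful acquisition so a failed prepare phase can be rolled back.
    class Transaction {
        std::vector<std::function<void()>> undoActions;
    public:
        // Phase 1 helper: remember how to release each acquired resource.
        bool Acquire(bool succeeded, std::function<void()> undo) {
            if (succeeded) undoActions.push_back(std::move(undo));
            return succeeded;
        }
        // Give back everything acquired so far, in reverse order.
        void Rollback() {
            while (!undoActions.empty()) {
                undoActions.back()();
                undoActions.pop_back();
            }
        }
        // Phase 2: keep the resources; the operation can no longer fail
        // for lack of them.
        void Commit() { undoActions.clear(); }
    };

A caller would invoke Acquire once per resource, Rollback as soon as any acquisition fails, and Commit only after every acquisition has succeeded.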

In a browser with large sub-systems like layout and rendering, alongside author-supplied programs in the form of scripts and libraries, such a system is prohibitively expensive. While some aspects of the browser could be ascribed to the two-phase commit model, how could you orchestrate the intersections and compositions of all the possible ways those sub-systems could come together to create transactions? Good thing we have another model which might work ;-)

Hardening

Hardening is detecting a state that would cause an unrecoverable failure in your code and employing a corrective measure to avoid that state. In the simplest form, such a state would be an OOM, or Out Of Memory, condition that in turn propagates a null pointer back through your code. The allocator could also throw, in which case your model changes to RAII and some form of exception handling instead of error-state propagation.

With this type of programming, the hardening is done through propagation of error codes. In COM, we use HRESULTs for this. When memory fails to allocate we use E_OUTOFMEMORY, so we have to translate memory allocator failures into this code. But in addition, you have to initialize objects. So you end up with allocator methods that have pointers to return but can also fail with more than one error code, something other than just E_OUTOFMEMORY. Also, once the error codes are introduced, they propagate through your function definitions, and many functions must suddenly change their signatures. I've coded what I think is the most basic form of this, handling just a couple of the initial failures you would run into, and it is still almost 100 lines of code.
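
The original snippet was embedded and hasn't survived here, but a heavily abridged sketch of the shape it described might look like the following; Widget and CreateWidget are hypothetical names, and the real version handled many more failure paths in close to 100 lines:

    #include <new>        // std::nothrow
    #include <windows.h>  // HRESULT, E_POINTER, E_OUTOFMEMORY, S_OK

    struct Widget {
        // Real initialization could fail for reasons beyond memory.
        HRESULT Initialize() { return S_OK; }
    };

    // Every allocation and initialization is checked, every path cleans up
    // manually, and the HRESULT infects the signature of every caller.
    HRESULT CreateWidget(Widget** ppWidget)
    {
        if (ppWidget == nullptr) return E_POINTER;
        *ppWidget = nullptr;

        Widget* pWidget = new (std::nothrow) Widget();
        if (pWidget == nullptr) return E_OUTOFMEMORY;  // allocator failure -> HRESULT

        HRESULT hr = pWidget->Initialize();            // a second failure source
        if (FAILED(hr)) {
            delete pWidget;                            // manual clean-up on this path
            return hr;
        }

        *ppWidget = pWidget;
        return S_OK;
    }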


Can you make this better? Well, yes: with some structure you can. You can use a pattern known as RAII (Resource Acquisition Is Initialization) to handle the clean-up cases more elegantly and automatically. You can also use exceptions with RAII to protect sub-trees of code from being converted into error propagators. You have to augment this with actual exceptions that can be thrown for each error case, but that is rather trivial.
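
As a sketch of that direction, again with a hypothetical Widget: the constructor and allocator throw on failure, and RAII makes the clean-up automatic, so the happy path is the only path written out:

    #include <memory>

    struct Widget {
        Widget() { /* throws on failure instead of returning an error code */ }
    };

    // unique_ptr releases the Widget automatically if anything below throws,
    // so there is no hand-written recovery code in this sub-tree at all.
    std::unique_ptr<Widget> CreateWidget()
    {
        auto widget = std::make_unique<Widget>();  // throws std::bad_alloc on OOM
        // ... further set-up that may throw; no explicit clean-up needed ...
        return widget;
    }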

In terms of the callers, you'll need to ensure that there is always someone to catch your thrown exception. Part of our goal in using RAII + exceptions is to avoid code that handles errors. If we find ourselves frequently introducing try/catch blocks, then the amount of code increases and we are still spending much of our time implementing error handling and recovery logic.

At some point the argument of cleanliness or readability comes to bear, and whether you use method signatures or exceptions becomes a matter of style or preference. Suffice it to say, having looked at very large codebases employing every form and variation, the percentage of code you write specifically to harden and recover is about the same either way.

Stress

How do we know what and when to harden? Well, we commit acts of violence against our own code in the form of stress. We randomly fail allocations at arbitrary locations in the code. We spin up threads that take locks at various times to change the timing of the threads which actually need them. When one of something would suffice, we do ten instead. We do things in parallel that are normally done sequentially. All of these induce failures that a user may or may not see during actual usage, but since stress is one of the few mechanisms for exercising millions of lines of error recovery code, we have to fix each and every crash it finds to unblock the next one.
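
A toy version of the allocation-failure half of that could look like the following; the one-in-a-thousand rate is an arbitrary knob, not our actual stress configuration:

    #include <cstdlib>
    #include <new>
    #include <random>

    // Fault-injecting global allocator: under stress, roughly 1 in 1000
    // allocations fails even though plenty of memory is available.
    void* operator new(std::size_t size)
    {
        static thread_local std::mt19937 rng{std::random_device{}()};
        static thread_local std::uniform_int_distribution<int> oneInAThousand(0, 999);
        if (oneInAThousand(rng) == 0)
            throw std::bad_alloc();        // induced failure
        if (void* p = std::malloc(size))
            return p;
        throw std::bad_alloc();            // genuine failure
    }

    void operator delete(void* p) noexcept { std::free(p); }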

Just like static analysis tools, stress can produce more bugs than a team can reasonably handle. It can also create states that are impossible to debug and trace back, depending on when the original failure occurred that led to the final crash. The more you harden your code, the less likely you are to discover the path that led to the failure you are seeing. After all, many hours of run time could mean hundreds of thousands of recovered errors led you to where you are, any one of which could have induced the state that caused your ultimate failure. Pretty tricky, huh?

Once the product is released, the users become your stress and you get real-world usage. Those crashes will be even less actionable than your stress crashes, since you won't be able to log or record how you got somewhere, and a user could have hit the condition after many days of use. You also can't always upload the entire crash state, depending on the user's preferences.

Basically, hardening and stress follow a curve that very quickly stops paying off for the development team. This can actually be good for a waterfall design approach, though, since you will get the most benefit and find the most impactful bugs early in your stress runs, and they'll become increasingly less important the closer you get to shipping. Any truly "need to fix" bugs will still bubble up via hit counts and can be correlated with user data. As a developer, this drives me crazy as I stare at the endless ocean of crashes that I know are issues but will never fix due to their hit counts.

Finding an Alternate Failure Model

So we know that two-phase commit is probably too expensive, or even impossible, in some large software projects. We also know that hardening and stress testing to increase reliability have their limits. They eventually lead to code which has few invariant conditions and simply checks for every possible thing. This is the type of code where you find a now-infamous line that will leave you scratching your head every time:
if (this == nullptr) return; // Don't crash if we get passed a bad pointer
That is one of the final incarnations of hardening in your code: people calling instance methods on null pointers, and you protecting all callers because you can't be bothered to go fix them all.

This brings us to a juncture where we can begin to explore another failure model. It needs to improve on some of the deficiencies of hardening while at the same time helping us achieve equal or greater reliability numbers. So what do we want from it?

  1. Reduce the lines of code dedicated to checking and propagating errors. We found that upwards of 10% of the lines of code in a project could be just failure recovery. We also found that these lines were often not covered by our existing test cases, and you can't use stress to drive code coverage, as the cost is prohibitive.
  2. Allow developers to establish invariant conditions in the code. If you can't even say your this pointer is guaranteed to be non-null, then you have some problems. But why stop there? Why not also be able to state that certain members are initialized (hardening could leave partial initialization in play), or that a component you depend on ONLY returns valid values in the configurations you care about?
  3. Make stress bugs about product issues rather than extreme environmental conditions. Bugs become immediately actionable because the crash occurs where the failure is first seen, and stress and user-reported crashes can be strongly correlated.
The failure model that meets these requirements is fail-fast. It is almost the opposite of hardening. You have almost no error recovery code. You can offer almost no guarantees to your host that you won't crash them. The past, 100% up-time, is gone. Users and stress alike crash fast and crash early, and if those crashes are prominent, they get fixed by improving our understanding of the code. Our unit tests can exercise all possible code paths because there are fewer of them. Our code has invariant conditions, so when others use us incorrectly, they are immediately alerted.
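
A minimal sketch of what an invariant check might look like under this model; FAIL_FAST_IF is a hypothetical macro, and std::abort stands in for a real termination primitive that preserves debugging state:

    #include <cstdlib>

    // Terminate immediately; a real implementation would preserve the state
    // a crash dump needs rather than try to recover.
    [[noreturn]] inline void FailFast() { std::abort(); }

    #define FAIL_FAST_IF(cond) do { if (cond) FailFast(); } while (0)

    struct Element {
        Element* m_parent = nullptr;
        void SetParent(Element* parent) {
            // Invariant: callers must hand us a real parent. A bad caller
            // crashes here, immediately, instead of corrupting state later.
            FAIL_FAST_IF(parent == nullptr);
            m_parent = parent;
        }
    };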

Seems fun, but won't that lead to LESS reliability, MORE crashes, and MORE unhappy customers?

Crash Less by Crashing More

The principle behind fail-fast is that you will crash less once you fix your first few waves of induced crashes. The following steps are a guide to implementing fail-fast on an existing code base. It's a refactoring tutorial, if you will. The fall-out from each step is also clearly explained.
  1. Create a mechanism for reliably terminating your application which maintains the state necessary for you to debug and fix the reason for termination. For now I will call this abandonment, since it is a term we use. We also use the verb induce to describe the process by which a component requests abandonment. [This stage creates only opportunity, and no fall-out.]
  2. Upgrade your memory allocation primitives to induce abandonment (see the sketch after this list). This is most likely your worst offender for error recovery code, bar none, and all of those null checks spread across your code are definitely not helping you. If you are running with a throwing new you might be in better shape ;-) [This stage will likely be painful. You'll find places where your system was getting back null even with plenty of memory available. You'll find places where you allocated things at the wrong size because of bad math. Fix them!]
  3. Work from the leaves and remove error recovery initiators in favor of new abandonment cases. You can introduce new mechanisms of abandonment to collect the right information so you again have enough to debug. [This stage will be less painful. For every 100 conversions you'll find 1 or 2 gross bugs where things were failing and recovering but were creating customer-facing bugs. Now they create crashes you can fix instead.]
  4. Work your way outward and fix all of the error propagators. If they have no errors to propagate, this is easy. If there are still error propagators that they call, then you can induce abandonment for unexpected error codes. This can help you quickly understand whether entire code regions are ready for improvement: if a root function never receives a propagated error, it likely means none of the children ever really generate them. [By this stage you should already be generating fewer crashes during stress than you did while hardening. It seems counter-intuitive, but simpler code with fewer conditions, heavily tested, is just more reliable.]
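
For step 2, a sketch of what an abandonment-inducing allocation primitive might look like; InduceAbandonment is a hypothetical name, with std::abort again standing in for the real termination mechanism:

    #include <cstdlib>
    #include <new>

    [[noreturn]] inline void InduceAbandonment() { std::abort(); }  // stand-in

    // Allocation never returns null to callers: failure abandons the process
    // on the spot, so every downstream null check can be deleted.
    void* AllocateOrAbandon(std::size_t size)
    {
        void* p = ::operator new(size, std::nothrow);
        if (p == nullptr)
            InduceAbandonment();  // crash here, with the failing size on the stack
        return p;
    }
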
I work on a huge code base, and our experience with fail-fast, in just a single release, has yielded an EdgeHTML which is nearly twice as reliable as its counterpart, MSHTML. That is pretty impressive, and is based on data from our stress infrastructure. We have other telemetry which paints a similar story for the user-facing experience.

End users may actually see more crashes up front while we get a handle on the things that stress has missed. We had over 15 years of hardening to account for, so we are in the infancy of reintroducing our invariant conditions and converting code through stages 3 and 4 above. Each crash we get from a user will be a deep insight into an invariant condition, to be understood and fixed in a way that further improves the system. In the old world, that crash would have meant a serpentine analysis of logic and code flow through multiple robust functions all gracefully handling the error condition until we found the one that didn't, patching it, and creating a new serpentine path for the next crash.

I'll convert the first snippet into the equivalent fail-fast code to show you the differences. It also gives some insight into how much code and commenting gets to disappear with this model. Note, we didn't have any real control flow in our example, but fail-fast doesn't mean that control flow disappears. Functions that return different states continue to do so. Those that only return errors in extreme failure cases move to inducing abandonment.
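
Since the original snippets were embedded and haven't survived here, here is a sketch of what the fail-fast conversion of the earlier hypothetical CreateWidget could look like:

    // The fail-fast counterpart of the earlier sketch: allocation abandons
    // instead of returning null, initialization enforces its own invariants,
    // and the signature no longer carries an HRESULT for callers to check.
    Widget* CreateWidget()
    {
        Widget* pWidget = new Widget();  // OOM induces abandonment inside new
        pWidget->Initialize();           // invariants inside, no error code out
        return pWidget;                  // callers never see null
    }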
