Saturday, June 27, 2015

Why Event Constructors are Way Better than Methods

Most developers will never need to deal with synthetic events and event constructors. You should be entirely fine hooking existing events, whose handlers already receive a fully filled-out event object specific to the event being fired. Let's take a couple of examples of existing events: an onload handler, which is really basic, and an onclick handler, which carries a ton of information.

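A minimal version of those handlers might look like this; each one logs the event's type and walks its prototype chain (the exact logging format is illustrative):

    // Walk an event's prototype chain and log the interface names.
    function logEventChain(e) {
      var names = [];
      for (var p = Object.getPrototypeOf(e); p !== null; p = Object.getPrototypeOf(p)) {
        names.push(p.constructor.name);
      }
      console.log(e.type + ": " + names.join(" -> "));
    }

    window.onload = function (e) {
      logEventChain(e); // load: Event -> Object
    };

    document.onclick = function (e) {
      logEventChain(e); // click: MouseEvent -> UIEvent -> Event -> Object
    };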

The logging output from this shows us that the load event fires a plain Event, inheriting only from Event, while the click event fires a MouseEvent, inheriting from both UIEvent and Event. This will be key to understanding why event constructors are a very necessary improvement to the specs. The inheritance of more specific events from less specific events creates a versioning issue that has to be solved if you want to add new members to the base event types without forcing existing code to change. This becomes more obvious when we look at the init methods and how they are organized.

Init Methods

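Paraphrasing the DOM Level 2/3 Events surface (IDL abridged to just the parts discussed here):

    // Creation: one factory method, keyed by an interface name.
    var ev = document.createEvent("Event");      // returns an Event
    var ui = document.createEvent("UIEvent");    // returns a UIEvent
    var me = document.createEvent("MouseEvent"); // returns a MouseEvent

    // Initialization: one init method per level of the hierarchy.
    // Event.initEvent(type, bubbles, cancelable)
    // UIEvent.initUIEvent(type, bubbles, cancelable, view, detail)
    // MouseEvent.initMouseEvent(type, bubbles, cancelable, view, detail,
    //     screenX, screenY, clientX, clientY,
    //     ctrlKey, altKey, shiftKey, metaKey, button, relatedTarget)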

At each level in the hierarchy we have both a way to create an event of a given type and a method which initializes ALL of its members. For Event.initEvent there are 3 members: the type of the event (which is really more like a name), whether or not it bubbles, and whether or not it is cancelable. Pretty basic stuff. So to create and initialize an Event we'd run the following code.

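Something along these lines (the event name "my-event" is just an example):

    var ev = document.createEvent("Event");
    ev.initEvent("my-event", true /* bubbles */, true /* cancelable */);
    document.body.dispatchEvent(ev);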

A mouse event should be pretty easy too then, right? Well, not really. The first thing to note is that the MouseEvent.initMouseEvent method takes 15 parameters. The first 3 of those parameters are the same as initEvent from earlier. Hmm, and we have a UIEvent in our hierarchy, so maybe some of those other parameters are also inherited. It turns out UIEvent.initUIEvent does exist and takes 5 parameters (3 from initEvent and the remaining 2 specific to itself). Let's create a mouse event so we can see how tedious it is. We will also initialize a new mouse event from an existing one, which frameworks commonly do to overcome browser quirks around preventDefault and returnValue behaviors.

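A sketch of both, assuming we already have a real mouse event named e that we want to copy:

    // Create and initialize a synthetic click: all 15 parameters, in order.
    var click = document.createEvent("MouseEvent");
    click.initMouseEvent("click", true, true, window, 0, // type, bubbles, cancelable, view, detail
        0, 0, 0, 0,                 // screenX, screenY, clientX, clientY
        false, false, false, false, // ctrlKey, altKey, shiftKey, metaKey
        0, null);                   // button, relatedTarget

    // Copy an existing mouse event e: read back every member and hope
    // you keep all 15 in the right order.
    var copy = document.createEvent("MouseEvent");
    copy.initMouseEvent(e.type, e.bubbles, e.cancelable, e.view, e.detail,
        e.screenX, e.screenY, e.clientX, e.clientY,
        e.ctrlKey, e.altKey, e.shiftKey, e.metaKey,
        e.button, e.relatedTarget);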

It isn't easy to figure out and type 15 parameters. Many of the parameters have the same types, and JavaScript is so nice that most of the time it would silently do conversions for you even if you got the order wrong. This all conspires to make it hard to "get it right" when filling out an event manually. The copy itself is very verbose as well: we have to duplicate everything and ensure we get the ordering right, because if we transposed the ctrl and alt keys, how would anyone ever know? That could fly under the radar on a top site for years.

We can also see that more derived methods are composed from their base class implementations. This means changing initEvent would cause all of the derived methods to shift their parameters, which would easily break the web, since shipped code doesn't adapt to that kind of change. We also don't want to add a new method for each version while still maintaining support for the older stuff. We "could" kind of do this, because these methods require a specific number of parameters and we could switch on that, but even then the developer would often get it wrong and the browser couldn't provide meaningful feedback, so the developer experience would be pretty poor.

So how can we solve this problem and get rid of these nasty init methods?

Event Initialization Dictionaries


It turns out, named parameters in other languages try to solve exactly this problem. By making all arguments optional with defaults and allowing the user to specify only the ones they want to change, by name, you get an approach that versions pretty well. In JavaScript we can implement this functionality as a dictionary: a set of name/value pairs passed into the initialization logic instead of a list of positional values.

Dictionaries in our case are defined in WebIDL. They have many of the same properties as interfaces. They can be derived (a good thing, since we need UIEventInit to derive from EventInit), they can contain any number of named properties, and each property can be given a default for when it isn't present. The ordering of properties no longer matters, since we grab them by name. Properties can be added to any base dictionary and they'll automatically be inherited by the derived dictionaries.

Finally! We have the ability to enhance the event model and move things around to where they make sense, and to do so without completely breaking all of the shipped code on the web.

Here is a peek at the dictionaries we use for MouseEvent. Note we eat the complexity here, not you. All you have to do is pass us a simple object overriding only those properties you want, and we'll do the rest.

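Abridged from the spec's WebIDL (defaults shown; a few members and later additions omitted for brevity):

    dictionary EventInit {
      boolean bubbles = false;
      boolean cancelable = false;
    };

    dictionary UIEventInit : EventInit {
      Window? view = null;
      long detail = 0;
    };

    dictionary MouseEventInit : UIEventInit {
      long screenX = 0;
      long screenY = 0;
      long clientX = 0;
      long clientY = 0;
      boolean ctrlKey = false;
      boolean shiftKey = false;
      boolean altKey = false;
      boolean metaKey = false;
      unsigned short button = 0;
      EventTarget? relatedTarget = null;
    };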

Event Constructors


Now we are ready to tackle event constructors themselves. This is how you build a new event and initialize it at the same time. The created object is still "not dispatched", so you can also call initMouseEvent or any of the less derived init methods afterwards to change its values, but ideally everything you need to set is done during construction.

All event constructors take the form new EventType(DOMString type, optional EventInit dict).

When you don't specify a dictionary, this is equivalent to createEvent, except that none of the "special names" like "MouseEvents" (note the s) are available. You must use the actual DOM interface name, which is where the constructor is implemented.

Let's rewrite our mouse events example now using event constructors and see if we can save ourselves from pulling our hair out.

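A sketch, assuming as before an existing mouse event e that we want to copy:

    // Create and initialize in one step; everything we don't mention
    // takes its default from the dictionary definition.
    var click = new MouseEvent("click", {
      bubbles: true,
      cancelable: true,
      view: window
    });

    // Copy an existing mouse event e: the event itself can serve as the
    // dictionary, because its property names match the dictionary keys.
    var copy = new MouseEvent(e.type, e);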

This code is almost readable. We had to specify only 3 parameters this time to match our previous initMouseEvent case, and the dictionary is very consumable: the names in the dictionary match the names on the final event. This is actually so powerful that we can pass the first event as the dictionary for our copied event and it will actually work! A dictionary follows the standard JavaScript duck-typing patterns we've all grown used to.

Support for Event Constructors


All of the major browsers now support event constructors in their latest releases, though support is not 100% complete or interoperable in all cases. For instance, in EdgeHTML we prioritized the most widely used event constructors, so our support across all events won't ship in Windows 10. Here is a quick list of the events whose constructors we don't support; if you see something on the list that you'd like prioritized, feel free to file a Connect bug against us, ping me on Twitter, or reply here, and we'll see what we can do to bump up the priority.
AnimationEvent, AudioProcessingEvent, BeforeUnloadEvent, CloseEvent, DeviceMotionEvent, DeviceOrientationEvent, DragEvent, ErrorEvent, GamepadEvent, IDBVersionChangeEvent, MSGestureEvent, MSManipulationEvent, MSMediaKeyMessageEvent, MSMediaKeyNeededEvent, MSSiteModeEvent, MessageEvent, MutationEvent, OfflineAudioCompletionEvent, OverflowEvent, PageTransitionEvent, PopStateEvent, ProgressEvent, SVGZoomEvent, StorageEvent, TextEvent, TouchEvent, TrackEvent, TransitionEvent
Even though we are providing an easier and more consumable way to create these native event types, the new events are still synthetic, and when dispatched they may still not behave like the real thing. Browsers have a lot of protections to ensure the integrity of user-initiated actions. Real events generated by the browser have their isTrusted property set to true; you can't set this flag when building a synthetic event. Also, certain actions, like opening things in another tab via the ctrl key or middle mouse button, are based on the input stack state and not the state in the event. We've recently seen bugs where pages craft a new event and dispatch it to try to change the behavior associated with the user action.

Futures


Event constructors are fairly new, so there are still some things that don't work spectacularly. It is non-trivial to derive a new event type from an existing one. You have to start by having the browser create the most derived type it can, and from there you have to swap out the prototype chain. Some things will work with this, such as instanceof, but others might not, such as the way your object stringifies.

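A sketch of the prototype-swapping dance (MyMouseEvent is a made-up name, and Object.setPrototypeOf support varies by browser):

    function MyMouseEvent(type, dict) {
      // Have the browser build the most derived native type it can...
      var e = new MouseEvent(type, dict);
      // ...then splice our prototype in front of its chain.
      Object.setPrototypeOf(e, MyMouseEvent.prototype);
      return e;
    }
    MyMouseEvent.prototype = Object.create(MouseEvent.prototype, {
      constructor: { value: MyMouseEvent }
    });

    var e = new MyMouseEvent("click", { bubbles: true });
    console.log(e instanceof MyMouseEvent);         // true
    console.log(e instanceof MouseEvent);           // true
    console.log(Object.prototype.toString.call(e)); // still "[object MouseEvent]"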

Some more complicated behaviors we discussed, such as catching an in-flight event, cloning it, and doing your own dispatch, tend not to have well-defined behavior across all browsers. As time goes on, more work will have to be done in the standards around whether or not synthetic events should have the same side effects as real events. While the side effects of "click" are well understood, could you imagine sending a mouse move event and having it invoke "hover" behavior? And if you never send another mouse move or mouse up, having the "hover" be stuck indefinitely?

I hope you enjoyed the latest in my constructor series. If you have any questions feel free to ask me on Twitter @JustrogDigiTec.

Sunday, June 14, 2015

Improving Reliability by Crashing

When we describe the reliability of a piece of software we probably apply traits like: never crashes, never hangs, is always responsive, and doesn't use a lot of memory. About the only pieces of software that meet all of these criteria are simple utilities, with years of development poured into fixing bugs and little or no improvement to the feature set. Maybe Notepad or Calc would come to mind.

We don't tend to spend much time in that software, though. Software has to have a lot of functionality and features before we allow ourselves to spend enough time there to really stress it out. But we do spend our time in large, complex programs. In fact, you may be one of the people who spends around 70% of your time in your web browser, whether it be Chrome, Safari, Firefox or Internet Explorer. Likely nobody considers these programs to be very reliable. Yet we conduct business in them, file our taxes, pay our bills, connect to our banks. We complain about how often they crash and how much memory they consume, but fail to recognize how many complex behaviors they accomplish for us before going down in a ball of flames.

So this article is about web browsers, and more specifically Internet Explorer. It's about a past of expectations and assumptions. And it's about a future where crashing more means crashing less. Hopefully you find that as intriguing as I do ;-)

The Past - Reliability == 100% Up-Time

The components that make up Internet Explorer run in some pretty extreme environments. For instance, WinInet and Urlmon, our networking stack, run in hundreds of thousands of third-party applications, and they power a large portion of the world's networking. And due to their history they are heavily extensible.

Having so many consumers, you'd imagine that every crash we fix and every bit of error recovery we put in would lead to code which is highly robust to failure. That this robustness would mean all of those hundreds of thousands of applications have the same robustness that we do. That they listen to every error code we return and take immediate and responsible action to provide a crashless and infallible user experience. And here I interject that you should vote on the <sarcasm> tag for HTML 5.1 so we can properly wrap comments like those I just made.

The reality is they themselves are not robust. No error checking, no null checking, continuing through errors, catching exceptions and continuing; so many bad things that we can't even detail them all, though gurus like Raymond Chen have tried. But at least we didn't crash, and this made the world a better place. We provide the unused ability to be robust, at great expense to our own code.

To build a robust component you can either start with principles that enable it, such as two-phase commit, or go through increasingly more expensive iterations of code hardening. Let's talk about each of these.

Two-Phase Commit

To keep the definition simple: you first acquire all the resources you'll need for the operation. If that succeeds, you commit the transaction. If it fails, you roll back the operation. You can imagine a lot of software and algorithms aren't really built with this idealistic view in mind. However, the model is proven and used extensively in databases, financial software, distributed systems and networking.

It is a huge tax though. And it only works if you can reasonably implement a commit-request phase. In this phase you ask all sub-systems to acquire all of the memory, stack or network resources that they might need to complete the task. If any of them fails, you don't perform the operation. Further, you have to implement a roll-back mechanism to give back any resources that were successfully acquired.
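
As a toy illustration (the "resources" here are just buffers; real implementations are far more involved):

    #include <algorithm>
    #include <cstddef>
    #include <memory>
    #include <new>

    // Toy two-phase commit: acquire every resource that could fail up
    // front (the commit request), then do only infallible work (the commit).
    bool DoOperation(std::size_t bytesA, std::size_t bytesB) {
        // Phase 1: commit request. Acquire everything or nothing.
        std::unique_ptr<char[]> a(new (std::nothrow) char[bytesA]);
        std::unique_ptr<char[]> b(new (std::nothrow) char[bytesB]);
        if (!a || !b) {
            return false;  // roll back: unique_ptr frees whatever was acquired
        }
        // Phase 2: commit. Nothing here can fail.
        std::fill_n(a.get(), bytesA, 0);
        std::fill_n(b.get(), bytesB, 0);
        return true;
    }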

In a browser with large sub-systems like layout and rendering, alongside author-supplied programs in the form of scripts and libraries, such a system is prohibitively expensive. While some aspects of the browser could be mapped onto the two-phase commit model, how could you orchestrate the intersections and compositions of all the possible ways those sub-systems come together to create transactions? Good thing we have another model which might work ;-)

Hardening

Hardening is detecting a state that would cause an unrecoverable failure in your code and employing a corrective measure to avoid that state. In the simplest form, a state that can cause your program to hit an unrecoverable failure would be an OOM, or Out Of Memory, which in turn propagates a null pointer back through your code. It could also throw, in which case your model changes to RAII and some form of exception handling instead of error-state propagation.

With this type of programming, the hardening is done through propagation of error codes. In COM, we use HRESULTs for this. When memory fails to allocate we use E_OUTOFMEMORY, so we have to turn allocator failures into this error code. But in addition you have to initialize objects, so you end up with allocator methods that both have pointers to return and can also fail with more than one error code, something other than just E_OUTOFMEMORY. Also, once the error codes are introduced they propagate through your function definitions, and many functions must all of a sudden change their signatures. I've coded what I think is the most BASIC form of this, which handles just a couple of the initial failures that you would run into, and it is still almost 100 lines of code.

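A compressed sketch of the shape of that code, with a hypothetical Widget that owns a buffer and a child Gadget (the full version, with every caller updated, is what runs to nearly 100 lines):

    #include <new>
    #include <windows.h>  // HRESULT, S_OK, E_OUTOFMEMORY, FAILED

    class Gadget {
    public:
        static HRESULT Create(Gadget** ppGadget) {
            *ppGadget = new (std::nothrow) Gadget();
            return (*ppGadget != nullptr) ? S_OK : E_OUTOFMEMORY;
        }
    };

    class Widget {
    public:
        // The allocator method has a pointer to return AND an error code,
        // since initialization can fail in more ways than E_OUTOFMEMORY.
        static HRESULT Create(Widget** ppWidget) {
            *ppWidget = nullptr;
            Widget* pWidget = new (std::nothrow) Widget();
            if (pWidget == nullptr) return E_OUTOFMEMORY;
            HRESULT hr = pWidget->Initialize();
            if (FAILED(hr)) {
                delete pWidget;  // roll back the partial construction
                return hr;
            }
            *ppWidget = pWidget;
            return S_OK;
        }

        ~Widget() { delete _pGadget; delete[] _buffer; }

    private:
        HRESULT Initialize() {
            HRESULT hr = Gadget::Create(&_pGadget);
            if (FAILED(hr)) return hr;  // propagate the child's failure
            _buffer = new (std::nothrow) char[1024];
            if (_buffer == nullptr) return E_OUTOFMEMORY;
            return S_OK;
        }

        Gadget* _pGadget = nullptr;
        char* _buffer = nullptr;
    };

    // And every caller's signature changes to make room for the HRESULT:
    HRESULT DoSomething() {
        Widget* pWidget = nullptr;
        HRESULT hr = Widget::Create(&pWidget);
        if (FAILED(hr)) return hr;
        // ... use pWidget ...
        delete pWidget;
        return S_OK;
    }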

Can you make this better? Well yes, with some structure you can. You can use a pattern known as RAII to handle the clean-up cases more elegantly and automatically. You can also use exceptions with RAII to protect sub-trees of code from being turned into error propagators. You have to augment this with actual exceptions that can be thrown for each error case, but that is rather trivial.
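
In sketch form, assuming Initialize is rewritten to throw instead of returning an HRESULT:

    #include <memory>

    class Widget {
    public:
        void Initialize() {
            _buffer = std::make_unique<char[]>(1024);  // throws std::bad_alloc on OOM
        }
    private:
        std::unique_ptr<char[]> _buffer;
    };

    // The failure paths disappear from the source: any throw unwinds
    // through here and unique_ptr releases whatever was acquired.
    std::unique_ptr<Widget> CreateWidget() {
        auto widget = std::make_unique<Widget>();
        widget->Initialize();
        return widget;
    }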

In terms of the callers, you'll need to ensure that there is always someone to catch your thrown exception. Part of our goal in using RAII + exceptions is to avoid code that handles errors. If we find that we are often introducing try/catch blocks, then the amount of code increases and we find that we are still spending much of our time implementing error handling and recovery logic.

At some point the argument of cleanliness or readability comes to bear, and whether you use method signatures or exceptions becomes a matter of style or preference. Suffice it to say, having looked at very large codebases employing all forms and variations, the percentage of code you write specifically to harden and recover is about the same either way.

Stress

How do we know what and when to harden? Well, we commit acts of violence against our own code in the form of stress. We randomly fail allocations at arbitrary locations in the code. We spin up threads that take locks at various times to change the timing of the threads which actually need them. When 1 of something would suffice, we do 10 instead. We do things in parallel that are normally done synchronously. All of these induce failures that a user may or may not see during actual usage, but since stress is one of the few mechanisms for exercising millions of lines of error recovery code, we have to fix each and every failure to unblock the next one.
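
In its simplest form, allocation fault injection is just a shim (the one-in-a-thousand rate here is arbitrary):

    #include <cstddef>
    #include <cstdlib>

    // Stress shim: randomly fail a small fraction of allocations so that
    // the error recovery paths actually get exercised.
    void* StressAlloc(std::size_t size) {
        if (std::rand() % 1000 == 0) {
            return nullptr;  // induced failure, as if memory were exhausted
        }
        return std::malloc(size);
    }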

Just like static analysis tools, stress can produce more bugs than a team could reasonably handle. You can also create states that are impossible to debug and trace back, depending on when the original failure occurred that led to the final crash. The more you harden your code, the less likely you are to be able to discover the path that led to the failure you are seeing. After all, many hours of run time could mean hundreds of thousands of recovered errors led you to where you are, any of which could have induced the state that caused your ultimate failure. Pretty tricky, huh?

Once the product is released, the users will also stress it and you'll get more real-world usage. Those crashes will be even less actionable than your stress crashes, since you won't be able to log or record how you got somewhere, and a user could have hit the condition after many days of use. You also can't always upload the entire crash state, depending on the user's preferences.

Basically, hardening and stress follow a curve that very quickly stops paying off for the development team. This can actually work well for a waterfall approach, since you will get the most benefit and find the most impactful bugs early in your stress runs, and they'll become increasingly less important the closer you get to shipping. Any truly "need to fix" bugs will still bump up via hit-counts and can be correlated with user data. As a developer, this drives me crazy as I stare at the endless ocean of crashes that I know are issues but will never be fixed due to their hit-counts.

Finding an Alternate Failure Model

So we know that two-phase commit is probably too expensive, or even impossible, in some large software projects. We also know that hardening and stress testing to increase reliability has its limits as well. It eventually leads to code which has few invariant conditions and simply checks for every possible thing. This is the type of code where you find a now-infamous line that will leave you scratching your head every time:
if (this == nullptr) return; // Don't crash if we get passed a bad pointer
That is one of the final incarnations of hardening in your code: people calling instance methods on null pointers, and you protecting all callers because you can't be bothered to go fix them all.

This brings us to a juncture where we can begin to explore another failure model. It needs to improve on some of the deficiencies of hardening while at the same time helping us achieve equal or greater reliability numbers. So what do we want to get rid of?

  1. Reduce lines of code dedicated to checking and propagating errors. We found that upwards of 10% of the lines of code in a project could be just failure recovery. We also found that these lines were often not covered by our existing test cases, and you can't use stress to determine code coverage since the cost is prohibitive.
  2. Allow developers to establish invariant conditions in the code. If you can't even say your this pointer is guaranteed to be non-null, then you have some problems. But why stop there? Why not also be able to state that certain members are initialized (hardening could leave partial initialization in play), or that a component you depend on ONLY returns valid values in the configurations you care about?
  3. Make stress bugs about product issues rather than extreme environmental conditions. Bugs become immediately actionable because failure occurs when failure is first seen, and stress and user-reported crashes can be strongly correlated.
The failure model that meets these requirements is fail-fast. It is almost the opposite of hardening. You have almost no error recovery code. You can offer almost no guarantee to your host that you won't crash them. The past, 100% up-time, is gone. Users and stress alike crash fast and crash early, and if those crashes are prominent they get fixed by improving our understanding of the code. Our unit tests exercise all possible code paths because there are fewer of them. Our code has invariant conditions, so when others use us incorrectly they are immediately alerted.

Seems fun, but won't that lead to LESS reliability, MORE crashes, and MORE unhappy customers?

Crash Less by Crashing More

The principle behind fail-fast is that you will crash less once you fix your first few waves of induced crashes. The following steps are a guide to implementing fail-fast on an existing code base. It's a refactoring tutorial, if you will. And the fall-out from each step is also clearly explained.
  1. Create a mechanism for terminating your application reliably which preserves the state necessary for you to debug and fix the reason for termination. For now I will call this abandonment, since it is a term we use. We also use the verb induce to describe the process by which a component requests abandonment. A sketch of such a mechanism follows this list. [This stage creates only opportunity, and no fall-out]
  2. Upgrade your memory allocation primitives to induce abandonment. This is most likely your worst offender of error recovery code, bar none, and all of those null checks spread all over your code are definitely not helping you. If you are running with a throwing new you might be in better shape ;-) [This stage will likely be painful. You'll find places where your system was getting back null even though there was plenty of memory. You'll find places where you allocated things the wrong size because of bad math. Fix them!]
  3. Work from the leaves and remove error recovery initiators in favor of new abandonment cases. You can introduce new mechanisms of abandonment to collect the right information so you again have enough to debug. [This stage will be less painful. For every 100 conversions you'll find 1 or 2 gross bugs where things were failing and recovering but creating customer-facing bugs. Now they create crashes you can fix instead.]
  4. Work your way outward and fix all of the error propagators. If they have no errors to propagate, this is easy. If there are still error propagators that they call, you can induce abandonment on unexpected error codes. This can help you quickly understand whether entire code regions are ready for improvement: if a root function never receives a propagated error, it likely means none of the children really generate them. [By this stage you should already be generating fewer crashes during stress than you did while hardening. It seems counter-intuitive, but simpler code with fewer conditions that is heavily tested is just more reliable.]
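
A minimal sketch of what steps 1 and 2 might look like on Windows (the names are illustrative, not EdgeHTML's actual mechanism):

    #include <cstddef>
    #include <new>
    #include <windows.h>

    // Step 1 (sketch): a reliable termination point. RaiseFailFastException
    // terminates the process immediately, preserving state for the dump.
    [[noreturn]] void InduceAbandonment() {
        RaiseFailFastException(nullptr, nullptr, 0);
    }

    // Step 2 (sketch): allocation primitives stop propagating null and
    // induce abandonment instead, so callers never check for null again.
    void* AllocateOrAbandon(std::size_t size) {
        void* p = ::operator new(size, std::nothrow);
        if (p == nullptr) {
            InduceAbandonment();
        }
        return p;
    }
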
I work on a huge code base, and our experience with fail-fast, in just a single release, has yielded an EdgeHTML which is nearly twice as reliable as its counterpart, MSHTML. That is pretty impressive, and is based on data from our stress infrastructure. We have other telemetry which paints a similar story for the user-facing experience.

End users may actually see more crashes up front, while we get a handle on those things that stress has missed. We had over 15 years of hardening to account for, so we are in the infancy of reintroducing our invariant conditions and converting code through stages 3 and 4 above. Each crash we get from a user is deep insight into an invariant condition, to be understood and fixed in a way that further improves the system. In the old world that crash would have meant a serpentine analysis of logic and code flow through multiple robust functions all gracefully handling the error condition until we found the one that didn't, patching it, and creating a new serpentine path for the next crash.

I've converted the first snippet into the equivalent fail-fast code to show you the differences. It also gives some insight into how much code and commenting disappears with the model. Note, we didn't have any real control flow in our example, but fail-fast doesn't mean control flow disappears. Functions that return different states continue to do so. Those that only return errors in extreme failure cases move to inducing abandonment.
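
Sketching that conversion against the earlier hypothetical Widget and Gadget, and assuming operator new is wired to induce abandonment on OOM rather than return null:

    class Gadget {
    public:
        static Gadget* Create() { return new Gadget(); }  // never returns null
    };

    // Fail-fast version: Create cannot return null and Initialize cannot
    // fail, so the HRESULTs, null checks, and roll-back code all vanish.
    class Widget {
    public:
        static Widget* Create() {
            Widget* pWidget = new Widget();  // OOM induces abandonment in the allocator
            pWidget->Initialize();           // void: failure abandons, not propagates
            return pWidget;                  // invariant: never null
        }

    private:
        void Initialize() {
            _pGadget = Gadget::Create();     // same invariant, one level down
            _buffer = new char[1024];
        }

        Gadget* _pGadget = nullptr;
        char* _buffer = nullptr;
    };

    // Callers get simpler too: no HRESULT, no null checks, no signature churn.
    void DoSomething() {
        Widget* pWidget = Widget::Create();
        // ... use pWidget ...
        delete pWidget;
    }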