Saturday, February 27, 2016

Improving Web Performance with Data Science and Telemetry

This post is about our ongoing commitment to using telemetry, data science, testing and solid engineering to achieve great results for the web. Over the course of two prior posts we've covered a 3 day hackathon in which we built DOM profiling data into the core of our browser to figure out how web authors really use our stuff. I then circled back on how we took the hack and built it into a world class telemetry pipeline, at web scale, so we could validate our local findings on the entire web ecosystem. Now I'm going to walk you through how we took a single, unexpected insight and turned it into a great product improvement.

The Insight

Insight #2: Our crawler data only had about a 60-70% overlap with our live data. This meant that what people do on the web changes quite a bit between their initial navigates and when they start to interact with the page. Our crawler was blind to big sites where people spend a lot of time and do a lot of interactions. All of those interactive scenarios were only "marginally" hit by the crawler.
This means that some APIs not on our performance optimization list started to jump up the list and became important for our team. We also started to extrapolate use cases from the data we were seeing. As an immediate example, APIs like setTimeout started to show up more since that is how dynamic pages are written. requestAnimationFrame was the same. All of the scheduling APIs moved up the list a bit when we considered the live data and presented differently than the crawler did. This was great news.
So this insight is taken from our follow-up on building the telemetry pipeline which could sample profile the entire population of EdgeHTML users to figure out the key APIs we should be focusing more testing, functionality and performance improvements on. We also hope to recommend API deprecation in the future as well, based on lack of hits on some set of APIs. That could prove huge.

How bad was setTimeout in terms of global usage? Well, it looked to be close to 15% of the time scripts were spending in the DOM (numbers are approximate). It was so heavyweight that it could be up to 2x the next closest API, which tends to be something related to layout such as offsetWidth or offsetHeight. This data was quite interesting but also very confusing. While the call counts were exceptionally high, there were other APIs that were also just as high. Also, the call counts on offsetWidth and offsetHeight were also no joke.

Once we knew that a problem existed it was time to figure out why and what we could do. There were two approaches, the first of which was to improve the telemetry. In addition to time spent in the API we decided to collect the timeout times and information about parameter usage. Our second approach was to write some low level tests and figure out the algorithmic complexity of our implementation vs other browsers. Was this a problem for us or was it a problem for everyone?

The Telemetry


Telemetry is never perfect on the first go round. You should always build models for your telemetry that take into account two things. First, you are going to change your telemetry and fix it. Hopefully very quickly. This will change your schema which leads to the second thing to take into account. Normalization across schemas and a description of your normalization process is critical if you want to demonstrate change and/or improvement. This is most telemetry, but not all. You could be doing a one shot to answer a question and not care about historical or even future data once you've answered your question. I find this to be fairly and if I build temporary telemetry I tend to replace it later with something that will stand the test of time.

The first bit of telemetry we tried to collect was the distribution of time ranges. The usage of 0, 1-4, 4-8, etc... to figure out how prominent each of the timeout and intervals ranges was going to be. This got checked in relatively early in the process, in fact I think it may be having a birthday soon. This told us that storage for timers needed to be optimized to handle a LOT of short term timers, meaning they would be inserted and removed from the timeout list quite often.

Once we started to see a broad range in the call times to setTimeout though we wanted to know something else. The API takes a string or function. A function doesn't have to be parsed, but a string does. We optimize this already to parse on execute, so we knew that string vs function wasn't causing us any differences in time to call setTimeout for that reason, however, the size of the structure to hold these different callback types was impacted, as was the time it takes to stringify something from JavaScript into a more persistent string buffer for parsing.

We also, per spec, allow you to pass arguments to the callback. These values have to be stored, maintained by GC relationships to the timeout, etc... This is yet more space. We knew that our storage was sub-optimal. We could do better, but it would cost several days of work. This became another telemetry point for us, the various ranges of arguments.

This telemetry is likely in your build of Windows if you are an Insider running RS 1. So you are helping us make further improvements to these APIs going forward. Telemetry in this case is about prioritization of an unlimited amount of work we could do based on real world usage. It helps ensure that customers get the best experience possible even if they don't understand the specifics of how browsers implement various HTML 5 APIs.

The Tests


Sometimes it is just faster to write a test and explore the space. After an hour of brainstorming and guessing (yes computer scientists like to imagine how the code should work and then make guesses about why it doesn't work that way, it is one of our most inefficient flaws ;-) we decided to write some tests. This is where Todd Reifsteck (@ToddReifsteck) stepped in and built a very simple, but compelling test to show the queueing and execution overhead across increasingly large workloads. Original GitHub File or run it from RawGIT here, though I've embedded it as a Gist below.

This test showed that Internet Explorer was using a back-end that demonstrated quadratic performance as the number of timers grew. EdgeHTML inherited this same implementation and though we knew about the problem, it had been reported by several customers, it was not actually a problem on the web. It was more of a website bug that led to very large numbers of timers being present and eventually the browser slowed down.

More telling though was that even at very low numbers EdgeHTML was some ways from Chrome and FireFox. This meant that not just our storage issues were in question, but we also had some nasty overhead.

The Numbers


When we first collected the numbers it was very obvious that something needed to be fixed. The legacy implementation had clearly been optimized for a small number of timeouts and our algorithms had significant warm-up costs associated with the 10 timer loop that weren't present in other browsers.

Internet Explorer/Microsoft Edge (Before)
Iterations Scheduling Time Execution Time Comments
10 8.5ms 8.5ms Warm Up
100 0.7ms 1.2ms
1000 6.4ms 28ms
10000 322ms 1666ms

Chrome (32-bit)
Iterations Scheduling Time Exec Time
10 0.01ms 0.01ms
100 0.39ms 0.5ms
1000 3.6ms 5.3ms
10000 38ms 45ms

From the list you can clearly see that Chrome has pretty much a linear ramp in both scheduling and execution. They also have no warm-up associated costs and seems almost ridiculously fast for the 10 iterations case.

It turns out that the quadratic algorithm was impacting both scheduling and execution. It was precisely the same helper code that was being called in both circumstances. Eliminating this code would be a huge win. We also had more than one quadratic algorithm so there was a hefty constant (you normally ignore the 2 in 2n^2, but in this case it was important, since to get not n^2 both of the algorithms had to be removed).

Removing those and we were within the ballpark of Chrome before we accounted for our compiler optimizations (PGO or Profile Guided Optimization) which would take weeks to get. We wanted to know we were close so we looked for a few more micro-optimizations and there were plenty. Enough that we now have a backlog of them as well.

Microsoft Edge (64-bit)
Iterations Scheduling Time Exec Time Comments
10 0.28ms 0.06ms Warm Up
100 0.22ms 0.46ms
1000 1.8ms 4.4ms
10000 16.5ms 70.5ms GC

We still have some sort of warm-up costs and we have a weird cliff at 10000 timers during execution where the GC appears to kick in. I can't do the same analysis for FireFox, but they also have some weird cliffs.

FireFox
Iterations Scheduling Time Exec Time
10 0.41ms 0.05ms
100 0.90ms 0.24ms
1000 7.75ms 2.48ms
10000 109ms 21.2ms

Scheduling they seem to be about 2x Chrome and Edge is now 0.5x Chrome (Caveat: My machine, Insider Build, etc...) They clearly have less overhead during execution of the callbacks though, generally coming in at 0.5x Chrome on that front. And they have something similar to Edge when they first start to run timers.

Conclusion


Hopefully this articles ties together all of our previous concepts of telemetry, data science and prioritization and gives you an idea of how we are picking the right things to work on. Preliminary telemetry from the build on which I collected the numbers above do indeed show that Microsoft Edge has a marked reduction in the time spent scheduling timers. This is goodness for everyone who uses Edge and authors who can now rely on good stable performance of the API even if they are slightly abusing it with 10k timers ;-)

In my teaser I noted some extreme improvements. Using the slightly rounded numbers above and the worst cases Edge is now 3-30x faster at scheduling timers depending on how many timers are involved. For executing timers we are anywhere from 3-140x faster.

With this release all modern browsers are now "close enough" that it doesn't matter anymore to a web page author. There is very little to be gained by further increasing the throughput of more empty timers in the system. Normally a timer will have some work to do and that work will dwarf the amount of time spent scheduling and executing them. If more speed is necessary, we have a set of backlog items that would squeeze out a bit more performance that we can always toss on the release. However, I recommend writing your own queue in JavaScript instead, it'll be much faster.

Sunday, February 21, 2016

Dissecting the HTML 5 Event Loop - Loops, Task Queues and Tasks

I'm hoping that in this article I can introduce the general audience of web developers to the concepts in HTML 5.1 called Event Loops (whatwg living standard, w3c old standard).

When taken in isolation event loops seem fairly natural. They are just a series of tasks, maybe categorized into groups and then executed in order. Many modern languages or task frameworks have these concepts baked deeply in now, allowing for some mixture of the concepts built into the HTML 5 event loops.

Let's start with a basic overview of the event loops section and then see if we can draw out some diagrams that roughly match what we see in the specification.

The first part of the specification tells us how integral event loops are. They are as important or as intrinsic to the system as are threads and processes to the Windows operating system.
To coordinate events, user interaction, scripts, rendering, networking, ... must use event loops ...
I like the choice of words here. First, note coordinate. That doesn't mean execute arbitrarily or even in order. It means exactly what it says, coordinate. This implies we may do some prioritization or reordering. This is actually key to browser performance but it is also one of the reasons for our differences or even differences on the same browser across different timing conditions.

We also get an idea of the types of tasks we'll be coordinating. Events, user interactions, scripts, etc... All of these are tasks in the system. They are clearly labelled in the specification as tasks. Well, sometimes, like in ES 6 we instead call them jobs and they can take on some variety which we'll get into later.

The final key bit is the must use piece at the end. This generally means that browsers won't agree if they go trying to decipher this section of the specification and come up with their own ideas. I will note there are numerous differences between the browsers today, mostly due to this spec and wording not even being in existence when the core event loops were written. I'm sure all 3 of Chrome, FireFox and Microsoft Edge had to bolt on HTML 5 event loops well before making them the core of their engine.
There are two kinds of event loops: those for browsing contexts, and those for workers.
Next we learn there are more than one kind of event loop. This is important. Likely these two types of event loops are going to share some basic concepts and then diverge, just a little bit, from one another in the more specific details.

The two kinds of event loops are for browsing contexts and workers. In our case browsing contexts are documents or page. I'll avoid their usage from now on. Just think of it as your HTML content. The reference to workers refers to normal dedicated workers, shared workers (deprecated, kinda) and service workers.
... at least one browsing context event loop ... at most one per unit of related similar-origin browsing contexts
This just tells us there is only one event loop and that event loops can be shared across browsing contexts of similar origin. I actually believe that most browsers substitute similar origin with thread. Such that all of your iframes, even in different origins, are in the same event loop. This is at least how Microsoft Edge and EdgeHTML work.
A browsing context event loop always has at least one browsing context. ... browsing contexts all go away, ... event loop goes away ...
This tells us that the lifetime of the event loop is tied to its list of browsing contexts. The event loop holds onto tasks and tasks hold onto other memory so this part is a bit of rational technical advice on how to shut down and clean up memory when all of the pages the user is viewing go away, aka closing the tab.

The next bit is on workers which we'll skip for now. Because workers don't have a one to many relationship between the event loop and the browsing contexts, they can use a more thread like lifetime model.
An event loop has one or more task queues. A task queue is an ordered list of tasks, which are algorithms that are responsible for such work as ...
Here we learn another concept called the task queue which we find there is one or more of hanging off of the event loop. It turns out the event loop does not execute tasks, instead it relies on a task queue for this instead. We learn that each task queue is ordered but we've not yet gotten a hint as to what this means. For now, let's assume insertion order is in play here since the specification has not said otherwise.

We also learn that a task is an algorithm responsible for work. The next bit of the specification will simply being listing the types of work and the recommended task queues. I'll introduce this instead through a diagram and then fall back to the spec text to further describe these units of work.


The HTML 5 specification tries to describe at least 5 task queues and units of work. It does leave off one critical piece which I'll add into the list as #6. Let's step through each and see how closely we can sync the spec to something you'd likely deal with every day.

  1. Events - An event might be a message event sent via postMessage. I generally refer to these as async events and Edge implements this using an sync event queue style approach.
  2. Parsing - These are hidden to the web developer for now except where these actions fire other synchronous events or in later specifications where the web developer can be part of the HTML parsing stack.
  3. Callbacks - A callback is generally a setTimeout, setInterval or setImmediate that is then dispatched from the event loop when its time is ready. requestAnimationFrame is also a callback, but executes as part of rendering, not as a core task.
  4. Using a Resource - These are generally download callbacks. At least that is how I read the specification here. This would be your progress, onload, onreadystatechange types of events. The spec refers to fetch here, which now uses Promises, so this may be a bug in the specification.
  5. DOM Manipulation - This task queue probably relates to DOM mutation events such as DOMAttrModified. I think most browsers fire these synchronously (not as tasks). Also, these are events, so I believe that in the case of Microsoft Edge these will fire in task queue 1.
  6. Input - This is now a task queue that I'm adding in. Input must be delivered in order so it belongs to its own queue. Also, the specification allows for prioritizing input over all other task queues to prevent input starvation.
One thing to note is that the specification is very loose. While it started strong with a bunch of musts and requirements for how browser's implement the loop it then gets very weak recommending "at least one" task queue. It then describes a set of task queues which really doesn't map to the full range of existing tasks that a browser has to deal with. I think this is a spec limitation that we should remedy since as a browser vendor and implementer it prevents me from implementing new features that are immediately interoperable with other browsers.

I'm going to end the dissection here and then continue later with details on how a browser is supposed to insert a task, what are task sources, and what data is associated with a task. This will probably dive through the full execution of such a task and so will also include the definition of and execution for micro-tasks and micro-task checkpoints. Fun stuff, I hope you are as excited as I am ;-)