Saturday, February 27, 2016

Improving Web Performance with Data Science and Telemetry

This post is about our ongoing commitment to using telemetry, data science, testing and solid engineering to achieve great results for the web. Over the course of two prior posts we've covered a 3-day hackathon in which we built DOM profiling into the core of our browser to figure out how web authors really use our stuff, and then how we took that hack and built it into a world-class telemetry pipeline, at web scale, so we could validate our local findings against the entire web ecosystem. Now I'm going to walk you through how we took a single, unexpected insight and turned it into a great product improvement.

The Insight

Insight #2: Our crawler data only had about a 60-70% overlap with our live data. This meant that what people do on the web changes quite a bit between the initial navigation and when they start to interact with the page. Our crawler was blind to the big sites where people spend a lot of time and interact heavily; all of those interactive scenarios were only marginally hit by the crawler.
This meant that some APIs which weren't on our performance optimization list started to jump up the rankings and become important for our team. We also started to extrapolate use cases from the data we were seeing. As an immediate example, setTimeout started to show up more, since that is how dynamic pages are written, and requestAnimationFrame was the same. All of the scheduling APIs moved up the list a bit once we considered the live data, and they ranked differently than they had in the crawler. This was great news.
This insight came from our follow-up work building the telemetry pipeline, which could sample-profile the entire population of EdgeHTML users to figure out the key APIs on which we should focus more testing, functionality and performance work. Based on a lack of hits on some set of APIs, we also hope to recommend API deprecations in the future. That could prove huge.
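
To make the crawler-versus-live gap concrete, here is the kind of interaction-driven scheduling code a load-and-snapshot crawler never executes but live users trigger constantly. This is a hypothetical page pattern for illustration, not something lifted from our data:

```javascript
// Hypothetical pattern: nothing below runs until the user clicks, so a
// crawler that only loads the page and snapshots it never hits these APIs.
let contentReady = false;
setTimeout(() => { contentReady = true; }, 500); // simulate a lazy fetch finishing

document.addEventListener('click', () => {
  // Short, repeated timers like this dominate live traces.
  setTimeout(function poll() {
    if (!contentReady) setTimeout(poll, 16);
  }, 16);

  // Per-frame animation scheduling, another API that climbed the live list.
  let frames = 0;
  requestAnimationFrame(function step() {
    if (++frames < 60) requestAnimationFrame(step);
  });
});
```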

How bad was setTimeout in terms of global usage? It looked to be close to 15% of the time scripts were spending in the DOM (numbers are approximate). It was so heavyweight that it could be up to 2x the next closest API, which tends to be something layout-related such as offsetWidth or offsetHeight. This data was quite interesting but also very confusing: while setTimeout's call counts were exceptionally high, other APIs were just as high, and the call counts on offsetWidth and offsetHeight were no joke either.

Once we knew that a problem existed, it was time to figure out why and what we could do about it. There were two approaches. The first was to improve the telemetry: in addition to time spent in the API, we decided to collect the requested timeout durations and information about parameter usage. The second was to write some low-level tests and figure out the algorithmic complexity of our implementation versus other browsers. Was this a problem for us, or was it a problem for everyone?

The Telemetry


Telemetry is never perfect on the first go-round. You should always build models for your telemetry that take two things into account. First, you are going to change your telemetry and fix it, hopefully very quickly. That will change your schema, which leads to the second thing: normalization across schemas, along with a description of your normalization process, is critical if you want to demonstrate change and/or improvement. This describes most telemetry, but not all. You could be doing a one-shot to answer a question and not care about historical or even future data once you've answered it. I find that to be fairly rare, and if I build temporary telemetry I tend to replace it later with something that will stand the test of time.
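
As a toy illustration of that normalization step (the field names and shapes here are invented, not our actual schema), the idea is to map every schema version into one common record before any trend analysis happens:

```javascript
// Toy example: two telemetry schema versions for timer usage, normalized into
// one shape so trends can be compared across the schema change.
// (Field names are invented for illustration; this is not the real schema.)
function normalizeTimerRecord(record) {
  if (record.version === 1) {
    // v1 only reported a total call count.
    return { calls: record.timeoutCalls, buckets: null };
  }
  // v2 added per-delay-range buckets; derive the old total so it stays comparable.
  const calls = record.buckets.reduce((sum, b) => sum + b.count, 0);
  return { calls, buckets: record.buckets };
}

// Usage: both versions flow through the same analysis code afterwards.
const v1 = { version: 1, timeoutCalls: 1200 };
const v2 = { version: 2, buckets: [{ range: '0', count: 800 }, { range: '1-4', count: 500 }] };
console.log(normalizeTimerRecord(v1), normalizeTimerRecord(v2));
```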

The first bit of telemetry we tried to collect was the distribution of requested delays: how often values in the ranges 0, 1-4, 4-8, etc... were used, to figure out how prominent each of the timeout and interval ranges was going to be. This got checked in relatively early in the process; in fact I think it may be having a birthday soon. It told us that storage for timers needed to be optimized to handle a LOT of short-lived timers, meaning they would be inserted into and removed from the timeout list quite often.
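
A rough sketch of that kind of bucketing is below; the real counters live inside the engine, and the higher ranges here are made up to round out the example:

```javascript
// Illustrative only: classify a requested delay into a histogram bucket.
function delayBucket(ms) {
  if (ms <= 0) return '0';
  if (ms <= 4) return '1-4';
  if (ms <= 8) return '4-8';
  if (ms <= 16) return '8-16';   // higher ranges are hypothetical
  if (ms <= 100) return '16-100';
  return '100+';
}

const histogram = {};
for (const delay of [0, 0, 1, 3, 10, 250]) {   // sample delays
  const bucket = delayBucket(delay);
  histogram[bucket] = (histogram[bucket] || 0) + 1;
}
console.log(histogram); // { '0': 2, '1-4': 2, '8-16': 1, '100+': 1 }
```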

Once we started to see a broad range in the call times to setTimeout, though, we wanted to know something else. The API takes either a string or a function. A function doesn't have to be parsed, but a string does. We already optimize this by parsing on execute, so we knew string versus function wasn't causing any difference in the time to call setTimeout for that reason. However, the size of the structure holding these different callback types was affected, as was the time it takes to copy a string from JavaScript into a more persistent string buffer for later parsing.
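
For reference, the two callback forms look like this from the page's side; the string form has to be held as source text and compiled before it can run, while the function form is already a callable object:

```javascript
// String form: the text is stored with the timer and parsed when it executes.
setTimeout("console.log('string callback fired')", 100);

// Function form: only a reference to an existing function object is stored.
setTimeout(() => console.log('function callback fired'), 100);
```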

We also, per spec, allow you to pass arguments to the callback. These values have to be stored, kept alive through GC relationships to the timeout, etc... That is yet more space. We knew our storage was sub-optimal; we could do better, but it would cost several days of work. So this became another telemetry point for us: the distribution of argument counts.
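
Concretely, that part of the spec looks like this from the caller's side; every extra argument is something the engine has to hold onto until the timer fires:

```javascript
function greet(name, punctuation) {
  console.log('Hello, ' + name + punctuation);
}

// The arguments after the delay are stored with the timer and passed to the
// callback when it fires, roughly 100ms later: "Hello, web author!"
setTimeout(greet, 100, 'web author', '!');
```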

This telemetry is likely in your build of Windows if you are an Insider running RS1, so you are already helping us make further improvements to these APIs going forward. Telemetry in this case is about prioritizing an effectively unlimited amount of potential work based on real-world usage. It helps ensure that customers get the best experience possible even if they don't understand the specifics of how browsers implement various HTML5 APIs.

The Tests


Sometimes it is just faster to write a test and explore the space. After an hour of brainstorming and guessing (yes, computer scientists like to imagine how the code should work and then make guesses about why it doesn't work that way; it is one of our most inefficient flaws ;-) we decided to write some tests. This is where Todd Reifsteck (@ToddReifsteck) stepped in and built a very simple but compelling test showing the queueing and execution overhead across increasingly large workloads. Original GitHub File or run it from RawGit here, though I've embedded it as a Gist below.
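
The real test is the one linked above; the sketch below is only meant to convey its shape, measuring how long it takes to schedule N timers and then how long the callbacks take to drain once they start firing:

```javascript
// Sketch of the benchmark's shape (not the original test): time the
// setTimeout calls themselves, then time the callback drain separately.
function measureTimers(n, done) {
  const result = { iterations: n, schedulingMs: 0, executionMs: 0 };
  let fired = 0;
  let execStart = 0;

  const scheduleStart = performance.now();
  for (let i = 0; i < n; i++) {
    setTimeout(() => {
      if (fired === 0) execStart = performance.now();
      if (++fired === n) {
        result.executionMs = performance.now() - execStart;
        done(result);
      }
    }, 0);
  }
  result.schedulingMs = performance.now() - scheduleStart;
}

measureTimers(10000, (r) => console.log(r));
```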

This test showed that Internet Explorer was using a back-end with quadratic performance as the number of timers grew. EdgeHTML inherited that same implementation, and though we knew about the problem (it had been reported by several customers), it was not actually a widespread problem on the web. It showed up mostly as a website bug that led to very large numbers of timers being present, at which point the browser slowed down.

More telling, though, was that even at very low timer counts EdgeHTML was some way behind Chrome and Firefox. This meant that our storage wasn't the only thing in question; we also had some nasty per-call overhead.

The Numbers


When we first collected the numbers it was very obvious that something needed to be fixed. The legacy implementation had clearly been optimized for a small number of timeouts, and our algorithms had significant warm-up costs in the 10-timer loop that weren't present in other browsers.

Internet Explorer/Microsoft Edge (Before)
Iterations | Scheduling Time | Execution Time | Comments
10         | 8.5ms           | 8.5ms          | Warm Up
100        | 0.7ms           | 1.2ms          |
1000       | 6.4ms           | 28ms           |
10000      | 322ms           | 1666ms         |

Chrome (32-bit)
Iterations | Scheduling Time | Execution Time
10         | 0.01ms          | 0.01ms
100        | 0.39ms          | 0.5ms
1000       | 3.6ms           | 5.3ms
10000      | 38ms            | 45ms

From the table you can clearly see that Chrome has pretty much a linear ramp in both scheduling and execution. It also has no warm-up cost and is almost ridiculously fast in the 10-iteration case.

It turns out that the quadratic algorithm was impacting both scheduling and execution; it was precisely the same helper code being called in both circumstances, so eliminating it would be a huge win. We also had more than one quadratic algorithm, so there was a hefty constant in play (you normally ignore the 2 in 2n^2, but in this case it mattered: fixing only one of the two algorithms would still leave an n^2 term, so both had to be removed).
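
A simplified illustration of how that plays out (this is not the engine's actual code, just the growth pattern): if each insert walks an already-sorted timer list, scheduling n timers costs on the order of n^2 steps, and doing a similar walk again at execution time adds the second n^2 term.

```javascript
// Not the real implementation, just the shape of the problem: a linear scan
// per insert makes the total cost of scheduling n timers roughly
// 1 + 2 + ... + n, i.e. about n^2 / 2 steps.
function insertSorted(list, timer) {
  let i = 0;
  while (i < list.length && list[i].dueTime <= timer.dueTime) i++; // linear scan
  list.splice(i, 0, timer);
}

const timers = [];
for (let k = 0; k < 10000; k++) {
  insertSorted(timers, { dueTime: Math.random() * 1000 });
}
// If firing the timers walks the list the same way, that is the second
// quadratic path; fixing only one of the two still leaves an n^2 term.
```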

Removing those put us within the ballpark of Chrome even before accounting for our compiler optimizations (PGO, or Profile Guided Optimization), which would take weeks to land. We wanted to know we were close, so we looked for a few more micro-optimizations, and there were plenty. Enough that we now have a backlog of them as well.

Microsoft Edge (64-bit)
Iterations | Scheduling Time | Execution Time | Comments
10         | 0.28ms          | 0.06ms         | Warm Up
100        | 0.22ms          | 0.46ms         |
1000       | 1.8ms           | 4.4ms          |
10000      | 16.5ms          | 70.5ms         | GC

We still have some warm-up cost, and we have a weird cliff at 10000 timers during execution where the GC appears to kick in. I can't do the same analysis for Firefox, but they also have some weird cliffs.

Firefox
Iterations | Scheduling Time | Execution Time
10         | 0.41ms          | 0.05ms
100        | 0.90ms          | 0.24ms
1000       | 7.75ms          | 2.48ms
10000      | 109ms           | 21.2ms

On scheduling, Firefox seems to be about 2x Chrome, while Edge is now around 0.5x Chrome (caveat: my machine, an Insider build, etc...). Firefox clearly has less overhead during execution of the callbacks, though, generally coming in at about 0.5x Chrome on that front, and they show something similar to Edge's warm-up when they first start to run timers.

Conclusion


Hopefully this article ties together all of our previous themes of telemetry, data science and prioritization, and gives you an idea of how we pick the right things to work on. Preliminary telemetry from the build on which I collected the numbers above does indeed show that Microsoft Edge has a marked reduction in the time spent scheduling timers. This is goodness for everyone who uses Edge, and for authors, who can now rely on good, stable performance from the API even if they are slightly abusing it with 10k timers ;-)

In my teaser I noted some extreme improvements. Using the slightly rounded numbers above and comparing the worst cases, Edge is now 3-30x faster at scheduling timers depending on how many timers are involved (the 10-timer warm-up case alone drops from 8.5ms to 0.28ms). For executing timers we are anywhere from 3-140x faster (8.5ms down to 0.06ms in that same case).

With this release all modern browsers are now "close enough" that the difference no longer matters to a web page author. There is very little to be gained by further increasing the throughput of empty timers in the system; normally a timer has some work to do, and that work will dwarf the time spent scheduling and executing it. If more speed becomes necessary, we have a set of backlog items that would squeeze out a bit more performance and that we can always drop into a release. However, I'd recommend writing your own queue in JavaScript instead; it will be much faster. A sketch of that idea follows.
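
Here is a rough, illustrative-only sketch of a page-side queue that multiplexes many logical jobs onto a single real setTimeout, so the engine only ever tracks one timer:

```javascript
// Rough sketch: batch many logical callbacks behind one real timer.
const workQueue = [];
let pumpScheduled = false;

function queueWork(fn) {
  workQueue.push(fn);
  if (!pumpScheduled) {
    pumpScheduled = true;
    setTimeout(() => {
      pumpScheduled = false;
      // Drain everything queued so far; jobs queued while draining will
      // schedule a fresh pump on the next turn.
      const batch = workQueue.splice(0, workQueue.length);
      for (const job of batch) job();
    }, 0);
  }
}

// Usage: 10,000 queued jobs cost one setTimeout call instead of 10,000.
for (let i = 0; i < 10000; i++) queueWork(() => { /* ... work ... */ });
```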
