Friday, December 25, 2015

Web API and Feature Usage - From Hackathon to Production

I wanted to provide some details on how a short 3-day data science excursion has led to increased insights for myself, my team and eventually for the web itself.

While my original article focused on hardships we faced along the way, this article will focus more on two different topics that take much longer than 3 days to answer. The first topic is how you take your telemetry and deliver it at web scale and production quality. You can see older articles from Ars Technica that put Windows 10 at 110 million installs back in October. That is a LOT of scale. The second topic I want to discuss is the insights we can gather after the data has been stripped of PII (personally identifiable information).

I'll start with a quick review of things we had prior to Windows 10, things we released with Windows 10 and then of course how we released DOM API profiling in our most recent release. This latter bit is the really interesting part for me since it is the final form of my Hackathon project (though in full transparency, the planning for the DOM telemetry project preceded my hackathon by a couple of months ;-)

Telemetry over the Years

The concept of gathering telemetry to determine how you are doing in the wild is nothing new, and web browsers, operating systems and many other applications have been doing it for a long time. The largest scale telemetry effort (and probably the oldest) on Windows is likely still Watson. We leverage Watson to gain insights into application and OS reliability and to focus our efforts on finding and fixing newly introduced crashes and memory-related bugs.

For the browser space, Chrome has been doing something with use counters for a while. These are great, lightweight boolean flags that get set on a page and then recorded as a roll-up. This can tell you, across some set of navigations to various pages, whether or not certain APIs are hit. An API, property or feature may or may not be hit depending on user interaction, flighting, the user navigating early, which ads load, etc... So you have to rely on large scale statistics for smoothing, but overall this is pretty cool stuff that you can view on the chrome status webpage.

FireFox has recently built something similar and while I don't know exactly what they present, you can view it for yourself on their telemetry tracking page as well.

For Microsoft Edge, our telemetry over the years has been very nuanced. We started with a feature called SQM that allowed us to aggregate user information if they opted into our privacy policies. This let us figure out how many tabs you use on average, which UI features were getting clicks and a small set of other features. These streams were very limited in the amount of data we could send and so we were careful not to send up too much.

With Windows 10 we started to lean more on a new telemetry system based on ETW (Event Tracing for Windows), which gives us a very powerful platform, one we were already familiar with, to log events not just in our application but across the system. The major improvement here was how we extended our existing page load and navigation timings so that we could detect very quickly whether or not we had a performance problem on the web, without having to wait for users to file a bug for a site and then up-vote it using our various bug reporting portals.

Just doing something we already had in our back pockets from previous releases would have been boring though, so we decided that a boolean flag based structure, logged per navigation, would also give us a lot of extra data that we could use to determine feature popularity within the browser itself. Annotating every DOM API would be overkill for such an effort, given there are 6,300 of them in our case, of which nearly 3,000 are in general usage on the web, so we instead saved this approach for larger feature areas and for exploring specific APIs in depth. This functionality shipped in our initial release of Windows 10 and we've been steadily adding more and more telemetry points to this tracker. Many of these are mirrored in the Chrome data, but many are more specific to our own operations or to features that we might want to try and optimize in the future.
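To make that concrete, here is a minimal sketch of what a per-navigation boolean flag roll-up might look like. The feature names are invented and the printf stands in for the real telemetry pipeline, so this is illustrative rather than our actual implementation:

    #include <bitset>
    #include <cstdint>
    #include <cstdio>

    // Hypothetical feature identifiers; the real tracker covers far more areas.
    enum class FeatureId : uint32_t { CssGridUsed, WebAudioUsed, PointerEventsUsed, Count };

    class FeatureUsageTracker
    {
    public:
        // Called from the binding layer whenever a tracked feature is exercised.
        void MarkUsed(FeatureId id) { m_flags.set(static_cast<size_t>(id)); }

        // Called once per navigation: roll the bits up into a single record, then reset.
        void FlushForNavigation()
        {
            // printf stands in for the real telemetry event here.
            std::printf("feature-usage roll-up: 0x%llx\n",
                        static_cast<unsigned long long>(m_flags.to_ullong()));
            m_flags.reset();
        }

    private:
        std::bitset<static_cast<size_t>(FeatureId::Count)> m_flags;
    };

    int main()
    {
        FeatureUsageTracker tracker;
        tracker.MarkUsed(FeatureId::PointerEventsUsed);
        tracker.FlushForNavigation(); // one compact roll-up per navigation
    }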

This puts the web in a great place. At any given point in time you have hundreds if not thousands of active telemetry points and metrics being logged by all of the major browser vendors, aggregating and gaining insight across the entire web (not just a single site and not just what is available to the web site analytics scripts) and being shared and used in the standards process to help us better design and build features.

Building a Web Scale Telemetry System

I don't think a lot of people understand web scale. In general we have trouble, as humans, with large numbers. Increasingly larger sequences tend to scale much more slowly in our minds than in reality. My favorite book on the subject currently escapes my mind, but once I'm back at my desk at home I'll dig it out and share it with everyone.

So what does web scale mean? Well, imagine that Facebook is serving up 400 million users worth of information a day and imagine that they account for say, 5% of the web traffic. These are completely made up numbers. I could go look up real numbers, but let's just hand wave. Now, imagine that Internet Explorer and Microsoft Edge have about 50% of the desktop market share (again made up, please don't bust me!) and that accounts for about 1.2 billion users.

So Facebook's problem is that they have to scale to deliver data (far more data per response than our telemetry would be, admittedly) to 400 million users, and they account for 5% of all navigations. Let's play with these numbers a bit and see how they compare to the browser itself. Instead of 400 million users, let's say we are at 600 million (half of that 1.2 billion person market). Instead of 5% of page navigations we are logging 100% of them. That is 1.5x the users times 20x the navigations, which puts us at roughly 30x more telemetry data points to manage than Facebook has responses to serve to their entire user base in a day. And this is just the beginning of a web scale feature. We don't get the luxury of millions of sites distributing the load; instead all of the data from all of those users has to slowly hit our endpoints, and it's very unique, uncacheable data.

Needless to say, we don't upload all of it, nor can we. You can opt out of data collection and many people do. We will also restrict our sampling groups for various reasons to limit the data feeds to an amount that can be managed. But imagine there is a rogue actor in the system, maybe a new web browser that doesn't have years of refinement on its telemetry points. You could imagine such a browser over-logging. In fact, to be effective early on, while your browser is being rolled out, you have to log more, and more often, to get enough data points while you build your user base up. You want the browser to be able to log a LOT of data and then for that data to be restricted by policies, based on the amount and types of data being received, before going up to the cloud.

Cool, so that is a very long-winded way to introduce my telemetry quality, or LOD (level of detail), meter. Basically, when you are writing telemetry, how much thought and work should you put into it, based on the audience it will be going to and the size of that audience? As you scale up and want to roll out to, say, every Windows 10 user, then something as innocuous as a string-formatting swprintf might have to be rethought and reconsidered. The following figure shows, for my team, what our considerations are when we think about who we will be tapping to provide our data points for us ;-)


I can also accompany this with a simple table that maps the target audience to the various points that change as you slide from the left of the scale to the right. From left to right the columns are:

  • Audience - From the figure above who is going to be running the bits and collecting data.
  • Code Quality - How much thought has to be put into the code quality?
  • Code Performance - How much performance impact can the code have?
  • Output Formatting - How verbose can the data be and how can it be logged?
  • Logging Data Size - How much data can be logged?
  • Logging Frequency - Can you log every bit of data or do you have to start sampling?

Note that many of these are interrelated. By changing your log data size, you might be able to increase your frequency. In production, you tend to find that the answer is "yes, optimize all of these" in order to meet the desired performance and business requirements. Also note that as you move towards release, the requirements are somewhat additive: for Release you would do all of the Beta-oriented enhancements as well as those specified for Release itself. Here is the table with some of my own off-the-cuff figures.

Audience  | Quality           | Performance | Output            | Log Size | Log Frequency
Developer | Hack              | Debug Perf  | Strings and Files | GBs      | Every Call
Team      | Doesn't Crash     | Debug Perf  | Log Files         | GBs      | Every Call
Internal  | Reviewed          | 1.2-1.5x    | Log Files/CSV     | GBs      | Aggregated
Beta      | Reviewed & Tested | 1.1x        | String Telemetry  | <MBs     | Aggregated + Compressed
Release   | Optimized         | ~1x         | Binary Telemetry  | <5KB     | Aggregated + Sampled + Compressed

Hopefully it is clear from the table that for a developer hack you can get away with murder. You can use debug builds with debug performance (debug builds of some products can be 2x-5x slower than their retail counterparts), use debug string formatting and output, maybe write it to a file, or maybe just use OutputDebugString. You can log gigs of data and, most importantly, you can log everything, every call, full fidelity. In this mode you'll do data aggregation later.
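As a trivial illustration of that left-most column, a developer-hack logger can be as simple as this (hypothetical names, debug builds only, every call formatted as a string):

    #include <windows.h>
    #include <cstdio>

    // Formats a line per call and dumps it to the debugger output.
    // Only developers ever pay this per-call formatting cost.
    void LogApiCallDebug(const wchar_t* apiName, double elapsedMicroseconds)
    {
    #ifdef _DEBUG
        wchar_t buffer[256];
        swprintf_s(buffer, L"[dom-prof] %s took %.2f us\n", apiName, elapsedMicroseconds);
        OutputDebugStringW(buffer);
    #else
        (void)apiName;
        (void)elapsedMicroseconds;
    #endif
    }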

The next interesting stage is internal releases. This might be to a broader team of individuals and it may also include using web crawlers, performance labs, test labs, etc... to exercise the code in question. Here we have to be more mindful of performance, the code needs a review to catch stupid mistakes, and you really need to start collecting data in a well-formatted manner. At this point, raw logs start to become normalized CSVs and data tends to be aggregated by the code before writing to the logs to save a bit more on the output size. You can still log gigs of data or more at this point though, assuming you can process all of it. You also probably want to enable the logging only when requested, for instance via an environment variable or by turning on a trace logging provider (ETW again, if you didn't follow that link you should, ETW is our preferred way of building these architectures into Windows).
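For the trace logging provider route, a bare-bones TraceLogging provider looks something like the sketch below. The provider name and GUID are made up; the important part is checking TraceLoggingProviderEnabled first so you only pay for logging when a trace session has actually turned the provider on:

    #include <windows.h>
    #include <TraceLoggingProvider.h>

    // Made-up provider name and GUID, for illustration only.
    TRACELOGGING_DEFINE_PROVIDER(
        g_hTelemetryProvider,
        "Contoso.Browser.DomTelemetry",
        (0x6f8f2a1b, 0x1234, 0x4cde, 0x9a, 0xbc, 0xde, 0xf0, 0x12, 0x34, 0x56, 0x78));

    void LogNavigationTiming(unsigned long long navStartQpc, unsigned long long loadEventQpc)
    {
        // Skip all work unless a trace session has enabled this provider.
        if (TraceLoggingProviderEnabled(g_hTelemetryProvider, 0 /*level*/, 0 /*keyword*/))
        {
            TraceLoggingWrite(
                g_hTelemetryProvider,
                "NavigationTiming",
                TraceLoggingUInt64(navStartQpc, "navStartQpc"),
                TraceLoggingUInt64(loadEventQpc, "loadEventQpc"));
        }
    }

    int main()
    {
        TraceLoggingRegister(g_hTelemetryProvider);
        LogNavigationTiming(100, 200);
        TraceLoggingUnregister(g_hTelemetryProvider);
    }

An environment variable check works just as well as the gate; the point is that the expensive formatting and writing only happens when someone asked for it.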

Depending on your scale, Beta and Release may have the same requirements. For us they tend to, since our beta size is in the millions of users and most beta users tend to enable our telemetry so they can give us that early feedback we need. Some companies ship debug builds to beta users though, so at this point you are mostly trying to be respectful of the end user's machine itself. You don't want to store gigs of log data and you don't want to upload uncompressed data. You may choose not to upload binary data at this point though; in fact, having it in a viewable format for the end user to inspect can be a good thing. Some users appreciate that. Others don't, but hey, you can't make everyone happy.

Finally, when you release, you have to focus on highly optimized code. At this point your telemetry should have as close to zero impact as possible on the performance marks for your application. Telemetry is very important to a product, but so is performance, so finding your balance is important. In a browser, we have no spare time to go collecting data, so we optimize at both the memory level and the CPU level to make sure we are sipping at the machine's resources and leaving the rest to the web page. You'll generally want to upload binary telemetry, in small packets, highly compressed. You'd be surprised what you can do with 5KB, for instance: we can upload an entire page's worth of sampled DOM profiler information for most pages. More complex pages will take more, but that is where the sampling can be further refined. I'll talk a bit about some of these considerations now.

DOM Telemetry

Okay, so for DOM information, how can we turn the knobs? What is available, and what can we log? We decided that a sampled profiler would be best. This instruments the code so that some small set of calls, chosen by us, has timing information taken as part of the call. The check for whether or not we should log needs to be cheap, as does the overhead of logging the call. We also want some aggregation, since we know we'll take thousands of samples and we only want to upload them if we aren't abusing the telemetry pipeline.

A solution that used a circular logging buffer plus a sampled API count with an initial random offset was sufficient for our cases. I apologize for slamming you with almost ten optimizations in that one sentence, but I didn't feel going through the entire decision tree that we did would be useful. This is the kind of feature that can take longer to design than to code ;-)

Let's start with sampling. We built this into the accumulator itself, which meant that any CPU overhead from logging could be eliminated whenever we weren't sampling (rather than having sampling accept or reject data at a higher level). Our sampling rate was a simple counter, something like 1 in every 1,000 or 1 in every 10,000 calls. By tweaking the number to 1, we could log every call if we wanted to, making it a call-attributed profiler. For my hack I did build a call-attributed profiler instead, since I wanted more complete data and my collection size was a small set of machines. The outcome of that effort showed that to do it right you would need to aggregate, which we aren't doing in this model. Aggregation can cost CPU, and in our case we can defer that cost to the cloud!

With a simple counter and mod check we can now know if we have to log. To avoid a bias against the first n-1 samples, we start our counter with a random offset. That means we might log the first call, the 50th call, whatever, but from there it is spaced by our sampling interval. These are the kinds of tricks you have to use when sampling; otherwise you might miss things like bootstrapping code if you ALWAYS skip the first sampling interval of values.
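Here's the shape of that check as a sketch (the class and names are mine, not the shipping code): a counter starting at a random offset, incremented on every call, with a cheap mod test against the sampling interval.

    #include <atomic>
    #include <random>

    class ApiSampler
    {
    public:
        explicit ApiSampler(unsigned interval)
            : m_interval(interval)
        {
            // Random starting offset so the first interval isn't systematically
            // skipped (otherwise bootstrapping code would never get sampled).
            std::random_device rd;
            m_counter = rd() % interval;
        }

        // Cheap per-call check: true roughly once every m_interval calls.
        // Setting the interval to 1 turns this into a call-attributed profiler.
        bool ShouldSample()
        {
            // Relaxed atomics keep the hot path cheap; occasional races only
            // perturb the sampling slightly, which is fine for statistics.
            unsigned n = m_counter.fetch_add(1, std::memory_order_relaxed);
            return (n % m_interval) == 0;
        }

    private:
        std::atomic<unsigned> m_counter{0};
        const unsigned m_interval;
    };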

When logging we take the start and end QPCs (QueryPerformanceCounter values), do the math and then log the values. On 64-bit, we can submit a function pointer (this is the DOM function), the QPC delta and a few flag bits to the circular buffer and continue on. We don't even bother decoding the function pointers until we are in the cloud, where we marry the data with symbols. I can't recall exactly, but we also decided at some point that we would send the QueryPerformanceFrequency value in the payload so we could do that math in the cloud as well. We might have decided against that in the end, but you can clearly see the lengths we go to when thinking about how much CPU we use on the client's machine.
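A record along those lines is tiny; something like the following illustrative layout (not our real one):

    #include <windows.h>
    #include <cstdint>

    // One sample: DOM entry point address, elapsed QPC ticks, and a few flag bits.
    // The function pointer is only resolved against symbols later, in the cloud.
    struct ApiSample
    {
        void*    function;
        uint64_t qpcDelta;
        uint32_t flags;      // e.g. a hypothetical "call failed" bit
    };

    template <typename Fn>
    ApiSample MeasureCall(void* function, Fn&& invoke)
    {
        LARGE_INTEGER start, end;
        QueryPerformanceCounter(&start);
        invoke();                                   // the actual DOM API work
        QueryPerformanceCounter(&end);

        // Store only the raw tick delta; converting to real time with
        // QueryPerformanceFrequency can wait until the data is in the cloud.
        return { function, static_cast<uint64_t>(end.QuadPart - start.QuadPart), 0 };
    }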

The next knob we have is the circular buffer size and the logging frequency. We allow ourselves to log the first buffer during a navigation and then 1 more buffer every minute. If the buffer isn't full we log a partial buffer. If the buffer overflows then we simply lose samples. We never lose samples in the initial navigation buffer since we always commit it when it's ready and then put ourselves on a future logging diet.
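The buffer side might look roughly like this (again a sketch with invented names): a fixed-size array that drops samples on overflow and hands whatever it has, full or partial, to the uploader when flushed.

    #include <array>
    #include <cstddef>

    template <typename T, size_t Capacity>
    class SampleBuffer
    {
    public:
        // Returns false (and drops the sample) once the buffer is full.
        bool TryAdd(const T& sample)
        {
            if (m_count == Capacity)
                return false;                // overflow: lose samples rather than grow
            m_samples[m_count++] = sample;
            return true;
        }

        // Commit whatever we have, full or partial, then reset. Called when the
        // navigation buffer is ready and then at most once per minute afterwards.
        template <typename UploadFn>
        void Flush(UploadFn&& upload)
        {
            upload(m_samples.data(), m_count);
            m_count = 0;
        }

    private:
        std::array<T, Capacity> m_samples{};
        size_t m_count = 0;
    };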

Once this data hits the Windows telemetry service, it gets to decide if this user is opted into this type of logging. So we MIGHT in some cases be tracking things that don't make it up to us. We do try to detect this beforehand, but we can't always do so. There are also things like throttling that decide whether a buffer should go up or not. Once we hit production, which we did back in our first Windows 10 update release, scale kicks in and you don't even concern yourself with the missing data because you have WAY TOO MUCH data to deal with already!

The Windows telemetry pipeline also controls for many other variables which I'm not tuned into. There is an entire data science team which knows how to classify the machines, the users, the locale, and a bunch of other information from each Windows machine and then those become pivots that we can sometimes get in our data. We can certainly get details on domains and URLs once we have enough samples (to anonymize the data there must be a sufficient number of samples, otherwise we can end up seeing PII without realizing it).

Okay, I'm starting to get into the data itself so let's take a look at some of the insights this effort has brought to our attention!

Web API Insights

There are two schools of thought in data science. The first is to ask a specific question and then attempt to answer it with some telemetry. This is a very focused approach; it often yields results, but it rarely creates new questions or allows for deep insight. For the second approach, when we think about "big data" as opposed to "data science", we start to think about how our raw data has deeply buried patterns and insights for us to go glean. It's rarely this clean, but there are indeed patterns in that raw data, and if you have enough of it, you definitely start asking more questions ;-) The second school of thought, then, wants to add telemetry to many things with no specific question, then process the data and see if anything pops out.

Our Web API telemetry design is both. First, we did have some very specific questions, around things like "What are the top 10 DOM APIs by usage?" and "What are the top 10 DOM APIs by total exclusive time?" These are usage and performance questions. We didn't start by thinking about other questions, though, like "What APIs are the top 10 websites using today that they weren't using 3 months ago?" How could we ask a time-oriented question requiring many data points without having the first data point? Well, by collecting more raw data that didn't have specific questions in mind just yet, we can ask some of those questions later, historically if you will, and we can run algorithms to find patterns once we have additional data.

One of our biggest outcomes from the hackathon data was using a clustering algorithm to group the sites into 10 categories based on their API usage. Would you have guessed that 700 websites out of the top 1,000 would land in the same category, appearing similar to one another? I wouldn't have.

Here are some insights that we were able to derive. I'm, unfortunately, anonymizing this a little bit but hopefully in the future we'll be able to broadly share the data similar to how Chrome and FireFox are doing through their telemetry sites.

Insight #1: Upon initial release of our feature, we found that our numbers were heavily skewed towards URLs in a specific country that we didn't expect to rank so high. We found, using this method, an indirect correlation between upgrade cadence and country. After a couple of weeks this skew completely evaporated from our data and we started to see the site distribution that we more traditionally expected.

Insight #2: Our crawler data only had about a 60-70% overlap with our live data. This meant that what people do on the web changes quite a bit between their initial navigations and when they start to interact with the page. Our crawler was blind to big sites where people spend a lot of time and do a lot of interactions. All of those interactive scenarios were only "marginally" hit by the crawler.

This meant that some APIs not on our performance optimization list started to jump up the list and became important to our team. We also started to extrapolate use cases from the data we were seeing. As an immediate example, APIs like setTimeout started to show up more, since that is how dynamic pages are written. requestAnimationFrame was the same. All of the scheduling APIs moved up the list a bit when we considered the live data, presenting differently than they did in the crawler. This was great news.

Insight #3: Even though I just talked down the crawler, it turns out it isn't THAT far off. Since we know its shortcomings, we can also account for them. We use the crawler to validate the live data (does it make sense?) and we use the live data to validate the crawler (is it still representative of the real world?). Having two different ways to get the same data, so they can cross-validate each other, is a huge bonus when doing any sort of data science project.

Insight #4: The web really needs to think about deprecation of APIs moving forward. The power of the web is becoming the ability for the runtime and language to adapt to new programming trends in months rather than years. This has the downside of leading to a bloated API set. When APIs are no longer used by the web, we could try to work towards their deprecation and eventual removal. Given the use trackers of Chrome, FireFox and Microsoft Edge, this can become more than just a hope. If we consider that Internet Explorer is supporting the legacy web on Windows platforms, filling that niche role of keeping the old web working, I see even more hope.

What we classically find is that only something like half of the web API surface is even used. Removing unused APIs would improve perf, shrink the browser footprint and make space for newer APIs that do what web developers actually want them to do.

Insight #5: My final insight is one that we are only beginning to realize. We are collecting data over time. FireFox has an evolution dashboard on their site, and here I'm linking one where they explore, I think, UI lag in the event loop and how that changes over time.

Why do over-time metrics matter for the browser? Well, by watching for usage trends we can allocate resources towards the API surface area that will need them most in the future. Or we can focus more on specifications that extend the areas where people are broadly using the API set. A great example would be monitoring adoption of something like MSE (Media Source Extensions) and whether or not the browser is supporting the media APIs necessary to deliver high quality experiences.

We can also determine if architectural changes have materially impacted performance, either to the positive or the negative. We've been able to "see" this in some of our data, though the results are inconclusive since we have too few data points currently. By logging API failures we can take this a step further and even find functional regressions if the number of failures increases dramatically, say between two releases. We don't yet have an example of this, but it will be really cool when it happens.

Conclusions

After re-reading, there is a LOT of information to digest in this article. Just the section on the telemetry LOD could be its own article. The Web API data, I'm sure, will fuel many more articles to come as well. We should be able to make this available for standards discussions in the near future, if we haven't already been using it in that capacity.

The most stand-out thought for me as a developer was that going from Hackathon to Production was a long process, but not nearly as long as I thought it would be. I won't discount the amount of work that everyone had to put in to make it happen, but we are talking about a dev month or two, not dev years. The outcome from the project will certainly drive many dev years worth of improvements, so in terms of cost/benefit it is definitely a positive feature.

Contrasting this with work that I did to instrument a few APIs with use tracker before we had this profiler available, I would say the general solution came out to be much, much cheaper. That doesn't mean everything can be generally solved. In fact, my use tracker for the APIs does more than just log timing information. It also handles the parameters passed in to give us more insight into how the API is being used.

In both cases adding telemetry was pretty easy though. And that is the key to telemetry in your product. It should be easy to add, easy to remove and developers should be aware of it. If you have systems in place from the beginning to collect this data, then your developers will use it. If you don't have the facilities then developers may or may not write it themselves, and can certainly write it very poorly. As your product grows you will experience telemetry growing pains. You'll certainly wish you had designed telemetry in from the start ;-) Hopefully some of the insights here can help you figure out what level of optimization, logging, etc... would be right for your project.

Credits

I'd like to provide credit here to the many people who ended up helping with my efforts in this space. I'll simply list first names, but I will contact each of them individually and back fill their twitter accounts or a link to a blog if they want.

Credit for the original design and driving the feature goes to my PM Todd Reifsteck (@ToddReifsteck) and our expert in Chakra who built the logging system, Arjun.

Credit for all of the work to mine the data goes mainly to one of our new team members Brandon. After a seamless hand-off of the data stream from Chakra we have then merged it with many other data streams to come up with the reports we are able to use now to drive all of the insights above.
