
Friday, December 25, 2015

Web API and Feature Usage - From Hackathon to Production

I wanted to provide some details on how a short, 3-day data science excursion has led to increased insights for me, my team and eventually for the web itself.

While my original article focused on the hardships we faced along the way, this article will focus on two topics that take much longer than 3 days to address. The first is how you take your telemetry and deliver it at web scale and production quality. You can see older articles from Ars Technica that had Windows 10 at 110 million installs back in October. That is a LOT of scale. The second is the insights we can gather after the data has been stripped of PII (personally identifiable information).

I'll start with a quick review of things we had prior to Windows 10, things we released with Windows 10, and then of course how we released DOM API profiling in our most recent release. This latter bit is the really interesting part for me since it is the final form of my Hackathon project (though in full transparency, the planning for the DOM telemetry project preceded my hackathon by a couple of months ;-)

Telemetry over the Years

The concept of gathering telemetry to determine how you are doing in the wild is nothing new, and web browsers, operating systems and many other applications have been doing it for a long time. The largest scale telemetry effort (and probably the oldest) on Windows is likely still Watson. We leverage Watson to gain insights into application and OS reliability and to focus our efforts on finding and fixing newly introduced crashing and memory-related bugs.

For the browser space, Chrome has been doing something with use counters for a while. These are great, lightweight boolean flags that get set on a page and then recorded as a roll-up. This can tell you, across some set of navigations to various pages, whether or not certain APIs are hit. An API, property or feature may or may not be hit depending on user interaction, flighting, the user navigating early, which ads load, etc... So you have to rely on large scale statistics for smoothing, but overall this is pretty cool stuff that you can view on the chrome status webpage.
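
To make that concrete, here is a rough sketch of what a per-page use counter can look like. This is illustrative TypeScript, not Chrome's actual implementation; the feature names are made up.

    // Each page gets a set of boolean flags, flipped the first time a feature is
    // hit, then rolled up into an aggregate tally at the end of the navigation.
    enum Feature { PromiseConstructor, RequestAnimationFrame, MutationObserver }

    class PageUseCounters {
      private hit = new Set<Feature>();

      count(feature: Feature): void {
        this.hit.add(feature); // cheap and idempotent: "was it used", not "how long did it take"
      }

      // At navigation end, fold this page's flags into a global tally.
      rollUpInto(totals: Map<Feature, number>): void {
        for (const f of this.hit) {
          totals.set(f, (totals.get(f) ?? 0) + 1);
        }
      }
    }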

Firefox has recently built something similar, and while I don't know exactly what they present, you can view it for yourself on their telemetry tracking page as well.

For Microsoft Edge, our telemetry over the years has been very nuanced. We started with a feature called SQM that allowed us to aggregate user information if they opted into our privacy policies. This let us figure out how many tabs you use on average, which UI features were getting clicks and a small set of other features. These streams were very limited in the amount of data we could send and so we were careful not to send up too much.

With Windows 10 we started to lean more on a new telemetry system based on ETW (Event Tracing for Windows), which gives us a very powerful platform that we were already familiar with for logging events not just in our application, but across the system. The major improvement here was extending our existing page load and navigation timings so that we could detect very quickly whether we had a performance problem on the web, without having to wait for users to file a bug for a site and then up-vote it through our various bug reporting portals.

Just doing something we already had in our back pockets from previous releases would have been boring though, so we decided that a boolean flag based structure, logged per navigation, would also give us a lot of extra data we could use to determine feature popularity within the browser itself. Annotating every DOM API would be overkill for such an effort (there are 6,300 of them in our case, of which nearly 3,000 are in general usage on the web), so we instead saved this for larger feature areas and for exploring APIs in depth. This functionality shipped in our initial release of Windows 10 and we've been steadily adding more and more telemetry points to this tracker. Many of these are mirrored in the Chrome data, but many are more specific to our own operations or to features we might want to optimize in the future.

This puts the web in a great place. At any given point in time you have hundreds if not thousands of active telemetry points and metrics being logged by all of the major browser vendors, aggregating and gaining insight across the entire web (not just a single site and not just what is available to the web site analytics scripts) and being shared and used in the standards process to help us better design and build features.

Building a Web Scale Telemetry System

I don't think a lot of people understand web scale. In general we have trouble, as humans, with large numbers. Increasingly larger sequences tend to scale much more slowly in our minds than in reality. My favorite book on the subject currently escapes my mind, but once I'm back at my desk at home I'll dig it out and share it with everyone.

So what does web scale mean? Well, imagine that Facebook is serving up 400 million users worth of information a day and imagine that they account for say, 5% of the web traffic. These are completely made up numbers. I could go look up real numbers, but let's just hand wave. Now, imagine that Internet Explorer and Microsoft Edge have about 50% of the desktop market share (again made up, please don't bust me!) and that accounts for about 1.2 billion users.

So Facebook's problem is that they have to scale to deliver data to 400 million users (and far richer data per user than our telemetry would be), and they account for 5% of all navigations. Let's play with these numbers a bit and see how they compare to the browser itself. Instead of 400 million users, let's say we are at 600 million (half of that 1.2 billion person market). Instead of 5% of page navigations we are logging 100% of them. That puts us at roughly 30x more telemetry data points to manage (1.5x the users times 20x the navigations) than Facebook managing a day's worth of responses to its entire user base. And this is just the beginning of a web scale feature. We don't get the luxury of millions of sites distributing the load; instead all of the data from all of those users has to slowly hit our endpoints, and it's very unique, uncacheable data.

Needless to say, we don't upload all of it, nor can we. You can opt out of data collection and many people do. We will also restrict our sampling groups for various reasons to limit the data feeds to an amount that can be managed. But imagine there is a rogue actor in the system, maybe a web browser that is new that doesn't have years of refinement on its telemetry points. You could imagine such a browser over logging. In fact, to be effective in the early times where your browser is being rolled out, you have to log more and more often to get enough data points while you build your user base up. You want the browser to be able to log a LOT of data and then for that data to be restricted by policies based on the amount and types of data being received before going up to the cloud.

Cool, so that is a very long-winded way to introduce my telemetry quality or LOD meter. Basically, when you are writing telemetry, how much thought and work should you put into it, based on the audience it will be going to and the size of that audience? As you scale up and you want to roll out to, say, every Windows 10 user, then something as innocuous as a string formatting swprintf might have to be rethought and reconsidered. The following figure shows, for my team, what our considerations are when we think about who we will be tapping to provide our data points for us ;-)


I can also accompany this with a simple table that maps the target audience to the various points that change as you slide from the left of the scale to the right. From left to right the columns are:

  • Audience - From the figure above who is going to be running the bits and collecting data.
  • Code Quality - How much thought has to be put into the code quality?
  • Code Performance - How much performance impact can the code have?
  • Output Formatting - How verbose can the data be and how can it be logged?
  • Logging Data Size - How much data can be logged?
  • Logging Frequency - Can you log every bit of data or do you have to start sampling?

Note many of these are interrelated. By changing your log data size, you might be able to increase your frequency. In production, you tend to find that the answer is, "yes, optimize all of these" in order to meet the desired performance and business requirements. Also note that as you get to release, the columns are somewhat additive. You would do all of the Beta oriented enhancements for Release as well as those specified for the Release code. Here is the table with some of my own off the cuff figures.

Audience  | Quality           | Performance | Output            | Log Size | Log Frequency
Developer | Hack              | Debug Perf  | Strings and Files | GBs      | Every Call
Team      | Doesn't Crash     | Debug Perf  | Log Files         | GBs      | Every Call
Internal  | Reviewed          | 1.2-1.5x    | Log Files/CSV     | GBs      | Aggregated
Beta      | Reviewed & Tested | 1.1x        | String Telemetry  | <MBs     | Aggregated + Compressed
Release   | Optimized         | ~1x         | Binary Telemetry  | <5KB     | Aggregated + Sampled + Compressed

Hopefully it is clear from the table that to build a developer hack into your system you can get away with murder. You can use debug builds with debug performance (some debug builds of some products can be 2x-5x slower than their retail counterparts) using debug string formatting and output; maybe write it to a file, but maybe just use OutputDebugString. You can log gigs of data and, most importantly, you can log everything, every call, full fidelity. In this mode you'll do data aggregation later.

The next interesting stage is internal releases. This might be to a broader team of individuals and it may also include using web crawlers, performance labs, test labs, etc... to exercise the code in question. Here we have to be more mindful of performance, the code needs a review on it to find stupid mistakes, and you really need to start collecting data in a well-formatted manner. At this point, raw logs start to become normalized CSVs and data tends to be aggregated by the code before writing to the logs to save a bit more on the output size. You can still log gigs of data or more at this point though, assuming you can process all of it. You also probably want to only enable the logging when requested, for instance via an environment variable or by turning on a trace logging provider (ETW again; if you didn't follow that link you should, since ETW is our preferred way of building these architectures into Windows).

Depending on your scale, Beta and Release may have the same requirements. For us they tend to, since our beta size is in the millions of users and most beta users tend to enable our telemetry so they can give us that early feedback we need. Some companies ship debug builds to beta users though, so at this point you are just trying to be respectful of the end user's machine itself. You don't want to store gigs of log data, and you don't want to upload uncompressed data. You may choose not to upload binary data at this point though. In fact, having it in a viewable format for the end user to see can be a good thing. Some users appreciate that. Others don't, but hey, you can't make everyone happy.

Finally when you release, you have to focus on highly optimized code. At this point your telemetry should have as close to zero impact as possible on the performance of your application. Telemetry is very important to a product, but so is performance, so finding your balance is important. In a browser, we have no spare time to go collecting data, so we optimize both at the memory level and the CPU level to make sure we are sipping at the machine's resources and leaving the rest to the web page. You'll generally want to upload binary telemetry, in small packets, highly compressed. You'd be surprised what you can do with 5KB, for instance. We can upload an entire page's worth of sampled DOM profiler information on most pages. More complex pages will take more, but that is where the sampling can be further refined. I'll talk a bit about some of these considerations now.

DOM Telemetry

Okay, so for DOM information how can we turn the knobs, what is available, what can we log? We decided that a sampled profiler would be best. This instruments the code so that a small set of calls, chosen by us, have timing information taken as part of the call. The check for whether or not we should log needs to be cheap, as does the overhead of logging the call. We also want some aggregation since we know there are going to be thousands of samples taken, and we only want to upload those samples if we aren't abusing the telemetry pipeline.

A solution that used a circular logging buffer + a sampled API count with initial random offset was sufficient for our cases. I apologize for slamming you with almost ten optimizations in that one sentence, but I didn't feel going through the entire decision tree that we did would be useful. This is the kind of feature that can take longer to design than code ;-)

Let's start with sampling. We built this into the accumulator itself. This meant that any CPU overhead from logging could be eliminated whenever we weren't sampling (rather than having sampling accept or reject data at a higher level). Our sampling rate was a simple counter, something like 1 in every 1,000 or 1 in every 10,000. By tweaking the number to 1, we could log every call if we wanted, making it a call-attributed profiler. For my hack I did build a call-attributed profiler instead, since I wanted more complete data and my collection size was a small set of machines. The outcome of that effort, though, showed that to do it right you would need to aggregate, which we aren't doing in this model. Aggregation can cost CPU, and we can defer that cost to the cloud in our case!

With a simple counter and mod check we now know whether we have to log. To avoid a bias against the first n-1 samples, we start our counter with a random offset. That means we might log the first call, the 50th call, whatever, but from there it is spaced by our sampling interval. These are some of the tricks you have to use when sampling; otherwise you might miss things like bootstrapping code if you ALWAYS skip the first sampling interval of values.
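
A sketch of that sampling decision, in illustrative TypeScript (the real code is native and lives in the accumulator; the names are made up):

    class ApiSampler {
      private countdown: number;

      constructor(private interval: number) {
        // Random initial offset: bootstrapping code gets the same chance of being
        // sampled as steady-state code instead of always being skipped.
        this.countdown = 1 + Math.floor(Math.random() * interval);
      }

      shouldSample(): boolean {
        if (--this.countdown === 0) {
          this.countdown = this.interval; // next sample is exactly one interval away
          return true;
        }
        return false;
      }
    }

    // interval = 1 logs every call, turning this into a call-attributed profiler.
    const domApiSampler = new ApiSampler(1000);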

When logging we take the start and end QPCs (QueryPerformanceCounter values), do the math and then log the values. On 64-bit, we can submit a function pointer (this is the DOM function), the QPC delta and a few flag bits to the circular buffer and continue on. We don't even bother decoding the function pointers until we are in the cloud, where we marry the data with symbols. I can't recall exactly, but we also decided at some point to send the QueryPerformanceFrequency value down in the payload so we could do that math in the cloud as well. We might have decided against that in the end, but you can clearly see the lengths we go to when thinking about how much CPU we use on the client's machine.

The next knob we have is the circular buffer size and the logging frequency. We allow ourselves to log the first buffer during a navigation and then 1 more buffer every minute. If the buffer isn't full we log a partial buffer. If the buffer overflows then we simply lose samples. We never lose samples in the initial navigation buffer since we always commit it when it's ready and then put ourselves on a future logging diet.
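
Putting those knobs together, the record and buffer policy looks roughly like this (again an illustrative sketch; the field names are made up and the real implementation is native):

    interface ApiSample {
      funcId: number;   // stands in for the DOM function pointer; decoded against symbols in the cloud
      qpcDelta: number; // end QPC minus start QPC, left undivided on the client
      flags: number;
    }

    class SampleBuffer {
      private samples: ApiSample[] = [];
      constructor(private capacity: number) {}

      add(s: ApiSample): void {
        // On overflow we simply drop samples rather than spend CPU aggregating them.
        if (this.samples.length < this.capacity) {
          this.samples.push(s);
        }
      }

      // The first buffer is committed when the navigation is ready, then at most
      // one buffer per minute afterwards; a partial buffer is still committed.
      drainForUpload(): ApiSample[] {
        const out = this.samples;
        this.samples = [];
        return out;
      }
    }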

Once this data hits the Windows telemetry service, it gets to decide if this user is opted into this type of logging. So we MIGHT in some cases be tracking things that don't make it up to us. We do try to detect this beforehand, but we can't always do so. There are also things like throttling that decide whether a buffer goes up or not. Once we hit production, which we did back in our first Windows 10 update release, scale kicks in and you don't even concern yourself with the missing data because you have WAY TOO MUCH data to deal with already!

The Windows telemetry pipeline also controls for many other variables which I'm not tuned into. There is an entire data science team which knows how to classify the machines, the users, the locale, and a bunch of other information from each Windows machine and then those become pivots that we can sometimes get in our data. We can certainly get details on domains and URLs once we have enough samples (to anonymize the data there must be a sufficient number of samples, otherwise we can end up seeing PII without realizing it).

Okay, I'm starting to get into the data itself so let's take a look at some of the insights this effort has brought to our attention!

Web API Insights

There are two schools of thought in data science. The first is to ask a specific question and then attempt to answer it with some telemetry. This is a very focused approach; it often yields results, but it rarely creates new questions or allows for deep insight. The second, which is what we mean when we talk about "big data" as opposed to "data science", starts from the idea that our raw data has deeply buried patterns and insights for us to go glean. It's rarely that clean, but there are indeed patterns in that raw data, and if you have enough of it, you definitely start asking more questions ;-) This second school of thought wants to add telemetry to many things with no specific question in mind, then process the data and see if anything pops out.

Our Web API telemetry design is both. First, we did have some very specific questions, around things like, "What are the top 10 DOM APIs by usage?" and "What are the top 10 DOM APIs by total exclusive time?" These are usage and performance questions. We didn't start by thinking about other questions, though, like, "What APIs are the top 10 websites using today that are different from 3 months ago?" How could we ask a time-oriented question requiring many data points without having the first data point? Well, by collecting more raw data that didn't have specific questions in mind just yet, we can ask some of those questions later, historically if you will, and we can run algorithms to find patterns once we have additional data.
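
Expressed over the aggregated rows, the first two questions are little more than a sort (illustrative TypeScript; the real reports are produced in the cloud pipeline):

    interface ApiRollup { api: string; calls: number; exclusiveMs: number; }

    declare const rollups: ApiRollup[]; // however the aggregated data arrives

    function top10(rows: ApiRollup[], by: (r: ApiRollup) => number): ApiRollup[] {
      return [...rows].sort((a, b) => by(b) - by(a)).slice(0, 10);
    }

    const topByUsage = top10(rollups, r => r.calls);       // top 10 DOM APIs by usage
    const topByCost  = top10(rollups, r => r.exclusiveMs); // top 10 by total exclusive time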

One of our biggest outcomes from the hackathon data was using a clustering algorithm to cluster the sites into 10 categories based on their API usage. Would you have guessed that 700 websites out of the top 1000 would be categorized and appear similar to one another? I wouldn't have.

Here are some insights that we were able to derive. I'm, unfortunately, anonymizing this a little bit, but hopefully in the future we'll be able to broadly share the data, similar to how Chrome and Firefox do through their telemetry sites.

Insight #1: Upon initial release of our feature, we found that our numbers were heavily skewed towards URLs in a specific country that we didn't expect to be super high. We found, using this method, an indirect correlation between upgrade cadence and country. After a couple of weeks this completely evaporated from our data and we started to see the site distribution that we more traditionally expected.

Insight #2: Our crawler data only had about a 60-70% overlap with our live data. This meant that what people do on the web changes quite a bit between their initial navigations and when they start to interact with the page. Our crawler was blind to big sites where people spend a lot of time and do a lot of interactions. All of those interactive scenarios were only "marginally" hit by the crawler.

This means that some APIs not on our performance optimization list started to jump up the list and became important for our team. We also started to extrapolate use cases from the data we were seeing. As an immediate example, APIs like setTimeout started to show up more since that is how dynamic pages are written. requestAnimationFrame was the same. All of the scheduling APIs moved up the list a bit when we considered the live data and presented differently than the crawler did. This was great news.

Insight #3: Even though I just talked down the crawler, it turns out it isn't THAT far off. Since we know its shortcomings we can also account for them. We use the crawler to validate the live data (does it make sense?) and we use the live data to validate the crawler (is it still representative of the real world?). Having two different ways to get the same data to cross-validate is a huge bonus when doing any sort of data science project.

Insight #4: The web really needs to think about deprecation of APIs moving forward. The power of the web is becoming the ability for the run-time and language to adapt to new programming trends in months rather than years. This has the downside of leading to a bloated API set. When APIs are no longer used by the web we could try to work towards their deprecation and eventual removal. Given the use trackers of Chrome, Firefox and Microsoft Edge this can become more than just a hope. If we consider that Internet Explorer is supporting the legacy web on Windows platforms and filling that niche role of keeping the old web working, I see even more hope.

What we classically find is that only something like half of the web API surface is even used. Removing APIs would improve perf, shrink browser footprint and make space for newer APIs that do what web developers actually want them to do.

Insight #5: My final insight is one that we are only beginning to realize. We are collecting data over time. Firefox has an evolution dashboard on their site, and here I'm linking one where they explore, I think, UI lags in the event loop and how that changes over time.

Why do overtime metrics matter for the browser? Well, by watching for usage trends we can allocate resources towards API surface area that will need it most in the future. Or we can focus more on specifications that extend the areas where people are broadly using the API set. A great example would be monitoring adoption of things like MSE or Media Source Extensions and whether or not the browser is supporting the media APIs necessary to deliver high quality experiences.

We can also determine if architectural changes have materially impacted performance either to the positive or negative. We've been able to "see" this in some of our data though the results are inconclusive since we have too few data points currently. By logging API failures we can take this a step further and even find functional regressions if the number of failures increases dramatically say between two releases. We don't yet have an example of this, but it will be really cool when it happens.

Conclusions

After re-reading, there is a LOT of information to digest in this article. Just the section on the telemetry LOD could be its own article. The Web API data, I'm sure, will be many more articles to come as well. We should be able to make this available for standards discussions in the near future if we haven't already been using it in that capacity.

The most stand-out thought for me as a developer was that going from Hackathon to Production was a long process, but not nearly as long as I thought it would be. I won't discount the amount of work that everyone had to put in to make it happen, but we are talking about a dev month or two, not dev years. The outcome from the project will certainly drive many dev years worth of improvements, so in terms of cost/benefit it is definitely a positive feature.

Contrasting this with work that I did to instrument a few APIs with use tracker before we had this profiler available, I would say the general solution came out to be much, much cheaper. That doesn't mean everything can be generally solved. In fact, my use tracker for the APIs does more than just log timing information. It also handles the parameters passed in to give us more insight into how the API is being used.

In both cases adding telemetry was pretty easy though. And that is the key to telemetry in your product. It should be easy to add, easy to remove and developers should be aware of it. If you have systems in place from the beginning to collect this data, then your developers will use it. If you don't have the facilities then developers may or may not write it themselves, and can certainly write it very poorly. As your product grows you will experience telemetry growing pains. You'll certainly wish you had designed telemetry in from the start ;-) Hopefully some of the insights here can help you figure out what level of optimization, logging, etc... would be right for your project.

Credits

I'd like to provide credit here to the many people who ended up helping with my efforts in this space. I'll simply list first names, but I will contact each of them individually and back fill their twitter accounts or a link to a blog if they want.

Credit for the original design and driving the feature goes to my PM Todd Reifsteck (@ToddReifsteck) and our expert in Chakra who built the logging system, Arjun.

Credit for all of the work to mine the data goes mainly to one of our new team members Brandon. After a seamless hand-off of the data stream from Chakra we have then merged it with many other data streams to come up with the reports we are able to use now to drive all of the insights above.

Tuesday, December 22, 2015

A 2014 WebVR Challenge and Review

I almost can't believe it was nearly 2 years ago when I started to think about WebVR as a serious and quite new presentation platform for the web. The WebGL implementation in Internet Explorer 11 was state of the art and you could do some pretty amazing things with it. While there were still a few minor holes, customers were generally happy and the performance was great.

A good friend, Reza Nourai, had his DK1 at the time and was playing around with a bunch of experimental optimizations. He was working on the DX12 team, so you can imagine he knew his stuff and could find performance problems both in the way we built games/apps for VR and in the way the hardware serviced all of the requests. In fact, it wasn't long after our Hackathon that he got a job at Oculus and gained his own bit of power over the direction that VR was heading ;-) For the short time that I had access to the GPU mad scientist, we decided that if Microsoft was going to give us two days for the first ever OneHack, then we'd play around with developing an implementation of this spec we kept hearing about, WebVR, and attach it to the DK1 we had access to.

This became our challenge. Over the course of 2 days, implement the critical parts of the WebVR specification (that was my job, I'm the OM expert for the IE team), get libVR running in our build environment and finally attach the business end of a WebGLRenderingContext to the device itself so we could hopefully spin some cubes around. The TLDR is that we both succeeded and failed. We spent more time redesigning the API and crafting it into something useful than simply trying to put something on the screen. So in the end, we wound up with an implementation that blew up the debug runtime and rendered a blue texture. This eventually fixed itself with a new version of the libVR, but that was many months later. We never did track down why we hit this snag, nor was it important. We had already done what we set out to do: integrate and build an API set so we had something to play around with, something to find all of the holes and issues and things we wanted to make better. It's from this end point that many ideas and understandings came, and I hope to share these with you now.

Getting Devices

Finding and initializing devices, or hooking up a protected process to a device, is not an easy task. You might have to ask the user's permission (blocking) or any number of other things. At the time, the WebVR implementation did not have a concept for how long this would take, and it did not return a Promise or have any other asynchronous completion object (a callback, for instance) that would let you continue to run your application code and respond to user input, such as trying to navigate away. This is just a bad design. The browser needs APIs that get out of your way as quickly as possible when long running tasks could be serviced on a background thread.

We redesigned this portion and passed in a callback. We serviced the callback in the requestAnimationFrame queue and gave you access at that point to a VRDevice. Obviously this wasn't the best approach, but our feedback, had we had the foresight at the time to approach the WebVR group, would have been, "Make this a Promise or callback". At the time a Promise was not a browser intrinsic, so we probably would have ended up using a callback and then later moving to a Promise instead. I'm very happy to find the current specification does make this a Promise.
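
In rough terms (sketched from memory of the later WebVR spec shape, typed loosely), the enumeration now looks something like this:

    const nav = navigator as Navigator & { getVRDisplays?: () => Promise<any[]> };

    async function findHmd(): Promise<any | null> {
      if (!nav.getVRDisplays) {
        return null; // feature-detect; no VR support in this browser
      }
      // Permission prompts and device IO can happen behind this await without
      // blocking script or user input.
      const displays = await nav.getVRDisplays();
      return displays.length > 0 ? displays[0] : null;
    }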

This still comes with trade-offs. Promises are micro-tasks and are serviced aggressively. Do you really want to service this request aggressively, or wait until idle time or some other time? You can always post your own idle task once you get the callback, to process it later. The returned value is a sequence, so it is fixed and unchanging.

The next trade-off comes when the user has a lot of VR devices. Do you init them all? Our API would let you get back device descriptors and then you could acquire them. This had two advantages. First, we could cache the previous state and return it more quickly without having to acquire the devices themselves. Second, you could get information about the devices without having to pay seconds of cost or immediately ask the user for permission. You might say, what is a lot? Well, imagine that I have 6 or 7 positional devices that I use along with 2 HMDs. And let's not forget positional audio, which is completely missing from the current specifications.

The APIs we build for this first step will likely be some of the most important we build for the entire experience. Right now the APIs cater to the interested developer who has everything connected, is actively trying to build something with the API and is willing to work around a poor user experience. Future APIs and experiences will have to be seamless and allow normal users easy access to our virtual worlds.

Using and Exposing Device Information

Having played with the concept of using the hardware device ID to tie multiple devices together, I find the arrangement very similar to how we got devices. While an enterprising developer can make sure that their environment is set up properly, we can't assert the same for the common user. For now, we should probably assume that the way to tie devices together is sufficient, and that an average user will only have one set of hardware. But then, if that is the case, why would we separate the positional tracking from the HMD itself? We are, after all, mostly tracking the position of the HMD itself in 3D space. For this reason, we didn't implement a positional VR device at all. We simply provided the positional information directly from the HMD through a series of properties.

Let's think about how the physical devices then map to the existing web object model. For the HMD we definitely need some concept of WebVR and the ability to get a device which comprises a rendering target and some positional/tracking information. This is all a single device, so having a single device expose the information makes the API much simpler to understand from a developer perspective.

What about those wicked hand controllers? We didn't have any, but we did have some gamepads. The Gamepad API is much more natural for this purpose. Of course it needs a position/orientation on the pad so that you can determine where it is. This is an easy addition that we hadn't made. It will also need a reset so you can zero out the values and set a zero position. Current VR input hardware tends to need this constantly, if for no other reason than user psychology.
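
Reading a tracked controller through the Gamepad API then looks roughly like the sketch below (the pose member comes from the Gamepad Extensions work; it may be null or only partially populated depending on the hardware, and updateHand is a stand-in for your own handler):

    declare function updateHand(index: number, position: Float32Array, orientation: Float32Array): void;

    function pollControllers(): void {
      for (const pad of navigator.getGamepads()) {
        if (!pad) continue;
        const pose = (pad as any).pose; // Gamepad Extensions; absent on plain gamepads
        if (pose && pose.position && pose.orientation) {
          // position is [x, y, z]; orientation is a quaternion [x, y, z, w].
          updateHand(pad.index, pose.position, pose.orientation);
        }
      }
    }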

Since we also didn't have WebAudio and positional audio in the device at the time we couldn't really have come up with a good solution then. Exposing a set of endpoints from the audio output device is likely the best way to make this work. Assuming that you can play audio out of the PC speakers directly is likely to fail miserably. While you could achieve some 3D sound you aren't really taking advantage of the speakers in the HMD itself. More than likely you'll want to send music and ambient audio to the PC speakers and you'll want to send positional audio information, like gunshots, etc... to the HMD speakers. WebAudio, fortunately, allows us to construct and bind our audio graphs however we want making this quite achievable with existing specifications.
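
A sketch of that split using plain Web Audio: ambient music goes to the default output, while a positional effect runs through a PannerNode and is routed to a specific output device via a media element and setSinkId. The HMD's sink id is an assumption you would discover through enumerateDevices.

    async function routeAudio(hmdSinkId: string, music: AudioBuffer, gunshot: AudioBuffer) {
      const ambientCtx = new AudioContext();             // default PC speakers
      const ambient = ambientCtx.createBufferSource();
      ambient.buffer = music;
      ambient.connect(ambientCtx.destination);
      ambient.start();

      const vrCtx = new AudioContext();
      const shot = vrCtx.createBufferSource();
      shot.buffer = gunshot;
      const panner = vrCtx.createPanner();               // positional node
      panner.positionX.value = 2;                        // a couple of meters to the listener's right
      shot.connect(panner);

      const sink = vrCtx.createMediaStreamDestination(); // capture the positional graph
      panner.connect(sink);
      const el = new Audio();
      el.srcObject = sink.stream;
      await (el as any).setSinkId(hmdSinkId);            // route to the HMD speakers
      await el.play();
      shot.start();
    }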

The rest of WebVR was "okay" for our needs for accessing device state. We didn't see the purpose of defining new types for things like rects and points. For instance, DOMPoint is overkill (and I believe it was named something different before it took on the current name). There is nothing of value in defining getter/setter pairs for a type which could just be a dictionary (a generic JavaScript object). Further, it bakes in a concept like x, y, z, w natively that shouldn't be there at all and seems only to make adoption more difficult. To be fair to the linked specification, it seems to agree that other options, based solely on static methods and dictionary types, are possible.

Rendering Fast Enough

The VR sweet spot for fast enough is around 120fps. You can achieve results that don't induce simulation sickness (or virtual reality sickness) at a lower FPS, but you really can't miss a single frame and you have to have a very fast and responsive IMU (this is the unit that tracks your head movement). What we found when using canvas and window.requestAnimationFrame is that we couldn't even get 60fps, let alone more. The reason is the browser tends to lock to the monitor refresh rate. At the time, we also had 1 frame to commit the scene and one more frame to compose the final desktop output. That added 2 more frames of latency. That latency will dramatically impact the simulation quality.

But we could do better. First, we could disable browser frame commits and take over the device entirely. By limiting the browser frames we could get faster intermediate outputs from the canvas. We also had to either layer or flip the canvas itself (technical details) and so we chose to layer it since flipping an overly large full-screen window was a waste of pixels. We didn't need that many, so we could get away with a much smaller canvas entirely.

We found we didn't want a browser requestAnimationFrame at all. That entailed giving up time to the browser's layout engine, it entailed sharing the composed surface and in the end it meant sharing the refresh rate. We knew that devices were going to get faster. We knew 75fps was on the way and that 90 or 105 or 120 was just a year or so away. Obviously browsers were going to crush FPS to the lowest number possible for achieving performance while balancing the time to do layout and rendering. 60fps is almost "too fast" for the web and most pages only run at 1-3 "changed" frames per second while browsers do tricks behind the scenes to make other things like user interactivity and scrolling appear to run at a much faster rate.

We decided to add requestAnimationFrame to the VRDevice instead. Now we gained a bunch of capabilities, but they aren't all obvious so I'll point them out. First, we now hit the native speeds of the devices, we sync to the device v-sync, and we don't wait for the layout and rendering sub-systems of the browser to complete. We give you the full v-sync if we can. This is huge. Second, we are unbound from the monitor refresh rate and backing browser refresh rate, so if we want to run at 120fps while the browser does 60fps, we can. An unexpected win is that we could move the VRDevice off-thread entirely and into a web worker. So long as the web worker had all of the functionality we needed, or it existed on the device or other WebVR interfaces, we could do it off-thread now! You might point out that WebGL wasn't generally available off-thread in browsers, but to be honest, that wasn't a blocker for us. We could get a minimal amount of rendering available in the web worker. We began experimenting here, but never completed the transition. It would have been a couple of weeks, rather than days, of work to make the entire rendering stack work in the web worker thread at the time.
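
The resulting render loop, sketched with the experimental interface names we used at the time (not a shipped API), is driven entirely by the device:

    interface VRDeviceLike {
      requestAnimationFrame(cb: (timestamp: number) => void): number;
      getRenderingContext(): WebGLRenderingContext;
      submitFrame(): void;
    }

    declare function drawScene(gl: WebGLRenderingContext, timestamp: number): void;

    function startRenderLoop(device: VRDeviceLike): void {
      const gl = device.getRenderingContext();

      function frame(timestamp: number): void {
        drawScene(gl, timestamp);            // app-specific WebGL work
        device.submitFrame();                // hand the frame straight to the HMD
        device.requestAnimationFrame(frame); // v-synced to the device, not the monitor
      }
      device.requestAnimationFrame(frame);
    }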

So we found that, as long as you treat VR like another presentation mechanism with unique capabilities, you can achieve fast and fluid frame commits. You have to really break the browser's normal model to get here though. For this reason I felt that WebVR in general should focus more on how to drive content to this alternate device using a device-specific API rather than on piggybacking existing browser features such as canvas and the full-screen APIs, and just treating the HMD like another monitor.

Improving Rendering Quality

When we were focusing on our hack, WebVR had some really poor rendering. I'm not even sure we had proper time warp available to us in the SDK we chose and certainly the DK1 we had was rendering at a low frame rate, low resolution, etc... But even with this low resolution you still really had to think about the rendering quality. You still wanted to render a larger than necessary texture to take advantage of deformation. You still wanted to get a high quality deformation mesh that matched the users optics profile. And you still wanted to hit the 60fps framerate. Initial WebGL implementations with naive programming practices did not make this easy. Fortunately, we weren't naive and we owned the entire stack so when we hit a stall we knew exactly where and how to work around it. That makes this next section very interesting, because we were able to achieve a lot without the restrictions of building against a black box system.

The default WebVR concept at the time was to take a canvas element, size it to the full size of the screen and then full screen it onto the HMD. In this mode the HMD is visualizing the browser window. With the original Oculus SDK you even had to go move the window into an area of the virtual desktop that showed up on the HMD. This was definitely an easy way to get something working. You simply needed to render into a window, and move that onto the HMD desktop and disable all of the basic features like deformation etc... (doing them yourself) to get things going. But this wasn't the state of the art, even at that time. So we went a step further.

We started by hooking our graphics stack directly into the Oculus SDK's initialization. This allowed us to set up all of the appropriate swap chains and render targets, while also giving us the ability to turn Oculus specific features on and off. We chose to use the Oculus deformation meshes, for instance, rather than our own, since it offloaded 1 more phase out of our graphics pipeline that could be done on another thread in the background without us having to pay the cost.

That got us pretty far, but we still had a concept of using a canvas in order to get the WebGLRenderingContext back. We then told the device about this canvas and it effectively swapped over to "VR" mode. Again, this was way different than the existing frameworks that relied on using the final textures from the canvas to present to the HMD. This extra step seemed unnecessary so we got rid of it and had the device give us back the WebGLRenderingContext instead. This made a LOT of sense. This also allowed the later movement off to the web worker thread ;-) So we killed two birds with one stone. We admitted that the HMD itself was a device with all of the associated graphics context, we gave it its own rendering context, textures and a bunch of other state and simply decoupled that from the browser itself. At this point you could almost render headless (no monitor) directly to the HMD. This is not easy to debug on the screen though, but fortunately Oculus had a readback texture that would give you back the final image presented to the HMD, so we could use that texture and make it available, on demand, off of the device so we only paid the perf cost if you requested it.

At the time, this was the best we could do. We were basically using WebGL to render, but we were using it in a way that made us look a lot more like an Oculus demo, leaning heavily on the SDK. The rendering quality was as good as we could get at the time, without us going into software level tweaks. I'll talk about some of those ideas (in later posts), which have now been implemented I believe by Oculus demos and some industry teams, so they won't be anything new, but can give you a better idea of why the WebVR API has to allow for innovation and can't simply be a minimal extension of existing DOM APIs and concepts if it wants to be successful.

Improvements Since the Hackathon

We were doing this for a Microsoft Hackathon and our target was the Internet Explorer 11 browser. You might notice IE and later Microsoft Edge don't have any support for WebVR. This is due both to the infancy of the technology and to there not being a truly compelling API. Providing easy access to VR for the masses sounds great, but VR requires a very high fidelity rendering capability and great performance if we want users to adopt it. I've seen many times where users will try VR for the first time, break out the sick bag, and not go back. Even if the devices are good enough, if the APIs are not good then it will hold back the adoption rates for our mainstream users. While great for developers, WebVR simply doesn't set the web up, IMO, for great VR experiences. This is a place where we just have to do better, a lot better, and fortunately we can.

The concept of the HMD as its own rendering device seems pretty critical to me. Making it have its own event loop and making it available on a web worker thread also go a long way to helping the overall experience and achieving 120fps rendering sometime in the next two years. But we can go even further. We do want, for instance, to be able to render both 3D content and 2D content in the same frame. A HUD is a perfect example. We want the devices to compose, where possible, these things together. We want to use time warp when we can't hit the 120fps boundaries so that there is a frame that the user can see that has been moved and shifted. Let's examine how a pre-deformed, pre-composed HUD system would look using our existing WebVR interfaces today if we turned on time warp.

We can use something like Babylon.js or Three.js for this and we can turn on their default WebVR presentation modes. By doing so, we get a canonical deformation applied for us when we render the scene. We overlay the HUD using HTML5, probably by layering it over top of the canvas. The browser then snapshots this and presents it to the HMD. The HUD itself is now "stuck" and occluding critical pixels from the 3D scene that would be nice to have. If you turned on time warp you'd smear the pixels in weird ways and it just wouldn't look as good as if you had submitted the two textures separately.

Let's redo this simulation using the WebGLRenderingContext on the device itself and having it get a little bit more knowledge about the various textures involved. We can instead render the 3D scene in full fidelity and commit that directly to the device. It now has all of the pixels. Further, it is NOT deformed, so the device is going to do that for us. Maybe the user has a custom deformation mesh that helps correct an optical abnormality for them; we'll use that instead of a stock choice. Next we tell the device its base element for the HUD. The browser properly layers this HUD and commits that as a separate texture to the device. The underlying SDK is now capable of time warping this content for multiple frames until we are ready to commit another update, and this can respond to the user as they move their head in real-time with the highest fidelity.
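
A purely hypothetical shape for that kind of layered submission, just to make the idea concrete (none of these names exist in any spec):

    interface LayeredVRDevice {
      submitScene(scene: WebGLTexture): void;                  // undeformed; the device applies the mesh
      submitHudLayer(hud: WebGLTexture, opacity: number): void;
    }

    function commitFrame(device: LayeredVRDevice, scene: WebGLTexture, hud: WebGLTexture): void {
      device.submitScene(scene);        // full-fidelity pixels, time warped per frame by the compositor
      device.submitHudLayer(hud, 0.8);  // composed late, so it never smears under time warp
    }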

You might say, but if you can render at 120fps then you are good right? No, not really. That still means up to 8ms of latency between the IMU reading and your rendering. The device can compensate for this by time warping with a much smaller latency by sampling the IMU when it goes to compose the final scene in the hardware. Also, since we decomposed the HTML overlay into its own texture, we can also billboard that into the final scene, partially transparent, or however we might want to show it. The 3D scene can shine through or we can even see new pixels from "around" the HUD since we didn't break them away.

Conclusion

Since our hack, the devices have changed dramatically. They are offering services, either in software or in the hardware, that couldn't have been predicted. Treating WebVR like a do-it-all shop that then splashes onto a flat screen seems like it's not going to be able to take advantage of the features in the hardware itself. An API that instead gets more information from the device and allows the device to advertise features that can be turned on and off might end up being a better approach. We move from an API that is device agnostic to one that embraces the devices themselves. No matter what we build, compatibility with older devices and having something "just work" on a bunch of VR devices is likely not going to happen. There is simply too much innovation happening and the API really has to allow for this innovation without getting in the way.

Our current WebVR specifications are pretty much the same now as they were when we did our hack in 2014. It's been almost 2 years now and the biggest improvement I've seen is the usage of the Promise capability. I don't know what advances a specification such as WebVR, but I'm betting the commercial kit from Oculus coming out in 2016 will do the trick. With a real device gaining broad adoption, there will likely be a bigger push to get something into all of the major browsers.

Sunday, August 2, 2015

3 Days of Data Science

Last week was Microsoft's OneWeek, and during that week we ran a special event called the OneHack. This is a 3-day hackathon just like any other, in which groups of engineers come together to build things that they think can have an impact at Microsoft. A hack can be anything; you aren't told what to hack on, it is your decision. Who you hack with, your decision. It's great. It is 3 days of time for you to prove a point and show your team, your organization and Microsoft that there is a better way to do something!

I decided that my hack would center around metrics that I could collect with a DOM profiler. Not a sampling DOM profiler like those that ship in the dev tools, but a full-blown, call-attributed profiler capable of telling you absolutely if a given site calls a given method. This project played in really well with the new culture of using data science to prioritize work and gain insights about the project.

The amount of data was going to be large, very large, as we had multiple methods available to us to scale out the collection. At the same time, the number of datasets which we could key back to the profiling data was also going to be quite large.

Thankfully, I was able to procure a small but amazingly skilled team of people ranging from data scientists, to visualization experts, to automation experts. Everyone was able to self direct while identifying the next potential insight and then working towards integrating the data to make the insight viable.

The rest of this article will describe the process insights I gained along the way, insights that can hopefully help you in future endeavors of a similar sort.

Even Data Must be Tested

Humans are amazingly good at spotting things that defy the general pattern of what they are seeing. For this reason, every time we were working with data we kept hearing someone say, "Something doesn't look right" or "Does it make sense that this value would be 0?" or me at one point saying "How is my min greater than my max?"

We were collecting gigabytes of data, sometimes recollecting it, and after collection every strip of data had to go through a transformation process. For us this meant using WPA (Windows Performance Analyzer) to decode some events that were emitted by the profiler. Once decoded, those events had to be split from their files, grouped, and joined back together again under our chosen key, which was the URL of the website that had executed the script. During this process there were so many things that could go wrong, the most likely of which was the processing getting choked up on an invalid CSV. This caused us to think about simple things that we could do in order to validate data. For instance:

  • Every profile output will have exactly 8 columns of data, not 1, not 14, but 8. Simply ensuring that the columns of data were there was a huge step in finding out when the CSV format was failing.
  • Every profile output will have a string in column 2 and the rest of the columns will be numeric. Parsing each column to numeric and validating it was another good fix (both checks are sketched just below).
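
In sketch form (illustrative TypeScript; our real pipeline was a mix of Perl, C# and Node):

    function validateRow(row: string[], lineNo: number): string[] {
      if (row.length !== 8) {
        throw new Error(`line ${lineNo}: expected 8 columns, got ${row.length}`);
      }
      row.forEach((cell, i) => {
        if (i === 1) return;                 // column 2 is the only string column
        if (!Number.isFinite(Number(cell))) {
          throw new Error(`line ${lineNo}: column ${i + 1} is not numeric: "${cell}"`);
        }
      });
      return row;
    }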

Once we had the site-to-API data, we then had to join or aggregate all of the similar site data together. This is where I failed miserably. In my haste I summed the min and max columns rather than computing, you know, the actual min and max. This led to some craziness, like max being greater than total in some cases. So then we came up with some additional tests (the corrected aggregation is sketched after the list).

  • The max should always be greater than or equal to the min.
  • If there is only a single count, make sure min, max and total are equal.
  • Test for negatives and overflows in your data. When working with gigabytes you are often surprised by overflow of common 32-bit ranges. We consequently used 64-bit and didn't experience this, yay!
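
Here is what the corrected aggregation looks like with those tests baked in (an illustrative sketch, not the script we actually ran):

    interface ApiStats { count: number; min: number; max: number; total: number; }

    function merge(a: ApiStats, b: ApiStats): ApiStats {
      const merged: ApiStats = {
        count: a.count + b.count,
        min: Math.min(a.min, b.min), // not a.min + b.min, which was the original bug
        max: Math.max(a.max, b.max),
        total: a.total + b.total,
      };
      if (merged.max < merged.min) {
        throw new Error("max is less than min after merge");
      }
      if (merged.count === 1 && (merged.min !== merged.max || merged.min !== merged.total)) {
        throw new Error("a single sample should have min === max === total");
      }
      if (merged.min < 0 || merged.total < 0) {
        throw new Error("negative value; check for overflow");
      }
      return merged;
    }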

You might ask, if you are just jumping in, how would you spot these issues? As noted, humans rock at spotting bad data. You can generally glance at a screen and spot it. But it isn't all that easy. So my recommendation is to draw the data into Excel or your favorite graphing library and graph it. At this point a lot of things will visually fall out. A single graph is worth a thousand rows of data!

Graphs can even be a part of your testing methodology if you are willing to have your transformation process simply export graphs. We were using a combination of Perl, C#, Node and io.js to do all of our data processing, so depending on which language you were in, graphing was more or less easy.

Intermediate Data is Still Data

When working on the data you start with very simple things. Some of the things we started with were:
  • Profiling data for thousands of websites and their iframe content resulting in more than 100k unique URLs calling a couple of thousand DOM APIs.
  • Profiling data for our check-in test suite keyed by test name.
  • A test suite file containing the test name along with more meta-data, like test owner.
  • A description of every browser's API surface area gleaned by crawling their DOMs.
  • A complete description of the EdgeHTML API surface area gleaned through our FastDOM Compiler (consequently the same compiler that emits the instrumentation).

And there were things we didn't even know about until we got into the hack:
  • For each test owner, which team are they on? Aggregating by team is a far more powerful visualization than by owner when you are in a large organization like Microsoft Edge.
  • A description of every browser API from the public specifications.
  • Unions, intersections and differences between different browser's API surface area.
  • Tons of filters for things where APIs were defined in different places on different browsers. Example: webkitRequestAnimationFrame and requestAnimationFrame should be merged as they are "aliases".

Given all of these fairly complete data sources, we kept finding that to generate new insights, we needed different combinations. We often had the combinations available, but you had to load a monstrous data source to get the values. Most data sources were in CSV format, but some were in XML and others in JSON. Different people needed data in different formats and so often having it in CSV was okay, but having it in JSON was better and so a conversion was needed.

Often when converting between formats or doing joins you lose data as well. Having these intermediates before the data is lost can be useful to others. Let me provide a couple of examples.

DOM Properties have either a getter or a getter/setter pair. Our profiler was capable of emitting telemetry separately for these two entry points. This meant that most data was keyed based on the DOM API name, however, our original data was keyed based on the API name AND the entry point type. We needed the profiler data to join with several other pieces of data and be emitted as an intermediate before we were able to collapse the key identities down to just the API name. This allowed us, on our final pages, to list for a given API, the separated performance characteristics of the getter and setter.

When aggregating URL down to domain, which is necessary for properly tabulating a group of website pages into a single consistent view, you end up losing many URLs. For example, an about:blank page should have the same URL as its parent, but it is also an about:blank page. If you want to backtrack the data to the page to investigate manually, you need the original URL. By building most of our intermediates with URL in mind and having only the final output convert URL down to domain, you are always able to recover data. In this case, adding a column to your data for domain, while keeping everything else, allows you to build the aggregate grouped by domain. I'll get more into this later since it hits on another important concept.

Given this, when you are working with data you can lose, I highly recommend keeping all intermediate outputs that you produce. Also, document them all: which data they pulled in, which scripts were used, and what all of the output files were. If you can, write a script that will re-run all of the conversion stages on the original data and produce your current state. We consistently found the following issues when working as a group...

  • Someone was working on a data file we could no longer produce. Perhaps a prior version.
  • Someone needed an intermediate that we had deleted.
  • Someone needed an intermediate + a conversion that we had lost the script for.

All of these slowed us down and could have been easily avoided.

Schema, Schema, Schema, and a SCHEMA

At some point it became obvious we had too much data for everyone to be able to remember everything that was available. In the beginning I was the data oracle since I had previously worked the most with this data. However, in the end we had everyone producing NEW data and so my days as oracle were numbered. It becomes critically important to have schemas available the longer your data munging exercise continues.

A schema is NOT costly. The schema for a CSV file is a header row. At first, we didn't have header rows, mostly due to the fact that when I write Perl I don't want to have to cut the first line ;-) I'm lazy apparently. But this led to massive confusion. When you have a CSV filled with 7 straight columns of numbers you start to question what they mean. The cost of a header row to a Perl script is one additional line of code; the cost of not putting in a header row for humans is hours of explaining what the columns mean. Hang it up if you ever accidentally transpose the data ;-)

A schema is like a design document or diagram. It is an abstraction. It allows others to extend the design without your input because you've already explained your portion of the design. With a schema you will gain the ability to create new cuts of the data and insights without having to dig deeply into the data itself and you can test new ideas without having to write code.

The basis of the previous point is that once you have a schema you can compose your data together in new ways. This can allow you to see how all of your data is interrelated and leads to the use of existing data for creating new diagrams and insights. When you don't have a schema, you find yourself recomposing data from baseline sources because in previous iterations you didn't groom the data appropriately. Let me provide an example from the URL vs domain case.

URL,Member,Calls
http://www.contoso.org/a,get HTMLElement::innerHTML,1
http://www.contoso.org/b,get HTMLElement::innerHTML,1

domain,Member,Calls
http://www.contoso.org,get HTMLElement::innerHTML,2

At some point we needed the domain mapping to reduce our reporting matrix to around 1000 data points. However, later we needed the URL as well when building the web pages which presented this data. That way you could understand which pages on contoso.org were actually using the property, since just navigating to contoso.org was unlikely to induce the property get. Had we been thinking about this in terms of a proper schema, we would have simply added a column to the data that would have allowed us to aggregate dynamically.

URL,domain,Member,Calls
http://www.contoso.org/a,http://www.contoso.org,get HTMLElement::innerHTML,1
http://www.contoso.org/b,http://www.contoso.org,get HTMLElement::innerHTML,1

Then we could generate a view on top of this such as (select domain,Member,SUM(Calls) group by domain,member). If that is too costly to compute each time, we can always schedule a task to recompute it.
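
In Node, against the flat rows above, that rollup is only a handful of lines. A rough sketch (the column names come from the sample; everything else is illustrative):

// Equivalent in spirit to: SELECT domain, Member, SUM(Calls) ... GROUP BY domain, Member
function rollupByDomain(rows) {
  const totals = new Map();
  for (const { domain, member, calls } of rows) {
    const key = `${domain}\t${member}`;
    totals.set(key, (totals.get(key) || 0) + calls);
  }
  return [...totals].map(([key, calls]) => {
    const [domain, member] = key.split('\t');
    return { domain, member, calls };
  });
}

console.log(rollupByDomain([
  { url: 'http://www.contoso.org/a', domain: 'http://www.contoso.org', member: 'get HTMLElement::innerHTML', calls: 1 },
  { url: 'http://www.contoso.org/b', domain: 'http://www.contoso.org', member: 'get HTMLElement::innerHTML', calls: 1 },
]));
// -> [{ domain: 'http://www.contoso.org', member: 'get HTMLElement::innerHTML', calls: 2 }]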

This insight alone would have saved us about two days' worth of effort had we employed it from the very beginning. While it's sad that we weren't able to employ it in time, there is always a next time. In fact, as we seek to make our efforts permanent, this type of insight will go a long way towards improving how we push data through our pipeline.

Data is NOT Hierarchical, Until you Display It

I produced so much data in JSON throughout the hack. We started with 2 MB JSON files, then 14 MB JSON files, then 150 MB JSON files. Why? These were our hierarchical data stores, and they "enabled" the code which generated the pages to step through increasingly detailed information by simply walking from the parent into the more detailed children. Effectively I was treating all of my data as hierarchical because my output in display was hierarchical.

Why is this a bad thing? It makes filtering, sorting and other operations a total mess. My data could have been one much larger table with a lot of redundant data. Then I could make all of my decisions at a high level, and only once filtering and sorting were done would I feed the results into the page generation routines. The opposite, which is what I actually had, is that I have to filter at EVERY level in the hierarchy instead. Or at least at every level where the data I'm trying to filter on exists. Let's take a simple JSON where we have levels for interfaces, the members they contain, and then information about each member. I want my output to be filtered by the "owner" of the member. Here is the shape of the JSON (the owners and numbers are made up for the example):


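{
  "HTMLElement": {
    "className": { "owner": "Jim", "calls": 10 },
    "innerHTML": { "owner": "Dave", "calls": 25 }
  }
}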
If I filter on "Jim", then I expect that I should first see the interface HTMLElement, then the member className, and I should be able to reasonably know that I have to display all of that data. To do that, I probably loop through all of the interface key names, then pass each interface down to another function which loops through the member key names, and finally pass those into another function which will "finally" filter for "Jim" and decide to show something. However, at this level I've lost key information, unless I've closed over it all, such that I don't know the interface, and potentially the member, that I'm inside of. I've not encoded the key into the data itself, I've encoded the key into the containing structure instead.

How do I fix it? Well, encode the keys into the data as I've described so that the members carry enough information. That would be a good start. But what happens when I have two properties on the same interface? How do I know to only emit the interface once? And how do I know when to end the interface? Okay, things are getting nasty fast. It turns out you end up first walking the entire data structure and memorizing enough information to then go back through and only show those things that belong to Jim. You end up replicating a portion of the hierarchy through the filter process, such that you can walk the "filtered" view and use the "original" view to get any missing data. This is not a great approach. It's error prone and the code gets ugly. I know, I've got tens of thousands of lines of it now :-(

A simpler approach is to simply produce the flat data file, do the filtering on that, and then create the hierarchy on the client as necessary. This approach allows for so much flexibility since it also lets me send the filter to the server if I don't want to retrieve the entire dataset just to produce the view. Here is a tabular view of the same data, along with the filtering/aggregation that produces Jim's view of it:


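Interface,Member,Owner,Calls
HTMLElement,className,Jim,10
HTMLElement,innerHTML,Dave,25

Filtered and aggregated where Owner = Jim:

Interface,Member,Owner,Calls
HTMLElement,className,Jim,10

And the filtering itself becomes a couple of honest lines of Node over those flat rows instead of a recursive walk. A rough sketch, with the row shape taken from the table above:

// Filter the flat rows first, then rebuild only as much hierarchy as the page needs.
const rows = [
  { interface: 'HTMLElement', member: 'className', owner: 'Jim', calls: 10 },
  { interface: 'HTMLElement', member: 'innerHTML', owner: 'Dave', calls: 25 },
];

function viewFor(owner) {
  const tree = {};
  for (const row of rows.filter(r => r.owner === owner)) {
    (tree[row.interface] = tree[row.interface] || {})[row.member] = { calls: row.calls };
  }
  return tree;
}

console.log(viewFor('Jim')); // -> { HTMLElement: { className: { calls: 10 } } }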
This is another example of a realization we were not able to take advantage of. We had people in R, others in PowerBI, some in Excel, one guy in C#, and me in Node.js, and we were all consuming these data files in different ways. Because our most important data was in a hierarchical format, it wasn't accessible to many of those tools. Focusing more on the schema, and on flat data tables, would have empowered the entire group. This leads me to a final realization about our data.

SQL is Probably Good Enough

At no point when I was looking at the data did I think, man, I should put this into a SQL Server really quick and just do some views. I don't know why. I'm an old school SQL Server guy. I've worked on, for their time, top tier websites. Usually we powered those with huge SQL instances.

SQL and some simple stored procedures could have done most of the aggregation that I was doing offline in scripts. It would have given me the schema information that I was after. I could have made quick filters and aggregates in a language built for providing exactly that. Had I just put my data in SQL I probably would have been way better off. Maybe it would have limited some of the other members of the team, but I'm betting my data exports to them would have been of much higher quality, and that would have more than made up for the hours we spent grovelling through huge data files on disk.

Further, it would have enabled the scale that we are undoubtedly going to be tackling next. Now, it would have required a permanent instance be up for people to connect to, but that is an operations problem and is really easy to solve. Right now our data is on some network share; it's going to be hard to make that an operations problem. And how do we "add data" over time, or "version data", or even "back it up"? I made a couple of cab archives along the way, but that is literally all we have from the hack.

Now, it was a hack, so I can't fault myself for these mistakes, but I can learn from them. Once your data outgrows Excel, it's time for SQL. Tons of tools know how to talk to SQL, and it would have sped up our iterations and improved our schema generation throughout the hack. Now I know ;-)

Conclusion

This is one arc of my experience during the hack: the data arc. This is where I rediscovered many things I had forgotten about building large data pipelines. There were many more experience arcs as well, especially around the insights that we had. More importantly, when we shared our insights we got a lot of really positive feedback from the team. We even decided to create a new alias for sharing data sets more broadly and openly. This will hopefully allow others to find problems in, improve on, create new insights from, and ultimately accept our data.

I'll end with a variation of an infographic that our data science guy shared with us. His had 8 or so stages of data grief, but the classic graphic apparently has 5, and they definitely seem appropriate given my experiences.