Sunday, September 20, 2015

Fixing Web Interoperability with Testing

For an API surface as old and mature as HTML5, you would think it would be relatively well tested. We've put years of effort into writing tests, whether as browser vendors or as sites and frameworks trying to make sure our stuff works. But these efforts rarely scale to the true size of the API set: its complications, its standardization, its interop across browsers, and all of the quirks that the web has come to rely on.

However, judging by the numerous differences between browsers, it is pretty clear there are no canonical tests. Tests for the web platform, if you assumed every browser was correct in its own way, would have tens of acceptable results. Developers would have to be aware of these different results and try to code in a way that allowed for all of them. Alas, we don't have tests like this, and developers tend to write to one given implementation.
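
To make that concrete, here is a minimal sketch (in JavaScript) of the kind of defensive coding this forces on developers. The specific APIs involved, addEventListener versus the legacy attachEvent, are just one well-worn historical example of such a divergence; the point is that the developer, not a test suite, ends up absorbing the difference.

    // Minimal sketch of coding around divergent implementations.
    // addEventListener vs. attachEvent is one historical example.
    function listen(target, type, handler) {
      if (target.addEventListener) {
        // Standards path supported by modern engines.
        target.addEventListener(type, handler, false);
      } else if (target.attachEvent) {
        // Legacy path: older IE document modes exposed attachEvent instead
        // and expected the event name to be prefixed with "on".
        target.attachEvent("on" + type, function () {
          handler.call(target, window.event);
        });
      }
    }

    listen(document, "click", function (e) {
      console.log("clicked at", e.clientX, e.clientY);
    });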

What does Interop Look Like Today?

Currently, interop between browsers is decided by site reports and bug fixes against those site reports. That means reducing millions of lines of JavaScript, HTML, and CSS to figure out what might be causing the problem for a given site.

Depending on which browser vendor is trying to match another, those bug fixes may or may not be good for the web. Arguably, implementing a bunch of webkit prefixes is great for making sites work, and users might even be happy, but it reduces the uptake of the standards versions of those same APIs, to the extent that a relatively well positioned webkit API might actually need to go into the standard and be marked as an alias for the standard feature itself. I'm not picking on webkit prefixes here either; it just so happens that they were more popular and so became the mind-share leaders.

So browser vendors rely on site reporting, mass per-site testing, analysis of complicated scripts, reductions, and potentially silent bug fixes to arrive at some sort of interoperable middle ground where most existing sites just work, most newly written sites are probably portable, and the web has multiple different ways of doing the same thing. This creates a complicated, sparsely tested API with many developer pitfalls. Current web interop really isn't looking so good.

But, are tests really going to make a difference?

Web Testing Statistics

I am about to throw out numbers. These numbers are based on real testing, aggregation, and profiling; however, they are not precise. We have not exhaustively vetted our own collection methodology to ensure that we aren't missing something, but given the size of the numbers we believe them to be fairly accurate. I'll assume roughly a +/- 5% error in the data I'm about to present.

Also, these numbers come from tests that we run against EdgeHTML, which means the data collected reflects how those tests exercise EdgeHTML. It is possible the numbers would swing and change if the same features were measured in different browsers. Okay, you probably just want the numbers ;-)

Test Set                       | Unique APIs | Comments
Top 10000 Websites             | ~2300       | Site crawler live data, not testing
EdgeHTML Regression Suite      | ~2500       | Filtered, incidental usage not counted
Top WebDriver Enabled Suites   | ~2200       | WebDriver enabled, any browser can run
CSS 2.1 Suite                  | ~70         | Not an API testing suite

But these numbers don't mean anything without knowing how many APIs there are. I can't give the number for every browser, but I can tell you that EdgeHTML is actually on the low side due to how our type system works: we adopted an HTML5-compliant type system earlier than some of the other browsers, and we've also worked hard to reduce vendor prefixes. For this reason Safari may have over 9k APIs detectable in the DOM. Firefox, which has a high level of standards compliance, will often have many moz-prefixed properties, so their numbers will be elevated as well.

For EdgeHTML the statistics are that we have ~4300 APIs and ~6200 entry points. The entry-point count is important since it represents things that aren't necessarily visible to the web developer: the way we implement callable objects (objects that behave like functions), for instance, or the fact that a read-write property has both a getter and a setter, each of which must be tested.
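
As a purely illustrative sketch (this is not our actual collection methodology), here is roughly why entry points outnumber APIs. Walking an interface prototype with JavaScript shows that a method contributes one callable entry point, while a read-write property contributes two, a getter and a setter, each of which needs its own testing.

    // Illustrative only: count APIs vs. testable entry points on a prototype.
    function countEntryPoints(proto, label) {
      var apis = 0;
      var entryPoints = 0;
      Object.getOwnPropertyNames(proto).forEach(function (name) {
        var desc = Object.getOwnPropertyDescriptor(proto, name);
        apis += 1;
        if (typeof desc.value === "function") {
          entryPoints += 1;                  // a method: one callable entry point
        } else if (desc.get || desc.set) {
          if (desc.get) entryPoints += 1;    // property getter
          if (desc.set) entryPoints += 1;    // property setter (read-write only)
        } else {
          entryPoints += 1;                  // plain data property, e.g. a constant
        }
      });
      console.log(label, "APIs:", apis, "entry points:", entryPoints);
    }

    // Run in a browser console; an HTML5-compliant type system keeps interface
    // members on the prototype, which is what makes this walk possible.
    countEntryPoints(Element.prototype, "Element");
    countEntryPoints(HTMLInputElement.prototype, "HTMLInputElement");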

The numbers we collected on testing have been aggregated to APIs, but we also keep the data on entry points. The entry-point data always results in a lower coverage percentage, since not testing a getter or setter can only take away from your percentage; it can't add to it.

So what if we added up all of the unique coverage? If we do that across the suites, we end up with ~3400 APIs covered. That's not bad; it starts to suggest that the "used web" heavily intersects with what we test, right? Okay, let's run a few more numbers then.

What about APIs used by the top 10,000 web sites but not covered by any of these suites? It turns out there are ~600 APIs in that group. Ouch, that could be an interop nightmare. If we instead count only publicly available testing (and remove our proprietary testing) the numbers jump significantly: we are now at over 1200 APIs that are hit by live sites but are not tested by public test suites... Ugh.

Confidence in Existing Tests

So we've talked a lot about the testing that does and does not exist, but what about the confidence we place in the existing testing? This is going to be less data driven, since this data is very hard for us to mine properly. It is, in fact, a sad truth that our tracking capabilities are currently lacking in this area, and that creates a nasty blind spot. I don't know how other browser vendors fare here, but I do know that the Blink project does a lot of bug tagging and management that I would argue is superior to our own. So kudos to them ;-)

Notice that the EdgeHTML test suite is considered a regression suite. It is the evolution of testing that started likely 20 years ago. While we have added some comprehensive testing for new features, many older features, the old DOM if you will, only have basic testing. That testing is mostly focused on verifying that the API basically works and most certainly doesn't crash, and it often contains a set of tests that began life as bugs in the API which we eventually fixed and wrote a "regression" test for. In IE 9, which is where this legacy DOM suite forks into a legacy and a modern suite, we carry over about half of our entry points; I have numbers from this period that range between 3100 and 3400 entry points depending on the document mode.

Given that we mostly have regression suites, we also find that many APIs are only hit a small number of times. In fact, our initial coverage numbers were quite high, around 60% of entry points, until we factored out incidental hits. Once that filtering was employed we were back down to 30%, and even that is likely generous, since there is no perfect way to filter out incidental usage.

This all combines to put my confidence in our regression suite at about 50/50. We know that many of the tests are high value because they represent actual bugs we found on the web in how websites were using our APIs. However, many of them are now more than 5, 10, or even 15 years old.

What about the public web test suites that we covered? Well, I'm going to call those 50/50 as well, because we aren't really seeing that they touch the same APIs, or test the APIs, that the web actually uses. I mean, 1200 APIs not hit by those suites is pretty telling. Note that we add more public test suites to our runs all the time, and the quality of a suite is not purely about how many APIs it tests. There is a lot of value, for instance, in a suite that tests only one thing extremely well. Recognizing that requires grading a suite's quality through inspection, which is a step we have not done. We rely on jQuery and other major frameworks to vet their own suites and provide us with a rough approximation of something that matches their requirements.

So in all, given that our existing suites aren't very exhaustive, they often test very old functionality, and many APIs used by top sites are missing from them, I'd subjectively rank the current test suites in my low-confidence category. How do we get a medium- to high-confidence suite that is cross-vendor and cross-device capable, and if we could, what would it look like?

Tools for a Public Suite

We already have some public suites available. What makes a suite public? What allows it to be run against multiple browsers? Let's look at some of the tools we have available for this.

First and foremost we have WebDriver, a public, standard API for launching and manipulating a browser session. The WebDriver protocol has commands for things like synthetic input, executing script in the target, and getting information back from the remote page. This is a great start for automating browsers in the most basic ways. It is not yet ready to test more complicated things that are OS specific. For instance, synthetic input is not the same as OS-level input, and even OS-level synthetic input is not the same as input from a driver within the OS. Often we have to test deeply to verify that our dependency stacks work; however, the more we do this, the less likely the test is to work in multiple browsers anyway. Expect advancements in this area over the coming months and years as web testing gets increasingly sophisticated and WebDriver evolves as a result.
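
To make the shape of this concrete, here is a rough sketch of driving a browser over the raw WebDriver protocol from Node-flavored JavaScript. It assumes a WebDriver server (a browser's driver executable, for example) is already listening on localhost:4444; the port, the empty capabilities, and the use of fetch as the HTTP client are all illustrative, and in practice most people would use a client library instead.

    // Sketch only: raw W3C WebDriver protocol calls against an assumed local server.
    var BASE = "http://localhost:4444";

    async function wd(method, path, body) {
      var res = await fetch(BASE + path, {
        method: method,
        headers: { "Content-Type": "application/json" },
        body: body ? JSON.stringify(body) : undefined,
      });
      return (await res.json()).value;
    }

    async function main() {
      // Create a session; the remote end launches a browser instance.
      var session = await wd("POST", "/session", {
        capabilities: { alwaysMatch: {} },
      });
      var id = session.sessionId;

      // Navigate, then execute script in the target page and read the result back.
      await wd("POST", "/session/" + id + "/url", { url: "https://example.com/" });
      var title = await wd("POST", "/session/" + id + "/execute/sync", {
        script: "return document.title;",
        args: [],
      });
      console.log("remote page title:", title);

      // Tear the session down.
      await wd("DELETE", "/session/" + id);
    }

    main();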

But WebDriver can really test anything in an abstract way. In fact, it could test an application like Notepad if there were a server for Notepad that could create sessions and launch instances. Something that generic doesn't sound like a proper tool; in fact it is just one part of a toolchain. To make WebDriver more valuable we need to specify tests in ways that anyone's WebDriver implementation can pick up, run, and report results from.

For this we have another technology called TestHarness.js. This is a set of functionality that, when imported into your HTML pages, allows them to define suites of tests. By swapping in a reporting script, the automation can change the reporting to suit its own needs. In this way you write to a test API by obeying the TestHarness.js contract, but you expect that someone running your test will likely replace these scripts with their own versions so that they can capture the results.
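
To show what that contract looks like, here is a small example of a TestHarness.js-style test. The script would live in an HTML page that also pulls in /resources/testharness.js and /resources/testharnessreport.js (those paths follow the web-platform-tests convention); a runner swaps the report script for its own version to capture results.

    // A minimal TestHarness.js test; test() and assert_equals() come from the harness.
    test(function () {
      var div = document.createElement("div");
      assert_equals(div.localName, "div");
      assert_equals(div.namespaceURI, "http://www.w3.org/1999/xhtml",
                    "createElement should produce an HTML-namespaced element");
    }, "document.createElement creates an element in the HTML namespace");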

The final bit of the puzzle is making tests broadly available and usable. Thankfully there is a great, public, open-source solution to this problem in GitHub. By placing your tests in a public repository and using TestHarness.js, you can rest assured that a browser vendor can clone, run, and report on your test suites. If your test suite is valuable enough it can be run by a browser vendor against their builds, not only protecting them from breaking the web, but protecting you from being broken. For large frameworks in broad usage across the web this is probably the best protection you could have to ensure your libraries continue to work as the web evolves.

The Vendor Suite

Hopefully the previous section has replaced any despair I might have caused at the beginning of the article with some hope instead, because I'm not here to bum anyone out. I'm here to fix the problem and accelerate the convergence of interoperability.

To this end, my proposal is what I call the vendor suite. I call it that, but it doesn't matter whether a vendor releases the suite or someone in the community does; it has to be informed by the vendors and use data from the actual browsers in order to achieve a level of completeness the likes of which we have never seen. Also, these are my own opinions, not those of my entire team, which increases the likelihood that a large undertaking such as this may be something driven by the community itself and in turn supported by the vendors.

The qualities of the vendor suite are that it accomplishes the following immediate goals...

  1. We define tests in 3 levels
    1. API - Individual API tests, where each API has some traits and some amount of automated and manually created testing is involved. The level of testing is at a single API, and the number of tests is dependent on the complexity of the API itself.
    2. Suite - A more exhaustive set of tests which focus on a specification or feature area. These suites could accompany a specification and be part of their contribution to the standards process.
    3. Ad Hoc - Tests which represent obscure, but important to test behaviors. Perhaps these behaviors are only relevant to a given device or legacy browser. Potentially their goal is to "fail" on a compliant and interoperable browser.
  2. If there is a discrepancy, tests should be written to verify the interoperable behavior until all browsers agree to transition to the standards behavior. Interoperability is the primary goal. Do websites work? Not, does a browser adhere to a spec for which no web developer is pining.
  3. As much testing should be driven by traits as possible.
    1. Attributes - At the API level, the behavior of a property-attribute pair will be generically tested. Examples of common behavior are how the property and attribute reflect one another, and how MutationObservers are fired.
    2. Due to traits-based testing we can verify the configuration of a property down to the ES6 level. This allows auto-generation of tests for read-only behavior and "use strict" adherence, along with other flags such as enumerability and configurability (a rough sketch of this kind of generated test follows this list).
    3. Tests should be able to be read and rewritten (regenerated) while maintaining any hand-coded or manually added value. Things like throwing exceptions are often edge cases and would need someone to specifically test for those behaviors.
  4. No vendor will pass the suite. It is simply not possible. For example:
    1. Vendor prefixed APIs will be tested for.
    2. Hundreds if not thousands of vendor specific APIs, device configurations, scenario configurations, etc... may be present.
  5. To this end, the baselines and cross-comparisons between browsers are what is really being measured. Vendors can use the results to help drive convergence and to start talks on what we want the final behavior to be. As we align on those behaviors, the tests can be updated to reflect our progress.
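
As promised above, here is a rough sketch of what traits-driven, auto-generated testing could look like, written against the TestHarness.js contract. The traits table itself is hypothetical; the idea is that each entry records a property, the content attribute it reflects, and the expected descriptor flags, and the loop expands every entry into generated tests.

    // Hypothetical traits table; a real one would be generated from vendor data.
    var traits = [
      { iface: Element,          prop: "id",       attr: "id",  readOnly: false },
      { iface: HTMLImageElement, prop: "alt",      attr: "alt", readOnly: false },
      { iface: Node,             prop: "nodeType",              readOnly: true  },
    ];

    traits.forEach(function (t) {
      var name = t.iface.name + "." + t.prop;

      // Generated ES-level configuration test: getter/setter presence,
      // enumerability, and configurability all come from the traits.
      test(function () {
        var desc = Object.getOwnPropertyDescriptor(t.iface.prototype, t.prop);
        assert_true(!!desc.get, "property has a getter");
        assert_equals(!!desc.set, !t.readOnly, "setter presence matches the read-only trait");
        assert_true(desc.enumerable, "property is enumerable");
        assert_true(desc.configurable, "property is configurable");
      }, name + " descriptor matches its traits");

      // Generated property-attribute reflection test.
      if (t.attr) {
        test(function () {
          var el = document.createElement(t.iface === HTMLImageElement ? "img" : "div");
          el.setAttribute(t.attr, "from-attribute");
          assert_equals(el[t.prop], "from-attribute", "attribute reflects to property");
          el[t.prop] = "from-property";
          assert_equals(el.getAttribute(t.attr), "from-property", "property reflects to attribute");
        }, name + " reflects the " + t.attr + " content attribute");
      }
    });
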
Having such a large-scale, combined effort is really how we start converging the web. This is how we all start quickly working through obscure differences with a publicly visible suite, publicly contributed to, and, to be honest, prioritized by you, the community. With such obscure differences out of the way, our resources are freed up to deliver new web specifications and achieve convergence on new APIs much more quickly.

Final Thoughts

So what do you think? Crazy idea? Tried a thousand times? Have examples of existing collateral that you think accomplishes this goal? I'm open to all forms of comment and criticism. Any improvement to the web that lets me get out of the business of reducing web pages, and that finds issues earlier, is going to be a huge win for me regardless of how we accomplish it. I want out of the business of trying to understand how a site is constructed. There are far more web developers than there are browser developers, so the current solutions simply don't scale in my opinion.

As another thought, what about this article is interesting to you? For instance, are the numbers surprising or about what you thought in terms of how tested the web is? Would you be interested in understanding or seeing these numbers presented in a more reliable and public way?