
Tuesday, December 29, 2015

Progress Towards a Fully Tested Web API Surface Area

Back in September I was amazed by the lack of comprehensive testing for the Web API surface area, and as a result I proposed something akin to a basic, surface-level API suite that could be employed to make sure every API had coverage. Since that time a lot of forward progress has occurred: we've collected a bunch of additional data and we've come up with facilities for better ensuring such a suite is complete.

So let's start with the data again and figure out what we missed last time!

Moar Data and MOAR Test Suites

Adding more test suites will hopefully improve coverage in some way. It may not improve your API surface-area coverage, but it may deepen the coverage of a given API. We originally pulled data from the following sources:
  1. Top 10k sites - This gives us a baseline of what the web believes is important.
  2. The EdgeHTML Regression Test Suite - By far the most comprehensive suite available at the time, this tested ~2500 API entry points well. It did hit more APIs, but we excluded tests which only enumerated and executed DOM dynamically.
  3. WebDriver Enabled Test Suites - At the time, we had somewhere between 18-20 different suites provided by the web community at large. This hit ~2200 APIs.
  4. CSS 2.1 Test Suite - Mostly not an object model (OM) suite, so it only hit ~70 APIs
Since then we've added or improved the sources:
  1. Top 100k sites - Not much changed by adding sites.
  2. Web API Telemetry in EdgeHTML - This gave us a much larger set of APIs used by the web. It grew into the 3k+ range!! But still, only about 50% of the APIs we export are used by the web, making for a very large, unused surface area.
  3. DOM TS - An internal test suite built during IE 9 to stand up more standards-based testing. This suite has comprehensive depth on some APIs not tested by our other measures.
  4. WPT (Web Platform Tests) - We found that the full WPT might not have been running under our harnesses, so we targeted it explicitly. Unfortunately, it didn't provide additional coverage over the other suites we were already running. It did end up becoming part of a longer-term solution to web testing as a whole.
And thanks to one of our data scientists, Eric Olson, we have a nice Venn diagram that demonstrates the intersection of many of these test suites. Note, I'm not including the split-out WPT tests here, but if there is enough interest I can see whether we can produce a different Venn diagram that includes more components, or rework this one and pull out an existing pivot.


Since the diagram is so well annotated already, I won't go into too much detail, but I'll point out some key data points. The EdgeHTML DRTs have a lot of coverage not present in any public suites. That is stuff that is either vendor prefixed, MS specific, or that we need to get into a public test suite. It likely requires some work, such as converting the tests to testharness.js, before that happens, but we are very likely to contribute some things back to the WPT suite in the future. Merry Christmas!?!

We next found that the DOM TS had enough coverage that we would keep it alive. A little bit of data science here was the difference between deleting the suite and spending the development resources to bring it back and make it part of our Protractor runs (Protractor is our WebDriver-enabled harness for running public and private test suites that follow the testharness.js pattern).

The final observation is that there are still thousands of untested APIs even after we've added all of the coverage we can throw together. This further reinforced the need for our Web API test suite and pushed us to dedicate the resources over the past few months to get it up and running.

WPT - Web Platform Test Suite

In my original article I left out specific discussion of the WPT. While it was a joint effort amongst browser vendors, the layout of the suite and many aspects of its maintenance were questionable. At the time, for instance, there were tons of open issues, a large backlog of pull requests, and the frequency of updates wasn't that great. More recently there appears to be a lot of new activity though, so maybe it deserves to be revisited as one of the core suites.

The WPT is generally classified as suite-based testing. It is designed to be as comprehensive as possible. It is organized by specification, which arguably means nothing to web developers, but does mean something to browser vendors. For this reason, much of the ad-hoc and suite-based testing present in the DRTs, if upgraded to testharness.js, could slot right in. I'm hopeful that sometime after our next release we are also able to accompany it with an update for WPT that includes many of our private tests, so that everyone can take advantage of the collateral we've built up over the years.

Enhancing the WPT with this backlog of tests, and potentially increasing coverage by up to ~800 APIs, will be a great improvement, I think. I'm also super happy to see so many recent commits from Mozilla and so many pull requests making it back into the suite!

Web API Suite

We still need to fix the API gap though, and so for the past couple of months we've been working (mostly the work of Jesse Mohrland; I take no credit here) on a design which could take our type system information and automatically generate some set of tests. This has been an excellent process because we've now started to understand where more automatically generated tests can be created, and that we can do much more than we originally thought without manual input. We've also discovered where manual input would be required. Let me walk through some of our basic findings.

Instances are a real pain when it comes to the web API suite. We have about 500-600 types that we need to generate instances of. Some may have many different ways to create instances, which can result in differences of behavior as well. Certainly creating some elements will result in differences in their tagName, but they may still be of the same type. Since this is an API suite we don't want to force each element to have its own suite of tests; instead we focus on the DOM type, so we want to test one instance generically and then run some other set of tests on all instances.
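To make that concrete, here is a minimal sketch of the kind of instance table a generator might use; the factory names and the per-instance check are purely illustrative and are not the actual generator described above.

```javascript
// Illustrative only: several tag names can map to one DOM type, so the
// generator needs just one representative factory per type, plus a shared
// set of checks it runs against every instance it knows how to build.
const instanceFactories = {
  HTMLDivElement:   () => document.createElement("div"),
  HTMLQuoteElement: () => document.createElement("blockquote"), // "q" maps here too
  Text:             () => document.createTextNode("sample"),
  Range:            () => document.createRange()
};

for (const [typeName, create] of Object.entries(instanceFactories)) {
  const instance = create();
  // Generic per-instance check shared by every type in the table.
  console.assert(instance instanceof self[typeName],
                 `${typeName}: factory produces the expected DOM type`);
}
```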

We are not doing the web any service by only having EdgeHTML-based APIs in our list. Since our dataset is our type system description, we had to find a way to add unimplemented APIs to our list. This was fairly trivial, but hasn't yet been patched into the primary type system. It has so many benefits, though, that I'll enumerate them in a list (a sketch of what such an annotated record might look like follows the list) ;-)

  1. We can have a test score that represents even the things we are missing. So instead of only having tests for things that exist, we have a score against things we haven't implemented yet. This is really key towards having a test suite that is useful not just to EdgeHTML but also to other vendors.
  2. True TDD (Test Driven Development) can ensue. By having a small ready-made basic suite of tests for any new APIs that we add, the developer can check in with higher confidence. The earlier you have tests available the higher quality your feature generally ends up being.
  3. This feeds into our other data collection. Since our type system has a representation of the DOM we don't support, we can also enable things like our crawler based Web API telemetry to gather details on sites that support APIs we don't yet implement.
  4. We can track status on APIs and suites within our data by annotating what things we are or are not working on. This can further be used to export to sites like status.modern.ie. We don't currently do this, nor do we have any immediate plans to change how that works, but it would be possible.
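As a purely illustrative sketch (the real EdgeHTML type-system format is internal and not described in this post), an annotated record that also carries implementation status might look something like this; the field names and the example member are invented for the example.

```javascript
// Hypothetical shape for a type-system record that carries implementation
// status alongside the API description.
const apiRecord = {
  interface: "Navigator",
  member: "getBattery",
  kind: "method",
  implemented: false,            // kept in the data even though the API isn't implemented
  status: "under-consideration", // could drive scoring, telemetry, and status exports
  specification: "https://www.w3.org/TR/battery-status/"
};
```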
Many of these benefits are about getting your data closer to the source. Data that is used to build the product is always going to be higher quality than, say, data that was disconnected from it. Think about documentation, for instance, which is built and shipped out of a content management system. If there isn't a data feed from the product to the CMS, then you end up with out-of-date articles for features from multiple releases prior, invalid documentation pages that aren't tracking the latest and greatest, and even missing documentation for new APIs (or documentation that lingers for dead APIs).

Another lesson is that we want the suite to be auto-generated for as many things as possible. Initial plans had us sucking in the tests themselves, gleaning the user-generated content out of them, regenerating, and putting back the user-generated content (think custom tests written by a user). The more we looked at this, the more we wanted to avoid such an approach. For the foreseeable future we want to stop at the point where our data doesn't allow us to continue auto-generation. And when that happens, we'll update the data further and continue regenerating.

That left us with pretty much a completed suite. As of now, we have a smallish suite of around 16k tests (only a couple of tests per API for now) that is able to run using testharness.js, and thus it will execute within our Protractor harness. It can then trivially be run by anyone else through WebDriver. While I still think we have a few months to bake on this guy, I'm also hoping to release it publicly within the next year.
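For a flavor of what such generated tests might look like, here is a minimal sketch in the testharness.js style; the specific APIs and assertions are illustrative, since the generated suite itself isn't public.

```html
<!DOCTYPE html>
<!-- Assumes the standard testharness.js / testharnessreport.js pair from WPT. -->
<script src="/resources/testharness.js"></script>
<script src="/resources/testharnessreport.js"></script>
<script>
// One generated test per API: existence plus basic shape.
test(function() {
  assert_true("querySelector" in Document.prototype,
              "Document.prototype exposes querySelector");
  assert_equals(typeof document.querySelector, "function",
                "querySelector is callable");
}, "Document.querySelector: exists and is a function");

// A read-write property contributes two entry points: a getter and a setter.
test(function() {
  var desc = Object.getOwnPropertyDescriptor(Node.prototype, "textContent");
  assert_not_equals(desc, undefined, "textContent is declared on Node.prototype");
  assert_equals(typeof desc.get, "function", "textContent has a getter");
  assert_equals(typeof desc.set, "function", "textContent has a setter");
}, "Node.textContent: getter and setter entry points exist");
</script>
```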

Next Steps

We are going to continue building this suite. It will be much more auto-generated than originally planned. Its goal will be to test the thousands of APIs which go untested today by more comprehensive suites such as WPT. It should test many more thousands of unimplemented APIs (at least by our standards) and also some APIs which are only present in specific device modes (WebKitPoint on Phone emulation mode). I'll report back on the effort as we make progress and also hope to announce a future date for the suite to go public. That, for me, will be an exciting day when all of this work is made real.

Also, look out for WPT updates coming in from some of the EdgeHTML developers. While our larger test suite may not get the resources to push to WPT until after our next release, I'm still hopeful that some of our smaller suites can be submitted earlier than that. One can always dream ;-)

Sunday, September 20, 2015

Fixing Web Interoperability with Testing

For an old and mature API surface area such as HTML 5, you would think it would be relatively well tested. We've put years of effort into writing tests, whether as browser vendors or just as sites and frameworks, to make sure that our stuff works. But rarely do these efforts scale to the true size of the API set, its complications, its standardization, its interop across browsers, and all of the quirks that the web relies on.

However, judging by the numerous differences between browsers, it is pretty clear there are no canonical tests. A test for the web platform, if you assumed every browser was correct in its own way, would have tens of acceptable results. Developers would have to be aware of these different results and try to code in a way that allowed for all such possibilities. Alas, we don't have tests like this, and developers only write to one given implementation.

What does Interop Look Like Today?

Currently, interop between browsers is decided by site reports and bug fixes against those site reports. That means reducing millions of lines of JavaScript, HTML, and CSS to figure out what might be causing the problem for a given site.

Depending on which browser vendor is trying to match another, those bug fixes may or may not be good for the web. Arguably, implementing a bunch of webkit prefixes is great for making sites work, and users might even be happy, but it reduces the uptake of the standards versions of those same APIs, to the point that a relatively well-positioned webkit API might actually need to be added to the standard and marked as an alias for the standard itself. I'm not picking on webkit prefixes here either; it just so happens that they were more popular and so became the mindshare leaders.

So browser vendors rely on site reporting and mass per-site testing, analysis of complicated scripts and reductions, and potentially silent bug fixes to arrive at some sort of interoperable middle ground where most existing sites just work, most newly written sites are probably portable, and the web has multiple different ways of doing the same thing. This creates a complicated, sparsely tested API with many developer pitfalls. Current web interop really isn't looking so good.

But, are tests really going to make a difference?

Web Testing Statistics

I am about to throw out numbers. These numbers are based on real testing, aggregation, and profiling, however, they are not precise. We have not exhaustively tested our own collection methodology to ensure that we aren't missing something. But given the size of the numbers we believe these to be fairly statistically accurate. I'll apply an assumption that I have a +/- 5% error in the data I'm about to present.

Also, these are about tests that we run against EdgeHTML. Which also means the data collected is about how these tests work in EdgeHTML. It is possible that numbers swing and change as features are detected in different browsers. Okay, you probably just want the numbers ;-)

Test Set                       Unique APIs   Comments
Top 10000 web sites            ~2300         Site crawler live data, not testing
EdgeHTML Regression Suite      ~2500         Filtered; incidental usage not counted
Top WebDriver-enabled suites   ~2200         WebDriver enabled; any browser can run
CSS 2.1 Suite                  ~70           Not an API testing suite

But these numbers don't mean anything on their own: how many APIs are there? I can't give the number for every browser, but I can tell you that EdgeHTML is actually on the low side due to how our type system works. We adopted an HTML5-compliant type system earlier than some of the other browsers. We've also worked hard to reduce vendor prefixes. For this reason, Safari may have over 9k APIs detectable in the DOM. Firefox, which has a high level of standards compliance, will often have many moz-prefixed properties, and therefore their numbers will be elevated as well.

But for EdgeHTML the statistics are that we have ~4300 APIs and ~6200 entry points. The entry-point count is important since entry points represent things that aren't necessarily visible to the web developer: the way we do callable objects (objects that behave like functions), for instance, or the fact that read-write properties have both a getter and a setter, each of which must be tested.
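As a rough illustration of why entry points outnumber APIs, here is a small sketch you could run in a browser console; the counting rules are simplified and are not the pipeline we actually use.

```javascript
// Simplified counting: a plain method is one entry point, while an accessor
// property contributes one entry point for its getter and one for its setter.
function countEntryPoints(proto) {
  let apis = 0, entryPoints = 0;
  for (const name of Object.getOwnPropertyNames(proto)) {
    const desc = Object.getOwnPropertyDescriptor(proto, name);
    apis++;
    if (typeof desc.value === "function") {
      entryPoints += 1;               // method
    } else {
      if (desc.get) entryPoints += 1; // property getter
      if (desc.set) entryPoints += 1; // property setter
    }
  }
  return { apis, entryPoints };
}

console.log(countEntryPoints(Element.prototype)); // exact numbers vary by browser
```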

The numbers we collected on testing have been aggregated to APIs, but we also keep the data on entry points. The entry-point data always results in a lower testing percentage, since not testing a getter or setter can only take away from your percentage; it can't add to it.

So what if we added up all of the unique coverage? If we do that across the suites, then we end up with ~3400 APIs covered. That's not bad; we are starting to show that the "used web" might heavily intersect with what we test, right? Okay, let's run a few more numbers then.

What about APIs unique to the top 10000 web sites, hit by live pages but by none of our suites? It turns out there are ~600 APIs in that group. Ouch, that could be an interop nightmare. If we instead count only publicly available testing (and remove our proprietary testing), the numbers jump up significantly: over 1200 APIs are hit by live sites but not tested by public test suites... Ugh.

Confidence in Existing Tests

So we've talked a lot about the testing that does and does not exist, but what about the confidence we place in the existing testing? This is going to be less data driven since this data is very hard for us to mine properly. It is, in fact, a sad truth that our tracking capabilities are currently lacking in this area, and it creates a nasty blind spot. I don't know how other browser vendors fare here, but I do know that the Blink project does a lot of bug tagging and management that I would argue is superior to our own. So kudos to them ;-)

So notice that the EdgeHTML test suite is considered a regression suite. This is an evolution of testing that started likely 20 years ago. While we have added some comprehensive testing for new features, many older features, the old DOM if you will, only have basic testing. It is mostly focused on verifying that an API basically works, that it most certainly doesn't crash, and it often contains a set of tests for bugs in the API that we eventually fixed and created a "regression" test for. In IE 9, which is where this legacy DOM suite forks into a legacy and a modern suite, we carry over about half of our entry points. I have numbers from this period that range between 3100 and 3400 entry points depending on the document mode.

Given we mostly have regression suites, we also find that many APIs are only hit a small number of times. In fact, our initial test numbers were quite high, around 60% entry-point coverage, until we factored out incidental hits. Once the filtering was employed we were back down at 30%, and even that is likely an overestimate, since there is no perfect way to filter out incidental usage.

This all combines to put my confidence in our regression suite at about 50/50. We know that many of the tests are high value because they represent actual bugs we found on the web in how websites are using our APIs. However, many of them are now more than 5, 10, or even 15 years old.

What about the public web test suites that we covered? Well, those I'm also going to call 50/50, because we aren't really seeing those suites touch the same APIs, or test the APIs that the web actually uses. I mean, 1200 APIs not hit by those suites is pretty telling. Note, we do add more public test suites to our runs all the time, and the quality of a suite is not purely about how many APIs it tests. There is a lot of value, for instance, in a suite that only tests one thing extremely well. That requires grading a suite's quality through inspection, which is a step we have not done. We rely on jQuery and other major frameworks to vet their own suites and provide us with a rough approximation of something that matches their requirements.

So in all, given that our existing suites aren't very exhaustive, they often test very old functionality, and they miss a lot of APIs that are used by top sites, I'd subjectively rank the current test suites in my low-confidence category. How could we get a medium- to high-confidence suite that is cross-vendor and cross-device capable, and what would it look like?

Tools for a Public Suite

We already have some public suites available. What makes a suite public? What allows it to be run against multiple browsers? Let's look at some of the tools we have available for this.

First and foremost we have WebDriver, a public standard API for launching and manipulating a browser session. The WebDriver protocol has commands for things like synthetic input, executing script in the target, and getting information back from the remote page. This is a great start for automating browsers in the most basic ways. It is not yet ready to test more complicated things that are OS specific. For instance, synthetic input is not the same as OS-level input, and even OS-level synthetic input is not the same as input from a driver within the OS. Often we have to test deeply to verify that our dependency stacks work. However, the more we do this, the less likely the test is going to work in multiple browsers anyway. Expect advancements in this area over the coming months and years as web testing gets increasingly sophisticated and WebDriver evolves as a result.
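As a minimal sketch of those primitives (assuming Node with the selenium-webdriver package installed and an Edge WebDriver server available; the test URL is made up), launching a session, loading a page, and pulling a value back looks roughly like this:

```javascript
const { Builder } = require("selenium-webdriver");

(async function run() {
  // Start a browser session through the local WebDriver server.
  const driver = await new Builder().forBrowser("MicrosoftEdge").build();
  try {
    await driver.get("http://localhost:8000/tests/surface.html"); // illustrative URL
    // Execute script in the target page and read the result back over the protocol.
    const title = await driver.executeScript("return document.title;");
    console.log("page title:", title);
  } finally {
    await driver.quit();
  }
})();
```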

But WebDriver can really test anything in an abstract way. In fact, it could test an application like Notepad if there were a server for Notepad that could create sessions and launch instances. Something that generic doesn't sound like a proper tool; in fact it is just part of a toolchain. To make WebDriver more valuable we need to specify tests in ways that anyone's WebDriver implementation can pick them up, run them, and report results.

For this we have another technology called testharness.js. This is a set of functionality that, when imported into your HTML pages, allows them to define suites of tests. By swapping in a reporting script, the automation can change the reporting to suit its own needs. In this way you write to a test API by obeying the testharness.js contract, but then you expect that someone running your test will likely replace the reporting script with their own version so that they can capture the results.
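A minimal sketch of that seam might look like the page below; here the runner is assumed to inline its own stand-in for testharnessreport.js and to read window.__results back out afterwards (both the stand-in and the results variable are made up for the example).

```html
<!DOCTYPE html>
<script src="/resources/testharness.js"></script>
<script>
// Stand-in for testharnessreport.js, supplied by whatever harness runs the page.
add_completion_callback(function(tests, status) {
  // Reduce the results to something a WebDriver-based runner can read back.
  window.__results = tests.map(function(t) {
    return { name: t.name, status: t.status, message: t.message };
  });
});
</script>
<script>
// The test content itself only talks to the testharness.js API.
test(function() {
  assert_equals(document.createElement("div").tagName, "DIV");
}, "createElement upper-cases HTML tag names");
</script>
```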

The final bit of the puzzle is making tests broadly available and usable. Thankfully there is a great, public, open-source solution to this problem in GitHub. By placing your tests in a public repository and using testharness.js, you can rest assured that a browser vendor can clone, run, and report on your test suites. If your test suite is valuable enough it can be run by a browser vendor against their builds, not only protecting them from breaking the web, but protecting you from being broken. For large frameworks in broad usage across the web, this is probably the best protection you could have to ensure your libraries continue to work as the web evolves.

The Vendor Suite

Hopefully the previous section has replaced any despair I may have created at the beginning of the article with some hope instead. Because I'm not here to bum anyone out. I'm here to fix the problem and accelerate the convergence of interoperability.

To this end, my proposal is what I call the vendor suite. And note, while I call it this, it doesn't matter whether a vendor releases the suite or someone in the community does, but it has to be informed by the vendors and use data from the actual browsers in order to achieve a level of completeness the likes of which we have never seen. Also, these are my own opinions, not those of my entire team, which increases the likelihood that a large undertaking such as this may be something driven by the community itself and in turn supported by the vendors.

The qualities of the vendor suite are that it accomplishes the following immediate goals...

  1. We define tests in 3 levels
    1. API - Individual API tests, where each API has some traits and some amount of automated and manually created testing is involved. The level of testing is at a single API, and the number of tests is dependent on the complexity of the API itself.
    2. Suite - A more exhaustive set of tests which focus on a specification or feature area. These suites could accompany a specification and be part of their contribution to the standards process.
    3. Ad Hoc - Tests which represent obscure, but important to test behaviors. Perhaps these behaviors are only relevant to a given device or legacy browser. Potentially their goal is to "fail" on a compliant and interoperable browser.
  2. Tests should be written to verify the interoperable behavior until all browsers agree to transition to the standards behavior if there is a discrepancy. Interoperability is the primary goal. Do websites work? Not, does a browser adhere to a spec for which no web developer is pining.
  3. As much testing as possible should be driven by traits (a sketch follows this list).
    1. Attributes - At an API level the behavior of a property/attribute pair will be generically tested. Examples of common behavior are how the property and attribute reflect one another, and how MutationObserver callbacks are fired.
    2. Due to traits-based testing we can verify the configuration of the property down to the ES 6 level. This allows auto-generation of testing for read-only behavior and "use strict" adherence, along with other flags such as enumerability and configurability.
    3. Tests should be able to be read and regenerated while maintaining any hand-coded or manually added value. Things like throwing exceptions are often edge cases and would need someone to specifically test for those behaviors.
  4. No vendor will pass the suite. It is simply not possible. For example:
    1. Vendor prefixed APIs will be tested for.
    2. Hundreds if not thousands of vendor specific APIs, device configurations, scenario configurations, etc... may be present.
  5. To this end, the baselines and cross comparisons between browsers are what is being tested for. Vendors can use the results to help drive convergence and start talks on what we want the final behavior to be. As we align to those behaviors the tests can be updated to reflect our progress.
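To make the traits idea concrete, here is an illustrative sketch of what a traits-driven, auto-generated test could look like (it assumes testharness.js is loaded; the trait record format is invented for the example, not a real schema):

```javascript
// Hypothetical trait record; in a real generator this would come from the
// type-system data rather than being written by hand.
const trait = {
  interface: HTMLImageElement, property: "alt", attribute: "alt",
  readOnly: false, enumerable: true, configurable: true
};

// Reflection behavior: content attribute <-> IDL property.
test(function() {
  const el = document.createElement("img");
  el.setAttribute(trait.attribute, "hello");
  assert_equals(el[trait.property], "hello", "content attribute reflects to the property");
  el[trait.property] = "world";
  assert_equals(el.getAttribute(trait.attribute), "world", "property reflects back to the attribute");
}, trait.interface.name + "." + trait.property + ": reflection");

// ES 6-level property shape derived from the same trait record.
test(function() {
  const desc = Object.getOwnPropertyDescriptor(trait.interface.prototype, trait.property);
  assert_equals(desc.enumerable, trait.enumerable, "enumerable flag");
  assert_equals(desc.configurable, trait.configurable, "configurable flag");
  assert_equals(typeof desc.set === "function", !trait.readOnly, "setter presence matches read-only trait");
}, trait.interface.name + "." + trait.property + ": ES 6 property shape");
```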
Having such a large scale, combined effort is really how we start converging the web. This is how we all start quickly working through obscure differences with a publicly visible suite, publicly contributed to, and to be honest, prioritized by you, the community. With such obscure differences out of the way it frees up our resources to deliver new web specifications and further achieve convergence on new APIs much more quickly. 

Final Thoughts

So what do you think? Crazy idea? Tried a thousand times? Have examples of existing collateral that you think accomplishes this goal? I'm open to all forms of comment and criticism. Any improvement to the web that lets me get out of the business of reducing web pages, and that finds issues earlier, is going to be a huge win for me regardless of how we accomplish it. I want out of the business of trying to understand how a site is constructed. There are far more web developers than there are browser developers, so the current solutions simply don't scale in my opinion.

As another thought, what about this article is interesting to you? For instance, are the numbers surprising or about what you thought in terms of how tested the web is? Would you be interested in understanding or seeing these numbers presented in a more reliable and public way?