Wednesday, February 25, 2015

Attack of the Codes or the Phantom Menace

A few weeks back I talked about Undead Code: how it can erode your code base and how it can be brought back to life by an errant change. Recall that some of the criteria were that it be logically dead, awaiting someone to trigger the right parameters and bring it back to life. Today, after a short conversation with a co-worker about some "this has to be dead code" code, I went back to figure out why my normal tool-chain wasn't flagging it.

In doing so I found yet another interesting, nearly invisible bit of undead code. This code is compiled, linked into objects, formed into libraries, and yet somehow, when pitched from the binary, doesn't leave a trace. There is no indication that it was ever part of your compilation unit. Normally I would be able to find discarded symbols with discard tracking options, but in this case, if there were 0 symbols used from the object, then it appears there is nothing to discard.

At first I thought that maybe there would be some other symbol I could use or some other indication, like an unused object record. In my search I couldn't find one. Then I figured, well, all of the used files actually show up. In fact, if you use a tool from the DIA SDK called Dia2Dump, it has a files option (-f) that will print the contribution to each object and library from the files that were compiled. You can go a step further and even get line number contribution, which can lead to some really interesting findings, such as when the compiler evaluates a constant expression in your code (one that you thought was somehow not constant) and throws out either the if or the else block as a result.

Let's stop at the files though. You use "dia2dump -f" to get the file contributions, then you compare this against the files in your source tree. What you end up with is the set of files that are in your tree but, for whatever reason, are not part of the final binary (a rough sketch of that comparison follows the list below). You'll probably be amazed at what you find. I found the following types of items, in case you are interested.

  1. Large helper templates for type-safe casting.
  2. Drawing helpers, used by another file that was not flagged as either a) Linker Dead Code or b) my new file technique. In this case I believe they would have been identified once I started pruning the leaves.
  3. A sub-parser we no longer use.
  4. Some base class functions and helpers that weren't used or should have been marked pure.
Some of this code, in order to compile, requires libraries and other dependencies not needed by any of the code that actually contributes to the binary. Removing it allowed for further reductions in the build files, which is always welcome.
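
If you want to play along, here is a minimal sketch of that tree-versus-PDB comparison in Python. It assumes you have already captured the output of "dia2dump -f yourbinary.pdb" to a text file; the paths, extensions, and the regex scrape of the dump are placeholders you would adapt to your own tree and toolchain, not a faithful parser for the dump format.

```python
import os
import re

# Hypothetical inputs: adjust to your own tree and captured dump.
SOURCE_ROOT = r"c:\src\myproject"
DUMP_FILE = r"c:\temp\dia2dump_files.txt"   # captured output of: dia2dump -f mybinary.pdb
SOURCE_EXTS = {".c", ".cpp", ".cc", ".cxx", ".h", ".hpp", ".inl"}

def files_in_tree(root):
    """Collect every source file under the tree, normalized for comparison."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in SOURCE_EXTS:
                found.add(os.path.normpath(os.path.join(dirpath, name)).lower())
    return found

def files_in_dump(dump_path, root):
    """Pull anything that looks like a path under SOURCE_ROOT out of the dump.

    The real dia2dump -f output lists the files contributing to each
    compiland; this regex scrape is a stand-in for proper parsing.
    """
    contributing = set()
    pattern = re.compile(re.escape(root) + r"[^\s\"']+", re.IGNORECASE)
    with open(dump_path, "r", errors="ignore") as dump:
        for line in dump:
            for match in pattern.findall(line):
                contributing.add(os.path.normpath(match).lower())
    return contributing

if __name__ == "__main__":
    in_tree = files_in_tree(SOURCE_ROOT)
    in_binary = files_in_dump(DUMP_FILE, SOURCE_ROOT)
    phantoms = sorted(in_tree - in_binary)   # compiled, linked, yet contributing nothing
    print(f"{len(phantoms)} candidate phantom files:")
    for path in phantoms:
        print("  " + path)
```

Anything this flags is only a candidate: headers, generated files and the like will need some manual filtering before you start deleting.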

As the title suggests I'm going to call this kind of code "The Phantom Menace" and I'll be improving my tool chains to report it accurately while at the same time figuring out why some additional dependent files weren't found.

--- Errata
If you want to remove build dependencies, then setting your LINKER_FLAGS to /VERBOSE:UNUSEDLIBS can point out a lot of useless junk that you are loading and scanning. Using this with the Dia2Dump sample, you'll find that uuid.lib is apparently not needed.

I can publish more of these tools, but internally I use a different tool than Dia2Dump. Also, the Dia2Dump that I just built out of the VS 2014 preview I have on hand crashes on some inputs.

Sunday, February 22, 2015

Automated Tests, Failure, Retries and Confidence

For the past week I've been targeting a kind of long-term problem for my team. We have a very large suite of tests that we run, but recently we've reached a level of failure that means we can't run the tests fully automated. Due to problems like VM setup and management, sporadic infrastructure failure, and flaky tests, confidence in the system has eroded, and we resort to retrying failed tasks a couple of times before we declare a job truly failed.

This lack of confidence results in a marked and measurable increase in effort on the part of the development team to process check-ins. On any given week, this might result in anywhere between a 7% and 45% decrease in job throughput. Obviously, not being able to run unattended during off-business hours can greatly impact the achievable throughput. All is not lost though, since a few manual jobs the next day always seem to "play catch-up".

At what cost does that "catch-up" come? Well, we get similar (though slightly lower) commit counts since we can do large merge jobs. The system processes the commits together though, so we get fewer data points for other tests that trigger from the committed build. Performance tests, which run on the merge instead of on the individual commit, are a perfect example: when they fail in one of these merges we now have a haystack and must look for needles.

We also get social changes. Instead of smaller incremental commits, devs will "stack up" multiple dependent commits into a single commit. This in turn breaks principles like "Check-in Early, Check-in Often" that tend to lead to easier diagnosis of code defects. Developers also begin to accept sporadic failure and will retry a failure many times during a merge in the hopes it will "just pass" and let the merge through. This can cover up timing conditions and concurrency issues, which can live either in the tests themselves or in the code being tested, both of which contribute to further flakiness and issues later.

Failure Rates on Throughput

We had a bunch of historical data on the system, but it was draped in controversy. Which jobs failed due to bad changes? Which failed due to a sporadic test failure? How confident were we in classifying one way or the other? How do we distinguish infrastructure and VM setup problems from sporadic test failures? These should all be easy questions to answer if the data is clean, but in a system that processes hundreds of thousands of tests per day across hundreds of commits, the data is anything but clean. Instead we attacked the problem from the other direction and simply looked at the current corpus of tests and their reported failure rates.

Our per-check-in test infrastructure contains ~1200 test suites. Our ability to turn off tests within a suite is all or nothing, but a developer can choose to turn off a sub-test within the suite by manually committing a test update that comments out or otherwise removes the test. Initial data showed that ~150 tests were on our watch list: 80 were already disabled, some were run only to collect data but ignored on failure, and the rest were marked as potentially unstable, meaning we had previously seen them fail at least once.

In addition to the ~150 already marked, we estimated another 50 tests that failed more often than 1 in 100 runs. Using these numbers, we tried to calculate the theoretical throughput of the queue. The following table shows two immediate observations. First, 200 flaky tests would really suck, but even 100, which is less than our potential 120 (the 70 still enabled from the watch list plus the 50 additional), would mean we only pass jobs 1 in 3 runs. Second, our live data was actually around the ~12% range, so something else was still off: either our infrastructure modelling was wrong, or we had more flaky tests than we thought. Or both!

Test Category          Pass Rate    Test Variations
Disabled               1.0             0     100     180     195
Infrastructure         0.9999       1000    1000    1000    1000
Flaky Tests            0.99          200     100      20       5
Unattended Pass Rate                ~12%    ~33%    ~74%    ~86%
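
For the curious, the table's math is easy to reproduce. Here is a quick sketch, assuming every test passes or fails independently; the counts and rates are the illustrative ones from the table, not measured values.

```python
# Reproduce the unattended pass rate column by column, assuming every
# test passes or fails independently of the others.
INFRA_TESTS, INFRA_PASS = 1000, 0.9999
FLAKY_PASS = 0.99

for flaky_count in (200, 100, 20, 5):
    job_pass = (INFRA_PASS ** INFRA_TESTS) * (FLAKY_PASS ** flaky_count)
    print(f"{flaky_count:>3} flaky tests -> ~{job_pass:.0%} unattended pass rate")

# Prints roughly 12%, 33%, 74% and 86%, matching the table above.
```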

This at least gave us a good measure of how many flaky tests we could have and still achieve a decent throughput. It also gave us some options for impacting the test pass rate, such as automatic retry counts for some tests, which could allow us to maintain some form of coverage while dramatically increasing the pass rate for those tests. We'll talk about why this can be bad later.

Retrying Tests

At this point, disabling a bunch of tests seems pretty viable, but we'd still be failing 1 in 4 jobs if we got down to, say, 20. Do we have any reason to leave those tests running? Potentially. Can we drive them to 0? Potentially. But what about those retry counts? If the probability that a test will fail is 1%, then the probability it would fail 2, or 3, or 4 times in a row gets increasingly smaller, right? Well, statistically yes, assuming that the trials are independent.

The types of failures that a retry would help with include a timing or concurrency issue, a pre- or post-configuration issue (something in the environment that might change between test runs), or a machine load issue, if the re-run is done at a time when the machine has more resources available.

The type of failure that a retry would not help with is a persistent machine issue; in this case, clearing the test to run on a different machine is more applicable. And if the entire VM infrastructure is constantly under load, it may drive the normal 99% pass rate down so significantly that even the retry is likely to fail.

So are retries good or bad? It all depends. They are bad if they are used as a crutch to allow flaky tests to persist in the system indefinitely. They are also bad if they allow chronic timing and concurrency issues to sneak through over and over again as they fly below the retry radar. If your team consistently relies on this approach, then eventually even a retry won't save the system: at 200 flaky tests, even with a retry, you still have roughly a 12% chance of failing the job. A team relying on this crutch can allow huge portions of their testing infrastructure to degrade to the point that even retries don't help anymore.

In our situation, where we have 5-20 remaining flaky tests, it might be worth enabling retries to drive the numbers up and increase developer confidence in the runs. For instance, if you know that a test only fails due to a specifically understood timing condition, then you might retry only when that condition occurs. Adding a retry count of 2 to the above table could increase throughput to roughly 90%, all assuming that each attempt passes or fails independently of the first.
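
As a rough sketch of that claim, the earlier model can be extended with a retry: a flaky test only fails the job if every attempt fails. This reads "a retry of 2" as two total attempts per test, uses the illustrative numbers from the table, and again assumes the attempts are independent.

```python
# Extend the model with retries, still assuming each attempt is independent.
# With 2 attempts, a flaky test only fails the job if both attempts fail.
INFRA_PASS, INFRA_TESTS = 0.9999, 1000
FLAKY_FAIL = 0.01
ATTEMPTS = 2

effective_flaky_pass = 1 - FLAKY_FAIL ** ATTEMPTS   # 0.9999 per flaky test
infra_term = INFRA_PASS ** INFRA_TESTS              # ~0.90, unaffected by test retries

for flaky_count in (200, 100, 20, 5):
    job_pass = infra_term * (effective_flaky_pass ** flaky_count)
    print(f"{flaky_count:>3} flaky tests with {ATTEMPTS} attempts -> ~{job_pass:.0%}")
```

Notice that the infrastructure term still caps the job pass rate at roughly 90% no matter how aggressively you retry the tests themselves.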

Infrastructure Degradation

As we started identifying and disabling tests, we quickly found some categories of test failures. It turns out, in many cases, that the tests we thought were flaky were not flaky at all. The infrastructure itself had already failed. Turns out VMs are pretty finicky things and setting up unreleased OSes that change daily on them can be quite challenging. Sometimes the installation mechanisms change and your prerequisites are not present. Maybe an entire tools folder is missing. Maybe something as innocuous as a font failure. Whatever it might be, if the failure does not happen equally on every VM and either retrying or clearing can fix the problem, then you can easily overlook the fact that your infrastructure is flaky.

In addition, a lot of tests run on the flaky infrastructure. In fact, 1000 tests run in each job, so there is a very real chance that one of your otherwise stable tests will still fail simply because of the machine it landed on. With these new failure categories in hand, the calculation becomes the likelihood of an infrastructure failure multiplied by the probability that a dependent test would then run on that machine. These equations get complicated pretty fast so I won't go into them, but it actually turns out our infrastructure failure rate of 1 in 10,000 was probably closer to 1 in 2000 or even 1 in 1000. So even if every single test were awesome, our chances of passing start to fall rapidly.
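
To see how hard the infrastructure term bites, here is a small sketch using the rates mentioned above. It assumes each of the 1000 tests in a job can independently hit an infrastructure failure, and shows the best possible job pass rate even if every single test were perfectly stable.

```python
# The infrastructure term dominates: vary the per-test infrastructure
# failure rate and watch the ceiling on the job pass rate collapse,
# even with perfectly stable tests.
TESTS_PER_JOB = 1000

for rate_label, failure_rate in (("1 in 10,000", 1 / 10000),
                                 ("1 in 2,000", 1 / 2000),
                                 ("1 in 1,000", 1 / 1000)):
    ceiling = (1 - failure_rate) ** TESTS_PER_JOB
    print(f"infra failure {rate_label:>11} -> best possible job pass rate ~{ceiling:.0%}")

# Roughly 90%, 61% and 37% respectively.
```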

Note that retry counts for infrastructure problems are never a good idea. Since infrastructure issues tend to cause a persistent test failure on that machine, only clearing the test to run on another machine would work. When you have large-scale testing systems these types of things are usually possible; in our case, while possible, they are not trivial to configure. They also degrade confidence and have the same problem as using retries for flaky tests: how can you be sure you aren't allowing very rare timing conditions into the code?

Confidence

So how do you fix the problem and have confidence that you can maintain the throughput moving forward? After all, how did you get to a point where confidence was lost to begin with? That aside, the fastest path to confidence and throughput in our case was disabling the flaky tests and prioritizing their fixes. With all flaky tests disabled and the various infrastructure issues categorized, the work to fix the infrastructure can also be scheduled and prioritized. While I don't know what our final unattended pass rates will be, I can provide some insight into our short-term throughput rates after having implemented these changes.

First and foremost, our test coverage reductions were in the area of 0.8% of our test suite. This in turn represents some amount of actual code coverage by block and/or arc; in our case the arcs are important since they cover a lot of important edge cases. Note that this won't add up if you crunch the numbers: suffice to say that during the same time we disabled some tests, we also fixed some infrastructure and some existing tests. Also, losing 0.8% of our test suite doesn't represent a number large enough to lower confidence in the suite itself. After all, these flaky tests were already being rerun and ignored since they cried wolf far too often.

Our (statistically unsound) pass rates versus test coverage ran as high as 100% on a given day, though there weren't enough samples to lean on that number. It also didn't account for some infrastructure issues, which hit failure rates as high as 70% for us (bad VMs that had to be diagnosed and then rebuilt). For 3 days though, all accounted for, with no mucking with the numbers, we saw up to 50% throughput. Not bad for a week's work: going from a fully manual check-in queue to a queue that can pass jobs cleanly at least half the time. This in turn raised our confidence in the suite significantly. We've already seen great response to failures, since they now represent something real and tangible, not simply a flaky test that needs to be rerun.

Conclusion

My major learning from this process has really been how important it is to maintain the stability of your testing systems and not let flakiness creep in. Even small sets of failures can dramatically reduce your throughput. Small failures quickly allow the system to degrade further, and there is a vicious statistical spiral that eventually forces a mode of fully manual operation. Once the system is manual, it becomes a herculean effort to pull it back.

I would also say it seems counter-intuitive that to improve the system you have to begin turning off your test coverage. Removing devalued coverage in turn increases the value of the rest of the system. Further, once coverage is stable, remaining problems become infrastructural and fair game to correct as well.

This will be a continued effort for me. I'll be stabilizing, improving and fixing the test coverage until we are happy and confident with the throughput, infrastructure and test coverage. I would say it should be fun, but it's more likely it will simply be enlightening and rewarding ;-)

Sunday, February 8, 2015

Gamifying FitForFood and Where are Apple and Microsoft?

Recently I got a message from FitBit telling me about FitForFood. They are going to donate 1.5 million meals by the end of the month, assuming that members of the FitBit community can clock 1 billion active calories. Doing the math, 1.5 million meals at 800 calories each is about 1.2 billion calories, which roughly lines up with the 1 billion calorie goal. They are going to donate $150,000 to pay for these 1.5 million meals (I had no idea an 800 calorie meal could be produced for 10 cents, considering my price for 800 calories is likely in the $20 range).

I've computed tons more numbers (I love Excel, and numbers, and gamification), so I was inspired to spend quite a bit of time on this to really figure out: how in the hell can we gamify and solve hunger in the US?

My results are pretty dismal, since the fact sheets really show that $150,000 is a drop in the bucket. The folks at Feeding America are telling me there are 45 million hungry people (they use terms like food insecure, a nicer way to say hungry). Now, you can imagine that those people have SOME food, just not all of the food they need. You can also imagine they don't have a magical 10-pennies-per-meal option. So a simple calculation of 2 meals per day, times 45 million people, times 365 days in a year could at least give us an upper bound on the problem. That upper bound, if we had a 10-cents-per-800-calorie meal factory, comes out to roughly $3.2 billion a year. So we need to multiply this FitForFood thing by about 21,000x to solve the problem.
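
For the skeptical, the back-of-the-envelope math looks like this. Every number in it is a rough assumption pulled from the paragraphs above, not real pricing or logistics data.

```python
# Rough upper bound on the cost of the problem, using the assumptions above.
COST_PER_MEAL = 0.10          # the implied 10-cent, 800-calorie meal
HUNGRY_PEOPLE = 45_000_000    # Feeding America's food-insecure estimate
MEALS_PER_DAY = 2
DAYS_PER_YEAR = 365
FITFORFOOD_BUDGET = 150_000   # FitBit's pledged donation

annual_cost = COST_PER_MEAL * MEALS_PER_DAY * HUNGRY_PEOPLE * DAYS_PER_YEAR
multiplier = annual_cost / FITFORFOOD_BUDGET

print(f"Upper bound: ~${annual_cost / 1e9:.1f} billion per year")  # a bit over $3 billion
print(f"FitForFood would need to scale ~{multiplier:,.0f}x")       # on the order of 21,000x
```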

Who Can 21,000x This Thing

So then I was like, hey wait, this Apple company makes a metric ton of cash by selling devices that track active calories all the time. To them, $3.2 billion is like a drop in the bucket. And if we can cut the cost of a 10 cent meal by even a penny, the numbers drop fast. So actually, if we believe the marketing in FitForFood, we externalize all of the costs of finding the hungry people, preparing the food and distributing the food. Then there is in fact a solution, and it could come solely from Apple. The PR and goodwill, increased device sales, etc... from such a worthy showing by a company might actually make them more than $3.2 billion BACK. Wouldn't that be crazy?

Wait, Where is Microsoft Band

So one thing about all of the PR from this effort is that you have to be in a position to drive sales, since you need some way to keep up the good work. Things like the FitBit and the Microsoft Band are consumable devices: you break them, their batteries die after some time, you lose them, you wash them (though my FitBit has lived through 3 washes and 2 dryings, which is impressive as hell to me). So a highly popular yet sold-out device like the Microsoft Band, which offers to help with hunger and poverty if you buy one and actively use it, could really get a boost here, I'm thinking.

Instead of a $150,000 limit, put a per-device limit of, say, $10 and challenge people to donate that way. That is roughly 80k active calories, or 100 meals. For your users, that is over 250 gym visits or workouts (extrapolating from a 300 calorie active burn during an average workout). And this stuff is inspiring. I mean, I'm feeding people. So am I more likely to stay on the exercise plan and keep up the active calorie burn? I think so. I mean, I already walk more steps so I can rub it in the faces of my 4 or 5 friends on FitBit. And now I can rub it in their faces that I'm helping more hungry people than they are.

Alas, as of the time of this writing, I've found it exceptionally difficult to even find a Microsoft Band. I've had a couple of friends tell me that they are available in some of the stores from time to time, but the online store has been out for a while. If you set out to build your brand on doing public good, you have to be prepared to meet demand, and sadly the Microsoft Band just might not be the right device for this one reason: no supply.

Gamification of FitForFood

So let's take it to the next level. While FitForFood is already a limited form of gamification, they've relied solely on celebrity engagement and goodwill. I'm not seeing "Active Calories" in my FitBit application. I should be. Show them to me. Inspire me to make more of them. I'm not getting new badges for it; at least I haven't earned any yet. I'm not seeing any challenges from friends over Active Calories. Why not? Why can't I challenge my friends to a feed-the-hungry event? Who cares if I walked all over Seattle and back; instead, let me show off that I RAN all over Seattle, just to feed hungry people.

What about multipliers? How about days when stairs count double, getting me to walk up more flights of stairs (or just count all stairs as Active Calories, because we all know stairs are painful)? What about active donations? Why can't I sign my profile up so that people pay ME for the calories I burn, with that money going to donate further meals to the hungry? What about everyone with a FitBit scale donating 3600 calories for every pound they lose (my fat goes to feed the hungry? Yes please!)? The possibilities are endless here.

Badges, achievements, challenges, friendly competitions: this FitForFood thing is clearly on the cusp of being something great, if only it added more of the facets of interaction that make games and game-like applications sticky.

The Challenge

So now I want to know: what does it take? How do you build on FitForFood and create a sustained economy that can help feed the hungry without bankrupting the company that is driving the campaign? You could imagine extending this to other social problems as well. How many people every year want to donate, but find it difficult to figure out how, or to get their tax breaks, etc...? How do we bring this all together and take advantage of Internet scale, Internet speeds and Internet costs? And most importantly, how do we take advantage of gamification to make the rewards intrinsic, increase engagement, and improve longevity of interaction?

I'm kind of jazzed here, if you can't tell. If you know of any NPOs that are taking advantage of gamification like this, contact me and let me know. I'm curious how successful they can be, how large they can scale, and truly how global they can be.