Sunday, November 22, 2015

From State of the Art to the State of Decay

I'm constantly in meetings where we are discussing 10+ year old code that "smart people wrote," so it must be fairly good. It seems people take offense when you call 10 year old code bad names, especially if they were somehow connected to it.

Let me start with myself. I've been working on the Internet Explorer code base for over 10 years, having started just before IE 7 Beta 1 was shipped. Instead of referring to other people's terrible code, I'll refer to my own terrible code. I'll also talk about our thinking on "State of the Art" at the time and how that leads to now, where the same code that was once the bee's knees is just a skeleton of its former self, having suffered through the "State of Decay".

State of the Art in 2005

We had a legacy code base that had been dusted off after many years of neglect (IE 6 had been shuttered and IE 7 was the rebirth). The ultimate decay: many years of advancements in computer science, none of which were applied to one of the largest and most complex code bases you could conceive of. We had some new tools at the time, static analysis tools, which were very good at finding null pointer dereferences alongside actual security bugs. We had tens of thousands of findings (maybe even hundreds of thousands) across all of Windows that needed to be investigated, understood and fixed.

Our idea of state of the art at this time was that the code shouldn't crash. Ever. Reliability in our minds meant error recovery, checking nulls all over the place, and protecting against the most malicious of external code, since after being shuttered we had no idea who or how many consumers of our code there were. Fixing any single line of code presented us with a challenge. Did we break someone? In this state, understanding the ramifications of adding even a single null check and bail-out could take hours to days per line of code. And spend hours and days we did.

I personally introduced hundreds, perhaps thousands, of null pointer checks into the code. After all, this was the state of the art at the time. We wanted stability and reliability, and most of all we wanted to shut down those rare cases where the tool was pointing out an actual security hole. Of course I thought I was doing the right thing. My code reviewers did too. We all slept well at night getting paid to put in null pointer checks to work around the fact that our invariants were being broken constantly. By doing all of this work, we were forever compromising our ability to use the crashes we were fixing to find actual product bugs. "State of the Art" indeed. So how should those conversations about 10+ year old code be going? Should I really be attached to those decisions I made 10 years ago, or should I evolve to fix the state of decay and improve the state of the art moving forward?
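
To make the pattern concrete, here is a minimal, hypothetical sketch (none of these names come from the real code base) of what one of those defensive checks looks like: the invariant says a node always has a parent, but instead of treating a missing parent as the bug it is, the code quietly bails out.

    #include <cstdio>

    struct Layout { int width = 0; };

    struct Node {
        Node*  parent = nullptr;   // invariant: every attached node has a parent
        Layout layout;

        // 2005-era pattern (hypothetical illustration): check for null and
        // bail out quietly instead of treating a missing parent as fatal.
        bool EnsureLayout() {
            if (parent == nullptr)      // defensive check for a "can't happen" case
                return false;           // bail-out hides the broken invariant
            layout.width = 100;
            return true;
        }
    };

    int main() {
        Node detached;                  // invariant already broken: no parent
        std::printf("EnsureLayout: %s\n", detached.EnsureLayout() ? "ok" : "bailed out");
        // The crash we could have used to find the real bug never happens.
    }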

State of Decay in 2015

Those decisions 10 years ago have led to a current state of decay in the code. Of those thousands of null pointer checks, how many remain necessary? When new developers look at the code, what should they glean from the many if-checks for conditions that may or may not be possible? The cognitive load of putting one in cost me hours and sometimes even days. What about the cognitive load to take one out?

It is now obvious that the code is no longer state of the art, but bringing the code up to a quality similar to new code is an extreme amount of work. It is technical debt. To justify the technical debt we talk about how long that code has been executing "without causing problems," and we bring up those "smart people" from the past. Yes, we were smart at the time, and we made a locally optimal decision to address a business need. That doesn't mean maintaining that decision is the best course of action for future business needs (in fact, it is rarely the case that leaving anything in a state of disrepair is a good decision, since the cost of deferring the repair of decay is non-linear).

I gave a simple example of null pointer checks to address reliability. There were many more decisions made that increased our technical debt versus the current industry state of the art. Things like increasing our hard dependencies on our underlying operating system to the extent that we failed to provide good abstractions to give us agility in our future technology decisions (I'm being nice, since we also failed to provide abstractions good enough to even allow for proper unit testing). We executed on a set of design decisions that ensured only new code or heavily rewritten code would even try to establish better abstractions and patterns for the future. This further left behind our oldest code and it meant that our worst decisions in the longest state of decay were the ones that continued accruing debt in the system.
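
As a rough illustration of the missing seam (the interface and class names here are hypothetical, not taken from the actual code base), the kind of abstraction we failed to provide looks something like this: the component depends on a small interface rather than calling the OS directly, which also makes it trivially unit-testable against a fake.

    #include <cassert>
    #include <string>

    // Small abstraction over an OS facility (clipboard here, purely for illustration).
    struct IClipboard {
        virtual ~IClipboard() = default;
        virtual void SetText(const std::string& text) = 0;
        virtual std::string GetText() const = 0;
    };

    // Test double: no OS dependency, so the component logic can be tested in isolation.
    struct FakeClipboard : IClipboard {
        std::string buffer;
        void SetText(const std::string& text) override { buffer = text; }
        std::string GetText() const override { return buffer; }
    };

    // Component code written against the abstraction, not against the OS.
    void CopySelection(IClipboard& clipboard, const std::string& selection) {
        clipboard.SetText(selection);
    }

    int main() {
        FakeClipboard fake;
        CopySelection(fake, "hello");
        assert(fake.GetText() == "hello");   // verified without touching Win32 at all
    }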

Now there are real business reasons for some of this. Older code is riskier to update. It's more costly to update. It requires some of your most expert developers to deal with, developers you'd like to put on the new shiny. Not everyone's name starts with Mike and ends with Rowe, so you risk losing top talent by deploying them on these dirtiest of jobs. Those are valid reasons. Invalid reasons are that smart people wrote the code 10+ years ago, or that when you wrote that code it was based on state of the art thinking. As we've noted, the state of the art changes, very quickly, and the code has to adapt.

State of the Art in 2015

It's useful, then, to look at how many "bad decisions" we made in the past, since if we could have foreseen the future, we could have made the 2015 decision back then and been done with all of this talk. Our code could be state of the art and our legacy could be eliminated. So what are some of those major changes that we didn't foresee, and therefore have to pay a cost now to adapt to?

Well, reliability went from trying to never crash to crashing as soon as something comes up that you don't have an answer for. Don't recover from memory failures. Don't recover from stack overflow exceptions. Don't recover from broken invariants. Crash instead. I recently posted on our adoption of the fail-fast model in favor of our error recovery model and the immense value we've already gotten from this approach. You can read about that here: "Improving Reliability by Crashing".
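
Returning to the earlier null check example, here is a hedged sketch of the difference between the two philosophies (names are invented, and the real code uses Windows fail-fast mechanisms rather than std::abort): the fail-fast version stops at the exact point where the invariant is discovered to be broken.

    #include <cstdio>
    #include <cstdlib>

    struct Node { Node* parent = nullptr; };

    // 2005-era style: try to recover from a broken invariant and keep running.
    bool GetDepthRecover(const Node* n, int* depth) {
        if (n == nullptr || depth == nullptr)   // defensive bail-out
            return false;
        *depth = 0;
        for (const Node* p = n->parent; p != nullptr; p = p->parent)
            ++*depth;
        return true;
    }

    // Fail-fast style: a broken invariant is a bug, so stop immediately where it
    // is detected instead of limping on in an unknown state.
    #define FAIL_FAST_IF(cond)                                      \
        do {                                                        \
            if (cond) {                                             \
                std::fprintf(stderr, "Fail fast: %s\n", #cond);     \
                std::abort();                                       \
            }                                                       \
        } while (0)

    int GetDepthFailFast(const Node* n) {
        FAIL_FAST_IF(n == nullptr);             // crash at the point of corruption
        int depth = 0;
        for (const Node* p = n->parent; p != nullptr; p = p->parent)
            ++depth;
        return depth;
    }

    int main() {
        Node root;
        int depth = 0;
        GetDepthRecover(&root, &depth);          // old style: caller must remember to check the bool
        std::printf("recover-style depth = %d\n", depth);
        std::printf("fail-fast depth = %d\n", GetDepthFailFast(&root));
    }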

Security has seen an uptick in the number of exploits driven by use-after-free conditions. Now, years ago a model was introduced to fix this problem, and it was called Smart Pointers; at least in the Windows code base, where COM was prevalent, this was the case. Even now in the C++ world you can get some automatic memory management using unique_ptr and shared_ptr. So how did that play out? Well, a majority of the code in the district of decay was still written using raw pointers and raw AddRef/Release semantics. So even when the state of the art became Smart Pointers, it wasn't good enough. It wasn't deployed to improve the decay; the decay and rot remained. After all, having a half-and-half system just means you use the broken 50% of the code to exploit the working 50%.
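
A hand-rolled sketch of the two styles side by side (illustrative only; the real code base uses COM interfaces and established wrappers rather than this toy RefPtr): in the raw style every path must remember its Release, while the smart pointer does it on every path automatically.

    #include <cstdio>

    // Minimal refcounted object, standing in for a COM object.
    struct RefCounted {
        int refs = 1;
        void AddRef()  { ++refs; }
        void Release() { if (--refs == 0) delete this; }
        virtual ~RefCounted() = default;
    };

    struct Element : RefCounted {
        void Paint() { std::printf("painting\n"); }
    };

    // Raw AddRef/Release: a missed Release (leak) or an early Release
    // (use-after-free) is entirely up to the author of every code path.
    void PaintRaw(Element* e) {
        e->AddRef();
        e->Paint();
        e->Release();       // forget this on one error path and the bug is yours
    }

    // Tiny smart pointer: AddRef in the constructor, Release in the destructor.
    template <typename T>
    class RefPtr {
        T* p_;
    public:
        explicit RefPtr(T* p) : p_(p) { if (p_) p_->AddRef(); }
        ~RefPtr() { if (p_) p_->Release(); }
        RefPtr(const RefPtr&) = delete;
        RefPtr& operator=(const RefPtr&) = delete;
        T* operator->() const { return p_; }
    };

    int main() {
        Element* raw = new Element();   // refcount == 1
        PaintRaw(raw);
        {
            RefPtr<Element> e(raw);     // refcount == 2, released automatically
            e->Paint();
        }
        raw->Release();                 // final release deletes the object
    }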

So basically "smart pointers" written by "smart people" turned out to not be the final conclusion of the state of the art when it came to memory lifetime and use after free. We have much, much better models if we just upgrade all of the code to remove the existing decay and adopt them. You can read about our evolution in thinking as we advanced through "Memory Protector" and on to "MemGC". If you read through both of those articles one commonality you'll find is that they are defense in depth approaches. They do completely eliminate some classes of issues, but they don't provide the same guarantees as an explicitly reference counted system that is perfect in its reference counting. They instead rely on the GC to control only the lifetime of the memory itself but not the lifetime of its objects.
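
To give a feel for the defense-in-depth idea (this is a conceptual sketch, not the actual Memory Protector or MemGC implementation), imagine the free path wiping a block and deferring its reuse instead of returning it to the allocator immediately; a stale pointer then lands on zeroed, still-owned memory rather than an attacker-controlled reallocation.

    #include <cstring>
    #include <utility>
    #include <vector>

    // Conceptual sketch only: freed blocks are wiped and parked on a deferred
    // list instead of being handed back to the allocator right away.
    class DeferredHeap {
        std::vector<std::pair<void*, size_t>> deferred_;
    public:
        void* Alloc(size_t size) { return ::operator new(size); }

        void Free(void* p, size_t size) {
            std::memset(p, 0, size);            // neutralize the old contents
            deferred_.push_back({p, size});     // defer reuse of the block
        }

        // In the real designs, blocks are only truly released once a scan proves
        // nothing (stack, registers, heap) still points at them.
        void ReclaimAll() {
            for (auto& block : deferred_)
                ::operator delete(block.first);
            deferred_.clear();
        }
    };

    int main() {
        DeferredHeap heap;
        int* p = static_cast<int*>(heap.Alloc(sizeof(int)));
        *p = 42;
        heap.Free(p, sizeof(int));
        // A dangling use of p now sees zeroed memory, not a reallocated object:
        // the bug is still a bug, but it is much harder to turn into an exploit.
        heap.ReclaimAll();
    }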

So there is yet another evolution in the state of the art still to come and that is to move the run-time to an actual garbage collector while removing the reference counting semantics that still remain to control object lifetime. Now you can see how rapidly the state of the art evolves.

We've also evolved our state of the art with regard to web interoperability and compatibility. In IE 8 we introduced document modes for the first time, after we created a huge mess for the web in IE 7, where our CSS strict mode changed so much from IE 6 that we literally "broke the web". Document modes allowed us to evolve more quickly while letting legacy sites on legacy modes continue running in our most modern platform. This persisted all the way until IE 11, where we had 5 different versions of the web platform all living within the same binary and the number of versioning decisions was reaching into the tens of thousands. One aspect of this model was our Browser OM, some aspects of which I've recently documented in "JavaScript Type System Evolution", if you are interested in going deeper.
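
To picture what "tens of thousands of versioning decisions" means in practice, here is a hypothetical sketch of a single one (the mode names echo the post; the function and its behavior split are invented for illustration); multiply it by every behavioral difference between platform versions.

    #include <cstdio>

    enum class DocMode { Quirks = 5, IE7 = 7, IE8 = 8, IE9 = 9, IE10 = 10, IE11 = 11 };

    // One hypothetical versioning decision: how wide a box renders depends on
    // which document mode the page is running in.
    int RenderedBoxWidth(DocMode mode, int specifiedWidth, int padding, int border) {
        if (mode == DocMode::Quirks) {
            // Legacy behavior preserved for sites locked to the old mode:
            // the specified width already includes padding and border.
            return specifiedWidth;
        }
        // Standards behavior: padding and border are added to the specified width.
        return specifiedWidth + 2 * padding + 2 * border;
    }

    int main() {
        std::printf("quirks mode: %d\n", RenderedBoxWidth(DocMode::Quirks, 100, 10, 2));
        std::printf("IE11 mode:   %d\n", RenderedBoxWidth(DocMode::IE11, 100, 10, 2));
    }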

So the state of the art now is EdgeHTML and an Evergreen web platform. This in turn allows us to delete some of the decay, kind of like extracting a cavity, but even this carries so much risk that the process is slow and tedious. The business value of removing inaccessible features (or hard to access features) becomes dependent on what we achieve as a result. So even the evergreen platform can suffer from decay if we aren't careful and still there are systems that undergo much head scratching as we determine whether or not they are necessary. But at least the future here is bright and this leads me to my final state of the art design change.

The industry has moved to a much more agile place of components, open source code, portable code and platform independent code. Components that don't have proper abstractions lack agility and are unable to keep pace with the rest of the industry. There are many state of the art improvements that could apply to this space, but I'm going to call out one that I believe in more than others, and that is replacing OS dependencies, abstractions and interfaces with the principles of Capabilities. And I mean Capabilities as Joe Duffy refers to them in his recent blogging on the Midori project. I plan on doing a deep session on Capabilities and applying them to the Web Platform in a future article, so I won't dive in further here. This is also a space where, unlike the three state of the art advancements above, I don't have much progress to point to. So I hope to provide insights on how this will evolve the platform moving forward rather than how the platform has already adopted the state of the art.
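
Since the deep dive will come later, here is only a rough, hypothetical sketch of the principle as I've framed it above (all names invented): a component holds no ambient authority over the OS and can only use what it has explicitly been handed, which also makes its dependencies visible and easy to fake.

    #include <cstdio>
    #include <string>

    // Capability object: the only authority the holder has is to resolve fonts
    // under the directory it was granted.
    class FontReadCapability {
    public:
        explicit FontReadCapability(std::string fontDir) : fontDir_(std::move(fontDir)) {}
        std::string ResolveFontPath(const std::string& face) const {
            return fontDir_ + "/" + face + ".ttf";   // scoped to the granted directory only
        }
    private:
        std::string fontDir_;
    };

    // The layout component cannot reach the file system except through the
    // capability it was constructed with.
    class TextLayout {
    public:
        explicit TextLayout(FontReadCapability fonts) : fonts_(std::move(fonts)) {}
        void Shape(const std::string& face) const {
            std::printf("shaping with %s\n", fonts_.ResolveFontPath(face).c_str());
        }
    private:
        FontReadCapability fonts_;
    };

    int main() {
        // The host decides exactly which authority each component receives.
        TextLayout layout(FontReadCapability("/granted/fonts"));
        layout.Shape("Segoe UI");
    }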

Accepting the Future

When it comes to legacy code we really have to stop digging in our heels. I often find people with the mindset that a given bit of code is correct until it is proven incorrect. They take the current code as gospel and often copy/paste it all over the place, using the existing code as a template for every bit of new code that they write. They often ignore the state of the art, since there aren't any examples of it in the decaying code they are working on. It is often easier to simply accept the current state of that code than it is to try and upgrade it.

I can't blame them. The costs are high and the team has to be willing to accept them. The cost to make the changes. The cost to take the risks. The cost to code review a much larger change due to the transformation. The cost to add or update testing. They have to be supported in this process and not told how smart developers 10 years ago wrote that code, so it must have been for a reason. If those developers were so smart, they might have thought about writing a comment explaining why they did something that looks absolutely crazy or that a future developer can't understand by simply reading it.

You could talk about how they were following patterns and at the time everyone understood those patterns. Well, then I should have documents on those patterns and I should be able to review those every year to determine if they are still state of the art. But the decay has already kicked in. The documents are gone, the developers who wrote the code don't remember themselves what they wrote or what those patterns mean anymore. Everyone has moved on, but the code hasn't been given the same option. Your costs for that code are increasing exponentially and there is no better time than now to understand it and evolve it to avoid larger costs in the future.

We should also all agree that we are now better than ourselves of 10 years ago and it's time to let the past go. Assume that we made a mistake back then and see if it needs correcting. If so, correct it. If not, then good job, that bit of code has lived another year.

If you are afraid of the code, then that is even more reason to go after it. I often hear things like, "We don't know what might break." Really? That is an acceptable reason? Lack of understanding is too often accepted in our industry as a reason to put yellow tape around an area and declare it off limits. Nope. Not a valid reason.

The next argument in this series is that the code is "going away". If it's "going away", then who is doing that? Because code doesn't go anywhere by itself, and deleting code is still an action that a developer has to perform. This is almost always another manifestation of fear towards the code. By noting it will "go away" you defer the decision of what to do with that code. And in software, since teams change all the time, maybe you won't even own the code when the decision has to be made.

Conclusion

When it comes to evolving your code to the state of the art, don't be a blocker. Accept that there are better ways to do whatever you did in the past, and always be truthful when computing the value proposition if you decide not to upgrade a piece of code. Document that decision well and include reasoning about what additional conditions would have swung your decision (you owe it to your future self). Schedule a reevaluation of the decision for some time in the future and see whether the decision has changed. The worst case is that you lose the analysis and the cost of the analysis becomes exponential as well. If that happens you will surely cement that decay into place for a long time.

Try to avoid letting your ego or the egos of others make decisions about the quality of code from 10, 5 or even 2 years ago. Tech is one of the fastest evolving industries and it is a self-reinforcing process. As individuals it is hard to keep up with this evolution, and so we let our own biases and our own history slow advancement when the best thing we can do is get out of the way of progress. Our experience is still extremely valuable, but we have to apply it towards the right problems. Ensuring that old code stays old is not one of them.

Cleaning up the decay can be part of the culture of your team, it can be part of the process, it can be rewarded or it can just be a dirty job that nobody wants to do. How much decay you have in your system will be entirely dependent on how well you articulate the importance of cleaning it up. The distance between state of the art and state of decay can be measured in months so it can't be ignored. You probably have more decay than you think and if you don't have a process in place to identify it and allocate resources to fix it, it is probably costing you far more than you realize.

1 comment:

  1. This reminds me of https://technet.microsoft.com/library/security/ms06-021, with the quote "exception handlers are removed from Internet Explorer with this update."
