2015-10-21

The Mother of All Bugs

Michael Belivanakis 2015

At one point in my career I worked for a company that was developing a hand-held computer for the home health care market. It was called InfoTouch™. The job involved daily interaction with the guys in the hardware department, which was actually quite a joy, despite the incessant "it's a software problem", "no, it's a hardware problem" arguments, because these arguments were being made by well-meaning engineers from both camps, who were all in search of the truth, without sentimentality, egotism, vested interests, or illusions of infallibility. That is, in true engineering tradition.

During the development of the InfoTouch, for more than a year, possibly two, the device would randomly die for no apparent reason. Sometimes it would happen once a day, other times weeks would pass without a problem. After it happened, no matter how hard we tried, we could never reproduce it. Also, sometimes it would die while someone was using it, but other times we would come into the office in the morning to find that it had died during the night, while sitting on its cradle, doing nothing but charging.

When the machine died, the only thing we could do was to give it to the hardware guys, who would open it up, throw an oscilloscope at it, and try to determine whether it was dead due to a hardware or a software malfunction. And since we software guys were not terribly familiar with oscilloscopes, we had to trust what the hardware guys said.

Luckily, in true engineering tradition, the hardware guys would never say with absolute certainty that it was a software problem. At worst, they would say that it was "most probably" a software problem. What did not help at all was that one out of every dozen times that they went through the drill, they found that it did in fact appear to be a hardware problem: the machine appeared dead; there was no clock, no interrupts, no electronic magic of the kind that makes software run. But what was happening the rest of the time was still under debate.

This situation went on for a long time, and we had no way of solving the problem other than hoping that we would one day stumble upon the solution by chance. Heck, we were not even sure that it was a problem with our code to begin with. The result was a vague sense of helplessness and low overall morale, which was the last thing needed in that little startup company, which was struggling to survive for many other reasons having to do with funding, partnerships, competitors, etc.

Then one day, as I was working on some C code somewhere in our code base, I stumbled by pure chance upon a function which was declaring a local variable of pointer type and proceeding to use it without first initializing it. Specifically, it was writing a machine word to whatever memory happened to be pointed to by it. A silly little bug which is almost guaranteed to cause a malfunction, very possibly a crash, probably every time the function is invoked. To this day I do not know (or do not remember) whether that early version of Microsoft C did not yet support warnings for these types of things, or whether the people responsible for our build configuration had such hubris as to believe that "we don't need no stinkin' warnings". I quickly fixed the bug, and I was about to proceed with my daily work, when it occurred to me to take a minute and check precisely what the consequences and ramifications of the bug had been before the fix.
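For readers who have not met this kind of bug, here is a minimal sketch of its general shape and of one way to remove it. The original function is not shown in this post, so the name record_status, its parameters, and the particular fix are all hypothetical; the buggy form is kept inside a comment because actually running it is undefined behavior.

```c
#include <stddef.h>

/* Hypothetical names throughout; the real InfoTouch code is not shown.
 *
 * The buggy shape (do not run: dereferencing an uninitialized pointer
 * is undefined behavior):
 *
 *     void record_status(unsigned value)
 *     {
 *         unsigned *slot;     // declared, never initialized
 *         *slot = value;      // writes one word to an arbitrary address
 *     }
 */

/* One possible fix, assumed for illustration: the caller supplies the
 * destination, and a null destination is rejected rather than written to. */
int record_status(unsigned *slot, unsigned value)
{
    if (slot == NULL) {
        return -1;             /* nothing sensible to write to */
    }
    *slot = value;             /* the write now lands where the caller intends */
    return 0;
}
```

Calling record_status(&word, 42u) stores 42 in word and returns 0, while record_status(NULL, 1u) returns -1 without touching memory.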

First of all, I checked to see whether the function was ever called, and it turned out that it was. But the InfoTouch was for the most part running fine, so obviously, due to some coincidence, the garbage that the pointer was initialized with was such that writing to it did not cause problems. Or did it? I decided to see exactly what garbage the pointer was being initialized with. To my astonishment, I discovered the following:

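Function afunc() was invoking functions bfunc() and cfunc(), in that order. The uninitialized pointer was in cfunc().

Function cfunc() accepted 1 word-sized argument and declared 2 word-sized local variables, of which the 2nd was the problematic pointer. Function bfunc() had 2 word-sized arguments and 1 word-sized local variable, which was a... date-time variable! Both functions therefore pushed three words of arguments and locals onto the stack, so the 2nd local of cfunc() landed in the very stack slot that the local of bfunc() had just vacated.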

So, the problematic pointer was being initialized with a bit pattern that represented a date and time. This resulted in random memory corruption during different hours of the day, different days of the month, and different years of the century. The function was not being invoked very frequently, so the memory corruption was building up slowly, and depending on the date, different areas of memory were being corrupted.  It is amazing that the machine ever worked at all.
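The stack overlap described above can be imitated with a toy model. This is only a sketch under stated assumptions, none of which come from the original code: arguments are pushed by the caller, the call pushes a one-word return address, locals are allocated afterwards in declaration order, and a word is 16 bits. The names ToyStack, toy_afunc, toy_bfunc and toy_cfunc are made up for the illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model of a downward-growing stack, one word per slot.
 * All names and the 16-bit word size are assumptions for illustration. */
enum { STACK_WORDS = 16 };

typedef struct {
    uint16_t words[STACK_WORDS];
    size_t sp;                          /* index of the last pushed word */
} ToyStack;

static void push(ToyStack *s, uint16_t value) { s->words[--s->sp] = value; }

/* Allocate one slot WITHOUT writing to it, like an uninitialized local. */
static size_t reserve(ToyStack *s) { return --s->sp; }

/* bfunc: 2 arguments, return address, 1 local holding a date-time. */
static void toy_bfunc(ToyStack *s, uint16_t now)
{
    size_t base = s->sp;
    push(s, 0);                         /* argument 1 */
    push(s, 0);                         /* argument 2 */
    push(s, 0xAAAA);                    /* return address */
    size_t datetime = reserve(s);       /* the only local */
    s->words[datetime] = now;
    s->sp = base;                       /* return: frame popped, data stays */
}

/* cfunc: 1 argument, return address, 2 locals; the 2nd is the pointer. */
static uint16_t toy_cfunc(ToyStack *s)
{
    size_t base = s->sp;
    push(s, 0);                         /* argument 1 */
    push(s, 0xBBBB);                    /* return address */
    (void)reserve(s);                   /* local 1, never written */
    size_t pointer = reserve(s);        /* local 2, never written */
    uint16_t garbage = s->words[pointer];
    s->sp = base;
    return garbage;                     /* what the "pointer" contained */
}

/* afunc: calls bfunc, then cfunc, on the same stack. */
uint16_t toy_afunc(uint16_t now)
{
    ToyStack s = { {0}, STACK_WORDS };
    toy_bfunc(&s, now);
    return toy_cfunc(&s);
}
```

In this model toy_afunc() returns whatever date-time value it was given, because the slot reserved for the pointer in toy_cfunc() is the same slot that toy_bfunc() wrote its date-time into. Real compilers may lay frames out differently, which is exactly why such a bug can hide for so long.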

After this bugfix the InfoTouch never again experienced any problems of a similar kind, not even the ones that the hardware engineers believed were due to hardware.  It could be that by that time they had fixed their hardware bugs, or it could be that even the dead appearance of the hardware was ultimately caused by the malfunctioning software.
