Summary: This is a story about the most elusive and sinister software bug I ever came across in my decades-long career as a programmer.
The Mother of All Bugs
At some point early in my career I was working for a company that was developing a hand-held computer for home health care. It was called InfoTouch™. The job involved daily interaction with the guys in the hardware department, which was actually quite a joy, despite the incessant "It's a software problem!" -- "No, it's a hardware problem!" arguments. These arguments were made by well-meaning engineers from both camps, who were all in search of the truth, without egos, vested interests, or illusions of infallibility. That is, in true engineering tradition.
During the development of the InfoTouch, for more than a year, possibly two, the device would randomly die for no apparent reason. Sometimes it would die once a day, other times weeks would pass without a problem. On some rare occasions it would die while someone was using it, but more often it would die while sleeping, or while charging. So, the problem seemed to be completely random, and no matter how hard we tried we could not find a sequence of steps that would reproduce it.
When the machine died, the only thing we could do was to give it to the hardware guys, who would open it up, throw an oscilloscope at it, and try to determine whether it was dead due to a hardware or a software malfunction. And since we software guys were not terribly familiar with oscilloscopes, we had to trust what the hardware guys said.
Luckily, the hardware guys would never say with absolute certainty that it was a software problem. At worst, they would say that it was "most probably" a software problem. What did not help at all was that about one time in a dozen that they went through the drill, they found that it did in fact appear to be a hardware problem: the machine was just dead; there was no clock, no interrupts, no electronic magic of the kind that makes software run. But what was happening the rest of the time was still under debate.
This situation went on for a long time, and we had no way of dealing with it other than hoping that one day someone from either the software department or the hardware department would stumble upon the solution by chance. The result was a vague sense of helplessness and low overall morale, which was the last thing needed in that little startup company, which was struggling to survive for many other reasons having to do with funding, partnerships, competitors, etc.
Then one day, as I was working on some C code somewhere in our code base, I stumbled by pure chance upon a function which declared a local variable of pointer type and proceeded to write to the memory location it pointed to without first initializing it. This is a silly little bug which is almost guaranteed to cause a malfunction, possibly a crash.
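The original function is not shown in this story, so the following is only a minimal sketch of the general shape of such a bug and its fix. The names `struct reading`, `store_reading_buggy` and `store_reading` are hypothetical, invented for illustration. The buggy form is kept inside a comment because actually running it is undefined behavior.

```c
/* Hypothetical record type; the real code base is not shown in the story. */
struct reading {
    int value;
};

/*
 * The buggy shape would look roughly like this:
 *
 *     void store_reading_buggy(int value)
 *     {
 *         struct reading *r;   // never initialized: holds leftover stack garbage
 *         r->value = value;    // writes to whatever address that garbage names
 *     }
 */

/* Fixed shape: the pointer refers to real storage before it is used. */
int store_reading(int value)
{
    static struct reading slot;   /* real storage for the record */
    struct reading *r = &slot;    /* initialized at the point of declaration */

    r->value = value;
    return slot.value;            /* read back through the storage itself */
}
```

Initializing the pointer on the same line where it is declared leaves no window in which it can hold garbage.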
To this day I still do not know (or do not remember) whether that early version of Microsoft C did not yet support warnings for this type of mistake, or whether the people responsible for our build configuration had such hubris as to believe that "we don't need no stinkin' warnings".
I quickly fixed the bug, and I was about to proceed with my daily work, when it occurred to me to take a minute and check precisely what the consequences and ramifications of the bug had been before the fix.
First of all, I checked to see whether the function was ever being called, and it turned out that it was. However, the InfoTouch was running fine 99.9% of the time, so evidently, by some coincidence, the garbage in the uninitialized pointer was such that writing to the memory address it represented did not seem to cause any problems.
Or did it?
I decided to see exactly what garbage the pointer held when the function ran. To my astonishment, I discovered the following:
Function afunc() was invoking bfunc() and then cfunc().
In cfunc(), the uninitialized pointer was in the 5th word of the stack.
In bfunc(), that same 5th word of the stack had been used to hold a... date-time variable!
So, the uninitialized pointer contained a bit pattern that represented a date and time. As a result, the memory location being corrupted varied with the hour of the day and the day of the month. The function was not being invoked very frequently, so the memory corruption was building up slowly, until some vital memory location would be affected and the software would crash. It is amazing that the machine ever worked at all.
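To make the mechanism concrete without actually invoking undefined behavior, here is a small simulation. Everything in it is an assumption made for illustration: the 16-bit packing of day, hour and minute is hypothetical (the story does not say how the date-time was encoded), and `sim_memory` merely stands in for the device's RAM. It only shows how a leftover date-time bit pattern, reinterpreted as an address, lands somewhere different depending on when the function runs.

```c
#include <stdint.h>

/* Simulated 64 KiB address space, standing in for the device's RAM. */
static uint8_t sim_memory[65536];

/* Hypothetical 16-bit packing: day (5 bits) | hour (5 bits) | minute (6 bits). */
static uint16_t pack_datetime(unsigned day, unsigned hour, unsigned minute)
{
    return (uint16_t)(((day & 0x1Fu) << 11) |
                      ((hour & 0x1Fu) << 6) |
                      (minute & 0x3Fu));
}

/*
 * Models the stray write: the date-time left behind in the stack slot
 * is reinterpreted as an address, and one byte there is overwritten.
 * Returns the address that was clobbered.
 */
uint16_t stray_write(unsigned day, unsigned hour, unsigned minute)
{
    uint16_t stale = pack_datetime(day, hour, minute);

    sim_memory[stale] = 0xFF;
    return stale;
}
```

With this made-up packing, a call on day 1 at 09:30 clobbers address 2654, while a call on day 15 at 14:05 clobbers address 31621. Most such addresses hold nothing vital, which would be consistent with a device that runs fine for weeks and then dies.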
After this bugfix the InfoTouch never again experienced any problems of a similar kind.
What do we learn from this? Warnings are your friend. Enable as many warnings as you can, and use the "treat warnings as errors" option to ensure that not a single warning goes unnoticed.
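As a rough sketch of what that advice looks like with today's compilers (not the Microsoft C of the story), the comment below lists commonly used warning flags, and the function under it is written so that it builds cleanly with them. The file name `checksum.c` and the function `sum_bytes` are hypothetical.

```c
/*
 * Suggested build lines:
 *   GCC / Clang:  cc -Wall -Wextra -Werror -c checksum.c
 *   MSVC:         cl /W4 /WX /c checksum.c
 *
 * -Werror and /WX turn every warning into a build failure.
 */
#include <stddef.h>

/* Every local is given a value where it is declared. */
unsigned sum_bytes(const unsigned char *data, size_t len)
{
    unsigned total = 0;   /* without "= 0" this would be read uninitialized */

    for (size_t i = 0; i < len; i++) {
        total += data[i];
    }
    return total;
}
```

If the `= 0` were dropped, the compilers' uninitialized-variable diagnostics (`-Wuninitialized` in GCC and Clang, C4700 in MSVC) would typically flag the first use of `total`, and the build would stop instead of shipping the bug.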
■