Identify the Bug

     Debugging means removing bugs from programs. A bug is unexpected and undesirable behavior by a program.
     Occasionally there is a formal specification that the program is required to follow, in which case a bug is a failure to follow the spec. More frequently the program specification is informal, in which case people may disagree as to whether a particular program behavior is in fact a bug or not.
     As one of the people writing a program, you should be a source of bug reports. Don't simply rely on testers or users. If you notice something odd while running a program, it can be tempting to disregard it and hope that it will go away. This is especially true if you are working on something else at the time. Resist the temptation. Record the bug. If you have data files or logs, save them. If you have time later, return to the issue; otherwise, pass it on to somebody else. As one of the people most familiar with how the program is supposed to work, you are in the best possible position for detecting unexpected behavior.
     Once you have a bug report, the first step in removing the bug is identifying it. This is of particular importance when working with a bug report produced by somebody else, such as the testing team or a user. Some bugs are relatively obvious, as when the program crashes unexpectedly. Others are obscure, as when the program generates output which is slightly incorrect.
     Many bug reports received from users are of the form "I did such and such, and something went wrong." Before doing anything else, you must find out what went wrong--that is, you must identify the bug by determining the program behavior which was unexpected and undesirable. Any attempt to fix the bug before understanding what went wrong is generally wasted time.
     Identifying a bug reported by a user typically requires getting the answer to two questions: "What did the program do?" and "What did you expect the program to do?" The goal is to determine precisely the behavior of the program which was unexpected and undesirable.
     Once the bug has been identified, the easiest and fastest way to fix it is to determine that it is not a bug at all. If there is a formal specification for the program, you may have to modify the specification. In other cases, you may have to modify the expectations of the user. This rapid fix is often known as declaring the behavior to be an "undocumented feature." Despite the obvious potential for abuse, this is in fact sometimes the correct way to handle the problem.
     Unfortunately, most bugs are real bugs, and require further work.

Replicate the Bug
     After identifying the bug, the next step in fixing it is to replicate it. This means recreating the undesirable behavior under controlled conditions. The goal is to find a precisely specified set of steps which demonstrates the bug.
     In many cases this is straightforward. You run the program on a particular input, or you press a particular button on a particular dialog, and the bug occurs. In other cases, replication can be very difficult. It may require a lengthy series of steps, or, in an interactive program such as a game, it may require precise timing. In the worst cases, replication may be nearly impossible.
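     When the steps can be scripted, it often helps to capture them in a small program that anyone can rerun. The following is only a minimal sketch: the function average and the empty input are hypothetical stand-ins for the real program and for the input recorded in a bug report.

```python
def average(values):
    """Hypothetical function under investigation; stands in for the real program."""
    return sum(values) / len(values)


def reproduce():
    """Run the exact steps from the bug report and say whether the bug occurred."""
    reported_input = []  # assumed: the input recorded in the bug report
    try:
        average(reported_input)
    except ZeroDivisionError as exc:
        return f"reproduced: {exc!r}"
    return "not reproduced"


if __name__ == "__main__":
    print(reproduce())
```

     A script like this doubles as a record of the controlled conditions, and it can later be used to confirm that a fix works.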

Understand the Bug
     Once you are able to replicate the bug, you must figure out what causes it. This is generally the most time-consuming step.

Understand the program
     In order to understand a bug in a program, you must have some understanding of the program.
     If you wrote the program, then you presumably understand it. If you wrote it and still don't understand it, then you have more serious problems.
     If you didn't write the program, you need to grasp its general structure. Most programs are organized in a fairly sensible fashion, once you know the general approach. If you are lucky, the general approach is documented, or you can ask the original designer.
     More commonly, you need to pull the structure out of the source code. The best approach is to start reading at the program's entry point (e.g., the main function in a C program). Skim through the program, stepping down through functions, until you find the main center of action--in most programs, some sort of loop. This can normally be done fairly quickly. The nature of this center of action should tell you where to look in the source code for any particular activity. It should also tell you the general way in which the program acts.
     The worst cases are large programs written over many years by many different people. These often become a hodge-podge of different ideas with little consistency. The situation is depressingly common. You must simply do the best you can. At least try to avoid making the mess worse.
     A debugger can also be helpful when trying to understand a program. By running the program under the debugger and setting breakpoints, you may be able to see the dynamic behavior of the program. When you reach a breakpoint, look at the call stack to see how you got there, and look at key variables. If you don't reach a breakpoint you expected to reach, you have also learned something: the program does not take the path you assumed.

Locate the bug
     The next step is to locate the bug in the program source code.
     There are two source code locations which you need to consider: the code which causes the visible incorrect behavior, and the code which is actually incorrect. It's fairly common for these to be the same piece of code. However, it's also fairly common for them to be in different parts of the program. A typical example is an error in one part of the program which corrupts memory, leading to visible bad behavior in a completely different part of the program. Do not let your eagerness to fix the bug mislead you into thinking that the code which directly causes the bad behavior is actually incorrect.
     Ordinarily you must first find the code which causes the incorrect behavior. Knowing the incorrect behavior, and knowing how the source code is arranged, will often lead you quickly to the part of the program which is at fault. Sometimes a quick scan of the source code is enough to identify the problematic code.
     Otherwise, narrowing down the bad behavior to a particular piece of code is where a debugger can be very useful. If you are lucky enough to have a core dump, a debugger can immediately identify the line which fails. Otherwise, judiciously setting breakpoints while replicating the bug can quickly home in on the code you are after.
     Another useful approach is to add check routines to the code to verify that data structures are in a valid state. Such routines can help narrow down where data corruption occurs. If the check routines are fast, you may want to always enable them. Otherwise, leave them in the code, and provide some sort of mechanism to turn them on when you need them.
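     As an illustration, here is a minimal sketch of a check routine with an on/off switch. The SortedList class and the CHECK_INVARIANTS environment variable are hypothetical names invented for this example; a real program would check whatever invariants its own data structures are supposed to maintain.

```python
import os

# Hypothetical switch: set CHECK_INVARIANTS=1 in the environment to enable checks.
CHECKS_ENABLED = os.environ.get("CHECK_INVARIANTS") == "1"


class SortedList:
    """Toy container that is supposed to keep its items in ascending order."""

    def __init__(self):
        self.items = []

    def check(self):
        """Raise AssertionError if the container is in an invalid state."""
        for i in range(1, len(self.items)):
            if self.items[i - 1] > self.items[i]:
                raise AssertionError(
                    f"items out of order at index {i}: {self.items}"
                )

    def insert(self, value):
        # Linear scan for the first position whose item is not smaller.
        pos = 0
        while pos < len(self.items) and self.items[pos] < value:
            pos += 1
        self.items.insert(pos, value)
        if CHECKS_ENABLED:
            self.check()  # verify the structure after every mutation
```

     Calling check() after each operation that modifies the structure narrows the corruption down to the first operation after which the check fails. Because this check walks the whole list, it is the kind of routine you would normally leave disabled and switch on only when hunting a bug.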
     In the specific case of a memory corruption bug, you may be able to replace the standard memory allocation routines with ones that perform various checks. For example, on GNU/Linux systems, read the malloc documentation to see how the environment variable MALLOC_CHECK_ can be used to do this.
     The final fallback for locating the source of the bad behavior is simple source code inspection. This is the only option if you can't replicate the problem. A clear understanding of the overall program source code is an absolute requirement for this to work. Unfortunately, a complex problem is nearly impossible to isolate by simply reading the source code. You will have to guess at likely possibilities, and try to trace through the code carefully to see if they are really problems.
     If you are very unlucky, the bug may not be in the program source code at all. It may be in a library routine, or in the operating system, or in the compiler. These cases are rare, and it is a mark of an inexperienced programmer to suspect a compiler bug too quickly. However, they do happen, so when all else fails, consider these possibilities. Verify a bug in a library routine or the OS by writing a check program, and verify a bug in the compiler by examining the machine code directly.
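     A check program calls the suspect routine in isolation, with nothing else from your program involved. As a sketch, suppose (purely as an assumed scenario) that you suspect Python's built-in round of mishandling values that end in .5. The helper name observe_round is made up for this example.

```python
def observe_round(inputs):
    """Call the suspect library routine in isolation and record its results."""
    return {x: round(x) for x in inputs}


# Halfway cases that are exactly representable as binary floating point.
observed = observe_round([0.5, 1.5, 2.5])

if __name__ == "__main__":
    for value, result in observed.items():
        print(f"round({value}) = {result}")
```

     In Python 3 this shows 0.5 and 2.5 rounding down to 0 and 2, while 1.5 rounds up to 2. That is the documented round-half-to-even behavior rather than a library bug, which is a common outcome of writing a check program: the routine turns out to be doing what its documentation says.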

Locate the error
     Now that you have found the code which causes the bad behavior, you need to identify the actual coding error. Often they are the same code--that is, the coding error directly causes the bad behavior. However, you should always consider the possibility that the actual error is elsewhere.
     For example, the routine which causes the bad behavior may be behaving correctly, but be called with bad input, or at the wrong time. A coding error elsewhere may cause a data structure to hold unreasonable values. Another possibility is bad user input.
     The fix in such cases may be twofold. You should, of course, fix the code which called the routine incorrectly or otherwise created the bad input data. In the case of bad user input, you should validate the input. In addition, however, you may want to add checks to the code which uses the values. It should check for unreasonable input, and report an error or otherwise handle it without causing invalid behavior.
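     A minimal sketch of this twofold fix follows. Both functions are hypothetical: parse_percentage validates user input where it enters the program, and apply_discount, the routine which uses the value, checks its own argument as well.

```python
def parse_percentage(text):
    """Validate user input at the boundary: accept only a whole number 0..100."""
    try:
        value = int(text.strip())
    except ValueError:
        raise ValueError(f"not a whole number: {text!r}") from None
    if not 0 <= value <= 100:
        raise ValueError(f"percentage out of range: {value}")
    return value


def apply_discount(price, percent):
    """Routine that uses the value; it rejects unreasonable input itself."""
    if not 0 <= percent <= 100:
        raise ValueError(f"unreasonable discount: {percent}")
    return price * (100 - percent) / 100
```

     The second check looks redundant when every caller goes through parse_percentage. Its purpose is to turn a future coding error elsewhere into an immediate, clearly reported failure instead of a silently wrong price.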

Fix the Bug
     The final step in the debugging process is, of course, to fix the bug. I won't discuss this step in detail, as fixing a bug is where you leave the debugging phase and return to programming. I'll just mention a couple of points.
     If you want a program which can be maintained in the future, then make sure you fix the bug in the right way. This means making a fix which fits in with the rest of the program, and which fixes all aspects of the problem, without introducing any new problems. Don't forget to update any relevant documentation.
     In some cases you may need a quick patch to fix an immediate problem. There is nothing wrong with doing that, as long as you take the time afterward to go back and make the right fix.
     Obviously, always test any fix you make by ensuring that you can no longer replicate the bad behavior. Don't forget to make sure that the program continues to pass its test suites. Consider extending the test suites to detect the case which you just fixed, to make sure it doesn't reappear.
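     Extending the test suite can be as simple as adding one test for the exact case from the bug report. The sketch below uses Python's standard unittest module. The function average and its fix (returning 0.0 for an empty input instead of raising ZeroDivisionError) are assumptions made up for illustration.

```python
import unittest


def average(values):
    """Hypothetical fixed function: an empty input used to raise an exception."""
    if not values:
        return 0.0
    return sum(values) / len(values)


class AverageRegressionTest(unittest.TestCase):
    def test_empty_input(self):
        # The exact case from the (hypothetical) bug report.
        self.assertEqual(average([]), 0.0)

    def test_normal_input_still_works(self):
        # Guard against the fix breaking the ordinary case.
        self.assertEqual(average([1, 2, 3]), 2.0)


if __name__ == "__main__":
    unittest.main()
```

     Keeping the first test in the suite permanently means that if a later change reintroduces the bug, the failure shows up in testing rather than in another user report.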


