Lab guidance

Hardness of assignments

Each task indicates how difficult it is:

These times are rough estimates of our expectations.

The tasks in general require not many lines of code (a few hundred lines), but the code is conceptually complicated and often details matter a lot. Some of the tests are difficult to pass.

Don't start a lab the night before it is due; it's much more time efficient to do the labs in several sessions spread over multiple days. Tracking down bugs in distributed system code is difficult, because of concurrency, crashes, and an unreliable network. Problems often require thought and careful debugging to understand and fix.


Debugging as a formal process

Debugging complex concurrent systems is not magic. Debugging is a science and a skill, and it can be learned and practiced. When done well, debugging is a process that will lead you inerringly from an initial observation of an error back to the exact fault that caused that error. In general, most bugs in complex systems cannot be solved using ad-hoc or "guess-and-check" approaches; you are likely to have the most success by being as methodical in your approach as possible.

Helpful definitions:

You can debug forwards (from fault to error), or backwards (from error to fault). The forward approach tends to devolve into a "guess-and-check" approach, and frequently does not allow for consistent or repeatable progress. By contrast, the backwards approach is always applicable to the 6.824 labs, and allows for making consistent progress, so we will focus on it here.

The backwards approach is always applicable for our labs because you always have an observable error to start with: the test cases themselves function as a form of instrumentation, which will detect certain kinds of errors and surface them to you.

Working backwards to find an error

The execution of a program that you are debugging can generally be to have three phases:

  1. The phase in which the execution of the program is correct and error-free. A fault may exist in the program, but it has not yet been manifested as an actual error.
  2. The phase in which a fault has manifested, and has produced one or more latent errors in the program state. These are not yet visible, and may silently propagate within the program state.
  3. The phase in which the latent errors have surfaced, and become observable errors. At this point, you have an indication that there's something wrong with the running program.

The goal of debugging is to narrow in on the exact location of the fault within your implementation. This means you need to identify the point in time when the fault is manifested, and the program moves from phase 1 to phase 2. It's hard to find the exact extent of phase 1, because it is generally prohibitively difficult to analyze the complete state of a program in sufficient detail to be certain that a fault has not yet manifested itself at any point in time.

Instead, the best approach is usually to work backwards and narrow down the size of phase 2 until it is as small as possible, so that the location of the fault is readily apparent. This is done by expanding the instrumentation of your code to surface errors sooner, and thereby spend less time in phase 2. This generally involves adding additional debugging statements and/or assertions to your code.

Making progress

When adding instrumentation, you want to focus on making clear and deliberate progress at narrowing down the cause of the fault. You should identify the first observable error you can in your debugging output, and then attempt to narrow down the most proximate cause of that error. This is done by forming a hypothesis about what the most proximate cause of the error could be, and then adding instrumentation to test that hypothesis. If your hypothesis is true, then you have a new "first observable error" and can repeat the process. If false, then you must come up with a new hypothesis about the proximate cause and test that instead.

It's important to maintain a sense of exactly what your current first observable error is, as you move backwards, so that you don't get lost among other errors that may be observable in your output.

At the beginning, you don't know anything about where the fault could be; it could be anywhere in your entire program. But as you advance, the possible interval in which the fault could be will narrow, until you reach a single line of code that must contain the fault. When you're down to the execution of a single function or a single block of code, it can be helpful to use a "bisection" approach, where you repeatedly add instrumentation halfway through the interval, and narrow down to one half or the other. (This is like binary search, but for finding bugs.)

Adding instrumentation

As a debugger, your main challenge is to pick the best locations for your instrumentation to best narrow down the location of your fault, along with deciding the most useful pieces of information to report in that piece of instrumentation. You can and should use your knowledge of your implementation to speed this up, but you must always check your assumptions by actually running your code and observing the results of your instrumentation.

Because instrumentation is so essential to your debugging process, you should take the time to design and implement it carefully. Consider some of the following questions when designing your instrumentation:

One specific note: make sure to learn about format strings if you aren't already familiar with them. You can refer to the Wikipedia article on Printf format strings, or to the Go-specific documentation. You will likely be much happier using functions like 'log.Printf' or 'fmt.Printf' than trying to achieve the same effects with multiple print functions.

You might also consider trying to condense each individual event you report into a single line to facilitate your ability to scan output quickly.

Tips on asking for help

As mentioned above, debugging is a complex and iterative process. It can be hard for TAs to help you debug your code over Piazza; we strongly encourage you to bring your difficult debugging challenges to Office Hours instead! We will be happy to help walk you through the debugging approaches that are best applicable to your situation. Of course, you are likely to get the most mileage out of your time in Office Hours if you make a significant attempt at debugging the problem yourself in as methodical a manner as possible.

Tips on timeouts

It should be noted that tweaking timeouts rarely fixes bugs, and that doing so should be a last resort. We frequently see students willing to keep making arbitrary tweaks to their code (especially timeouts) rather than following a careful debugging process. Doing this is a great way to obscure underlying bugs by masking them instead of fixing them; they will often still show up in rare cases, even if they appear fixed in the common case.

In particular, in Raft, there are wide ranges of timeouts that will let your code work. While you CAN pick bad timeout values, it won't take much time to find timeouts that are functional.

Additional debugging tips

Happy debugging!