Each task indicates how difficult it is:
Easy: A few hours.
Moderate: ~ 6 hours (per week).
Hard: More than 6 hours (per week). If you start late, your solution is unlikely to pass all tests.
The tasks generally don't require many lines of code (a few hundred lines), but the code is conceptually complicated, and details often matter a lot. Some of the tests are difficult to pass.
Don't start a lab the night before it is due; it's much more time efficient to do the labs in several sessions spread over multiple days. Tracking down bugs in distributed system code is difficult, because of concurrency, crashes, and an unreliable network. Problems often require thought and careful debugging to understand and fix.
Debugging complex concurrent systems is not magic. Debugging is a science and a skill, and it can be learned and practiced. When done well, debugging is a process that will lead you unerringly from an initial observation of an error back to the exact fault that caused that error. In general, most bugs in complex systems cannot be solved using ad-hoc or "guess-and-check" approaches; you are likely to have the most success by being as methodical in your approach as possible.
You can debug forwards (from fault to error), or backwards (from error to fault). The forward approach tends to devolve into a "guess-and-check" approach, and frequently does not allow for consistent or repeatable progress. By contrast, the backwards approach is always applicable to the 6.824 labs, and allows for making consistent progress, so we will focus on it here.
The backwards approach is always applicable for our labs because you always have an observable error to start with: the test cases themselves function as a form of instrumentation, which will detect certain kinds of errors and surface them to you.
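The execution of a program that you are debugging can generally be considered to have three phases:
Phase 1: Correct execution. No fault has manifested yet, and the program state is still correct.
Phase 2: Latent errors. A fault has manifested and the program state is wrong, but no error is observable yet.
Phase 3: Observable errors. The error becomes visible, for example in your debugging output or as a failed test.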
The goal of debugging is to narrow in on the exact location of the fault within your implementation. This means you need to identify the point in time when the fault is manifested, and the program moves from phase 1 to phase 2. It's hard to find the exact extent of phase 1, because it is generally prohibitively difficult to analyze the complete state of a program in sufficient detail to be certain that a fault has not yet manifested itself at any point in time.
Instead, the best approach is usually to work backwards and narrow down the size of phase 2 until it is as small as possible, so that the location of the fault is readily apparent. This is done by expanding the instrumentation of your code to surface errors sooner, and thereby spend less time in phase 2. This generally involves adding additional debugging statements and/or assertions to your code.
When adding instrumentation, you want to focus on making clear and deliberate progress at narrowing down the cause of the fault. You should identify the first observable error you can in your debugging output, and then attempt to narrow down the most proximate cause of that error. This is done by forming a hypothesis about what the most proximate cause of the error could be, and then adding instrumentation to test that hypothesis. If your hypothesis is true, then you have a new "first observable error" and can repeat the process. If false, then you must come up with a new hypothesis about the proximate cause and test that instead.
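As a minimal sketch of instrumentation that tests one hypothesis, suppose you suspect that a server's log contains an entry whose term is lower than that of the entry before it, which a Raft log is not expected to contain. The `entry` type and the `firstTermRegression` helper below are hypothetical names invented for this illustration; they are not part of the lab code.

```go
package main

import "log"

// entry is a hypothetical log entry; only the term matters for this check.
type entry struct{ term int }

// firstTermRegression returns the index of the first entry whose term is
// lower than its predecessor's term, or -1 if there is no such entry.
func firstTermRegression(entries []entry) int {
	for i := 1; i < len(entries); i++ {
		if entries[i].term < entries[i-1].term {
			return i
		}
	}
	return -1
}

func main() {
	// Made-up data standing in for a server's log at the point being checked.
	entries := []entry{{1}, {1}, {3}, {2}}

	// Hypothesis: "the log is already malformed at this point".
	if i := firstTermRegression(entries); i >= 0 {
		log.Printf("hypothesis confirmed: term %d follows term %d at index %d",
			entries[i].term, entries[i-1].term, i)
	} else {
		log.Printf("hypothesis rejected: log terms never decrease here")
	}
}
```

If the check fires, it becomes your new first observable error, and the next hypothesis concerns whatever wrote that entry. If it never fires, the hypothesis was wrong and you need a different one.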
It's important to maintain a sense of exactly what your current first observable error is, as you move backwards, so that you don't get lost among other errors that may be observable in your output.
At the beginning, you don't know anything about where the fault could be; it could be anywhere in your entire program. But as you advance, the possible interval in which the fault could be will narrow, until you reach a single line of code that must contain the fault. When you're down to the execution of a single function or a single block of code, it can be helpful to use a "bisection" approach, where you repeatedly add instrumentation halfway through the interval, and narrow down to one half or the other. (This is like binary search, but for finding bugs.)
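In practice, bisection is usually done by hand, by moving a print statement or check to the midpoint of the suspect interval. The same halving idea can be sketched in code under two strong assumptions: the block can be re-run deterministically, and once the state goes bad it stays bad. The `firstBadStep` helper and its `okAfter` callback are hypothetical names used only for this illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// firstBadStep returns the smallest step i in [0, n) for which okAfter(i)
// reports false, or n if the state is fine after every step.
// Assumption: okAfter is true for a prefix of the steps and false afterwards.
func firstBadStep(n int, okAfter func(step int) bool) int {
	return sort.Search(n, func(i int) bool { return !okAfter(i) })
}

func main() {
	// Made-up scenario: the state is fine after steps 0..4 and wrong from 5 on.
	checks := 0
	bad := firstBadStep(16, func(step int) bool {
		checks++
		return step < 5
	})
	fmt.Printf("first bad step: %d, checks used: %d\n", bad, checks)
}
```

Each check halves the interval that can still contain the fault, so even a long block needs only a handful of instrumentation points.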
As a debugger, your main challenge is to pick the best locations for your instrumentation to best narrow down the location of your fault, along with deciding the most useful pieces of information to report in that piece of instrumentation. You can and should use your knowledge of your implementation to speed this up, but you must always check your assumptions by actually running your code and observing the results of your instrumentation.
Because instrumentation is so essential to your debugging process, you should take the time to design and implement it carefully. Consider some of the following questions when designing your instrumentation:
(In particular, consider using an approach like the provided DPrintf function does, and defining one or more constant boolean flags to turn on or off different aspects of your instrumentation.)
The best approach will be personalized to the particular way that YOU best perceive information, so you should experiment to find out what works well for you.
Can you build your own helper functions, so that a common set of data (current server, term, and role, perhaps?) will always be displayed?
One specific note: make sure to learn about format strings if you aren't already familiar with them. You can refer to the Wikipedia article on Printf format strings, or to the Go-specific documentation. You will likely be much happier using functions like 'log.Printf' or 'fmt.Printf' than trying to achieve the same effects with multiple print functions.
You might also consider trying to condense each individual event you report into a single line to facilitate your ability to scan output quickly.
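The suggestions above (boolean flags, a helper that always prints the same context, and one line per event) can be combined. The following is only a sketch: the `server` struct, its fields, the flag names, and the `line` and `debugf` helpers are all hypothetical, and this is not the provided DPrintf function.

```go
package main

import (
	"fmt"
	"log"
)

// Hypothetical topic flags: flip one to true to see that kind of event.
const (
	debugVote = true
	debugLog  = false
)

// server is a stand-in for your own server struct.
type server struct {
	id   int
	term int
	role string
}

// line builds "S<id> T<term> <role> | <message>" as a single line.
// In a real server the caller would need to hold whatever lock
// protects these fields.
func (s *server) line(format string, args ...interface{}) string {
	return fmt.Sprintf("S%d T%d %s | ", s.id, s.term, s.role) +
		fmt.Sprintf(format, args...)
}

// debugf logs one line for the event if its flag is enabled and
// reports whether anything was printed.
func (s *server) debugf(enabled bool, format string, args ...interface{}) bool {
	if !enabled {
		return false
	}
	log.Print(s.line(format, args...))
	return true
}

func main() {
	log.SetFlags(log.Lmicroseconds) // timestamps help when ordering events
	s := &server{id: 2, term: 5, role: "follower"}
	s.debugf(debugVote, "granted vote to S%d", 0)
	s.debugf(debugLog, "appended %d entries", 3) // suppressed by its flag
}
```

Because every line starts with the same fields in the same order, the output is easy to scan by eye and easy to filter with a script.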
It should be noted that tweaking timeouts rarely fixes bugs, and that doing so should be a last resort. We frequently see students willing to keep making arbitrary tweaks to their code (especially timeouts) rather than following a careful debugging process. Doing this is a great way to obscure underlying bugs by masking them instead of fixing them; they will often still show up in rare cases, even if they appear fixed in the common case.
In particular, in Raft, there are wide ranges of timeouts that will let your code work. While you CAN pick bad timeout values, it won't take much time to find timeouts that are functional.
It's worth noting that there may be multiple faults in your code! It is often easiest to narrow down a bug by focusing on the first fault, rather than the last, because then there will be fewer errors in the program state to consider.
Try to avoid ruling out a bug in one piece of code simply because you believe that it's correct. If you're dealing with a bug, that usually implies that there's something wrong with your mental model of your code -- either how your implementation actually works, or else your understanding of how it's supposed to work. As such, you can't rely too much on your mental model; always verify your assumptions.
Debugging is a complex and multifaceted skill, and doing it well takes discipline and clear consideration. Try to think carefully about why your debugging approaches are succeeding or failing (or even just being convenient or painful), and take the time to find and explore new approaches that may help you. Time invested now in improving your debugging knowledge will pay off well in later labs.
Premature fixes are often dangerous when debugging. They may or may not solve the issue you're looking at, and are just as likely to shift the exact location and presentation of the fault you're trying to track down. Even worse, they may simply mask the real fault, rather than solve it. Wait until you're confident about the exact fault causing your observable error before you attempt a fix.
Sometimes, making a fix doesn't immediately solve all of your problems. The same test case may still fail because, although the old fault was corrected, a new fault that was previously hidden is now visible.
Don't neglect the possibility of bugs in your glue code. Elements like main loops, locks, and channel I/O may seem simple and unlikely to be wrong, but they can sometimes hide extremely challenging bugs.
When possible, consider writing your code to "fail loudly". Instead of trying to tolerate unexpected states, try to explicitly detect states that should never be allowed to happen, and immediately report these errors. Consider even immediately calling the Go 'panic' function in these cases to fail especially loudly. See also the Wikipedia page on Offensive programming techniques. Remember that the longer you allow errors to remain latent, the longer it will take to narrow down the true underlying fault.
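A minimal sketch of failing loudly, assuming two properties that a commit index is expected to satisfy: it never moves backwards, and it never points past the end of the log. The `assertf`, `advanceCommit`, and `panics` helpers are hypothetical names for this illustration.

```go
package main

import "fmt"

// assertf panics with a formatted message when cond is false.
func assertf(cond bool, format string, args ...interface{}) {
	if !cond {
		panic(fmt.Sprintf(format, args...))
	}
}

// advanceCommit is a hypothetical state update that refuses to tolerate
// states that should never happen, instead of silently clamping them.
func advanceCommit(commitIndex, newIndex, lastLogIndex int) int {
	assertf(newIndex >= commitIndex,
		"commit index would move backwards: %d -> %d", commitIndex, newIndex)
	assertf(newIndex <= lastLogIndex,
		"commit index %d is past the end of the log (%d)", newIndex, lastLogIndex)
	return newIndex
}

// panics reports whether f panicked; it exists only so the demo can continue.
func panics(f func()) (p bool) {
	defer func() { p = recover() != nil }()
	f()
	return
}

func main() {
	fmt.Println(advanceCommit(3, 5, 7))                    // valid update
	fmt.Println(panics(func() { advanceCommit(5, 3, 7) })) // impossible state
}
```

In real code you would not recover from the panic: the stack trace printed at the moment the impossible state appears is the point of the technique.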
When dealing with a bug that occurs sporadically, the best approach is usually to log aggressively and dump the output to a file. It's easier to filter out irrelevant parts of a verbose log (such as with a separate script) than it is to wait for the error to reappear after N runs.
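A separate filtering script can be very small. The sketch below assumes that each log line carries a tag such as "S2 " for the server that produced it; the tag format, the sample lines, and the `keepLines` helper are made up for this illustration. It reads an in-memory sample so that it runs as-is; for a real log you would scan os.Stdin or an opened file instead.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// keepLines returns the lines that contain at least one wanted substring,
// preserving their original order.
func keepLines(lines []string, wanted ...string) []string {
	var kept []string
	for _, line := range lines {
		for _, w := range wanted {
			if strings.Contains(line, w) {
				kept = append(kept, line)
				break
			}
		}
	}
	return kept
}

func main() {
	// Stand-in for a verbose log captured from a failing run.
	sample := "S0 T3 leader | sent heartbeat\n" +
		"S1 T3 follower | reset election timer\n" +
		"S2 T4 candidate | requested votes\n"

	var lines []string
	sc := bufio.NewScanner(strings.NewReader(sample))
	for sc.Scan() {
		lines = append(lines, sc.Text())
	}
	if err := sc.Err(); err != nil {
		fmt.Println("read error:", err)
		return
	}

	// Keep only what server 2 did, plus every heartbeat.
	for _, line := range keepLines(lines, "S2 ", "heartbeat") {
		fmt.Println(line)
	}
}
```

Logging everything once and filtering afterwards means a rare failure only has to be reproduced one time.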
When you're failing a test, and it's not obvious why, it's usually worth taking the time to understand what the test is actually doing, and which part of the test is observing the problem. It can be helpful to add print statements to the test code so that you know when events are happening.
And one last note: locking strategy matters. A lack of races reported by the race detector does not indicate that your locking strategy is correct. In particular, fine-grained locks can be dangerous, because they can introduce more chances for interleaving.
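To illustrate, here is a sketch in which every access to shared state is protected by a lock, so the race detector has nothing to report, yet one version is still wrong. The `voter` type and the `splitGrant`, `atomicGrant`, and `countGrants` names are hypothetical and exist only for this example.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// voter may grant its vote to at most one candidate; -1 means "not voted".
type voter struct {
	mu       sync.Mutex
	votedFor int
}

// splitGrant locks around each access separately. There is no data race,
// but another goroutine can run between the check and the update, so two
// candidates may both be told they received the vote.
func (v *voter) splitGrant(candidate int) bool {
	v.mu.Lock()
	free := v.votedFor == -1
	v.mu.Unlock()
	if !free {
		return false
	}
	v.mu.Lock()
	v.votedFor = candidate
	v.mu.Unlock()
	return true
}

// atomicGrant holds the lock across the check and the update.
func (v *voter) atomicGrant(candidate int) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.votedFor != -1 {
		return false
	}
	v.votedFor = candidate
	return true
}

// countGrants calls grant from n goroutines and counts how many succeeded.
func countGrants(n int, grant func(candidate int) bool) int {
	var wg sync.WaitGroup
	var granted int32
	for c := 0; c < n; c++ {
		wg.Add(1)
		go func(c int) {
			defer wg.Done()
			if grant(c) {
				atomic.AddInt32(&granted, 1)
			}
		}(c)
	}
	wg.Wait()
	return int(atomic.LoadInt32(&granted))
}

func main() {
	good := &voter{votedFor: -1}
	fmt.Println("atomicGrant votes granted:", countGrants(100, good.atomicGrant))

	bad := &voter{votedFor: -1}
	// This count depends on scheduling and may be greater than 1.
	fmt.Println("splitGrant votes granted:", countGrants(100, bad.splitGrant))
}
```

Holding one coarse lock for the whole check-and-update is simpler to reason about, which is usually worth more in these labs than any concurrency gained from finer-grained locking.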