Multithreaded Debugging
Learn how to debug multithreaded issues.
Multithreading issues arise when a program mishandles a resource shared by two or more threads. Depending on the language and the platform, threads can share various resources (variables, files, locks, etc.). Each thread must handle a shared resource carefully. In particular, while one thread modifies a shared resource, no other thread should access it at the same time.
Common symptoms of multithreading issues include crashes, hangs, and corruption. Another critical symptom is nondeterminism: the crash or hang may happen only some of the time. This course covers crashes and deadlocks in their own lessons. Here, we’ll discuss some general ideas and guidelines.
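As a minimal sketch of careful handling, the Python example below guards a shared counter with a lock so that only one thread at a time performs the read-modify-write. The `Counter` class and `run_workers` helper are hypothetical names chosen for illustration; they are not part of the lesson's material.

```python
import threading


class Counter:
    """Hypothetical shared resource: a counter updated by several threads."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # The lock ensures no other thread touches `value` during the update.
        with self._lock:
            self.value += 1


def run_workers(num_threads=4, increments=10_000):
    """Start `num_threads` threads that each increment one shared counter."""
    counter = Counter()

    def worker():
        for _ in range(increments):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

With the lock in place, the final value equals `num_threads * increments` on every run. Without it, the unsynchronized `+= 1` could lose updates, and the result could vary between runs.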
Pattern to debug multithreaded issues
For multithreading issues, most of our effort goes into understanding how the various threads access the shared resource at different points. Unlike the earlier approaches, where the code flow was our primary focus, here we concentrate on the shared resource first and then trace through the relevant code paths. We’ll present a general pattern, an ordered sequence of steps, for debugging multithreaded issues. The goal is to identify the shared resource, the code paths involved, and how the resource is mismanaged.
Step 1: Identify the shared resource
The shared resource is at the heart of the issue. If the symptom is a crash or a deadlock and a backtrace is available, the resource is usually easy to identify: it typically appears at or near the top of the backtrace. If a backtrace is unavailable or the symptom is different, look for shared resources in the suspected code areas, e.g., a global or static variable.
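When a program hangs and no backtrace exists yet, one option is to capture the stack of every thread and inspect the top frames. The sketch below assumes a CPython program and uses `sys._current_frames()`, which the Python documentation describes as intended for specialized purposes such as debugging deadlocks. The `dump_thread_stacks` helper and the `wait_on_shared_queue` worker are hypothetical names used only for this illustration.

```python
import sys
import threading
import traceback


def dump_thread_stacks():
    """Return {thread name: formatted stack} for every live Python thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    stacks = {}
    for ident, frame in sys._current_frames().items():
        name = names.get(ident, f"unknown-{ident}")
        stacks[name] = "".join(traceback.format_stack(frame))
    return stacks


def demo():
    """Block a worker thread, capture all stacks, then release the worker."""
    started = threading.Event()
    release = threading.Event()

    def wait_on_shared_queue():  # hypothetical worker stuck on a shared resource
        started.set()
        release.wait()

    worker = threading.Thread(target=wait_on_shared_queue, name="worker-1")
    worker.start()
    started.wait()  # make sure the worker is inside its function
    stacks = dump_thread_stacks()
    release.set()
    worker.join()
    return stacks
```

In the captured output, the innermost frames of the blocked thread point at the call it is waiting in, which is a good starting point for naming the shared resource.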
Step 2: Identify code paths to access or modify this shared resource
The suspected shared resource might be read or modified by various functions invoked from different code paths. The next step is to identify these code paths, or to build a call graph for each code area that accesses the resource. If the call graph is enormous, it is enough to stop at some subcomponent rather than go all the way to the top.
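For Python code, a first pass at this step can be automated. The sketch below uses the standard `ast` module to list the functions that read or write a given name. The `functions_touching` helper and the `SAMPLE` source are hypothetical, and the approach is deliberately rough: it matches on the bare name only, ignores attribute access and aliasing, and reports an augmented assignment such as `+=` as a write.

```python
import ast


def functions_touching(source, name):
    """Map each function in `source` to the kinds of access it makes to `name`."""
    tree = ast.parse(source)
    result = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kinds = set()
            for sub in ast.walk(node):
                if isinstance(sub, ast.Name) and sub.id == name:
                    is_write = isinstance(sub.ctx, (ast.Store, ast.Del))
                    kinds.add("write" if is_write else "read")
            if kinds:
                result[node.name] = kinds
    return result


# Hypothetical module with one shared global variable.
SAMPLE = '''
shared_total = 0

def deposit(amount):
    global shared_total
    shared_total += amount

def report():
    print(shared_total)

def unrelated():
    return 42
'''
```

Calling `functions_touching(SAMPLE, "shared_total")` reports `deposit` as a writer and `report` as a reader. These functions are the leaves from which the call graph can be traced upward.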
Step 3: Identify the actual code paths in execution
Only a subset of the code paths that can access the shared resource may actually be executing and causing the bug. So, our next step is to identify which code paths are in play from the logs or other diagnostic information. The logs can reveal the code paths executing around the time of the bug, so simple code tracing from the logs, keeping the timestamps and thread IDs in mind, will usually do the trick. If the logs contain a subcomponent field, it can come in handy: we can go back to step 2 and develop the call graphs for the subcomponents we see in the logs.
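The log filtering described above can be sketched as follows. The log format here (`<ISO timestamp> [<thread ID>] [<subcomponent>] <message>`) is an assumption made for illustration, as are the `parse_line` and `active_components` helpers and the sample lines. Real logs will need their own pattern.

```python
import re
from datetime import datetime, timedelta

# Assumed format: "<ISO timestamp> [<thread id>] [<subcomponent>] <message>"
LOG_LINE = re.compile(r"^(\S+) \[([^\]]+)\] \[([^\]]+)\] (.*)$")


def parse_line(line):
    """Return (timestamp, thread_id, component, message), or None if no match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    ts, thread_id, component, message = match.groups()
    return datetime.fromisoformat(ts), thread_id, component, message


def active_components(lines, bug_time, window_seconds=1.0):
    """Return {thread_id: subcomponents} logged within the window around bug_time."""
    window = timedelta(seconds=window_seconds)
    active = {}
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip lines that do not follow the assumed format
        ts, thread_id, component, _ = parsed
        if abs(ts - bug_time) <= window:
            active.setdefault(thread_id, set()).add(component)
    return active


# Hypothetical log excerpt.
SAMPLE_LOGS = [
    "2024-01-01T10:00:00.100 [T1] [cache] put key=a",
    "2024-01-01T10:00:00.150 [T2] [cache] evict key=a",
    "2024-01-01T10:00:05.000 [T3] [network] heartbeat",
]
```

For a bug observed at `datetime(2024, 1, 1, 10, 0, 0, 200000)`, only threads `T1` and `T2` in the `cache` subcomponent fall inside the one-second window, so the call graphs for `cache` are the ones worth developing first.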
Step 4: Build a timeline of events
From the code paths, thread IDs, and timestamps, we can devise a timeline of when and how the threads access the shared resource. A timeline can be constructed from the code, the logs, or both. It is good to start with the code for two reasons. First, the code alone is sufficient in most cases. Second, a good understanding of the code helps us interpret the logs better and develop good intuition.
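A timeline can also be assembled programmatically once the accesses are known. The sketch below sorts hypothetical access events by timestamp and flags one common pattern: a thread writes the resource while another thread still holds a value it read earlier and has not yet written back. The `Access` record, both helper functions, and the sample events are assumptions made for illustration.

```python
from typing import NamedTuple


class Access(NamedTuple):
    timestamp: float  # seconds; hypothetical values taken from logs
    thread_id: str
    action: str       # "read" or "write"


def build_timeline(events):
    """Order the access events by time."""
    return sorted(events, key=lambda e: e.timestamp)


def find_interleaved_updates(timeline):
    """Return (read, write) pairs where another thread wrote while a thread's
    earlier read had not yet been followed by that thread's own write."""
    pending_reads = {}  # thread_id -> most recent read awaiting a write
    suspects = []
    for event in timeline:
        if event.action == "read":
            pending_reads[event.thread_id] = event
        elif event.action == "write":
            # Every other thread with a pending read now holds a stale value.
            for thread_id, read in pending_reads.items():
                if thread_id != event.thread_id:
                    suspects.append((read, event))
            pending_reads.pop(event.thread_id, None)
    return suspects


# Hypothetical events: both threads read before either writes back.
EVENTS = [
    Access(0.13, "T2", "write"),
    Access(0.10, "T1", "read"),
    Access(0.12, "T1", "write"),
    Access(0.11, "T2", "read"),
]
```

Here, `find_interleaved_updates(build_timeline(EVENTS))` returns a single pair: the read by `T2` at 0.11 and the write by `T1` at 0.12. That pair suggests `T2` later wrote back a value computed from stale data, which is a theory to confirm against the code.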
Step 5: Identify the root cause or theorize
The bug should be visible from the timeline above. Even if the root cause is not immediately obvious, we should at least be able to theorize about what has transpired.
Note: The steps above are suggestions and directions. Following them precisely is not always possible and does not guarantee success. They might need to be adapted or modified based on the situation.
Pattern demonstration
Let’s use the pattern above to debug a multithreaded issue.