OmniTable: Making debugging simple and productive

Abdul Qadeer
Jan 03, 2024
9 min read

Modern software is complex and inevitably contains bugs. The nature of a specific bug can make a debugging session long and tedious. For example, finding the root cause of performance bugs or catching race conditions can be challenging, especially if the software is a distributed system (one in which different parts of the overall system are deployed on different nodes).

Note: According to one study, developers spend 35% to 50% of their time on validation and debugging.

A recent innovation, OmniTable (OT), is a new abstraction that collects program states in a database-like table and then enables SQL queries to run on that table to reach the root cause of a bug. (Reference: Quinn, Andrew, Jason Flinn, Michael Cafarella, and Baris Kasikci. "Debugging the OmniTable Way." In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pp. 357-373. 2022.)

This blog first skims through the usual debugging tools and the aspects in which they fall short. It then details the innovations OmniTable brings to debugging, followed by a high-level look at the OT system. Finally, it discusses the initial results of OmniTable on some popular software, such as Redis, an in-memory key-value store.

Debugging today and its shortcomings#

A typical debugging session requires that a developer has a hypothesis for why a bug might be occurring. Then that developer makes different observations in the execution state of their program. Simpler methods, such as using breakpoints and single-stepping or using printf() debugging, might work on shorter blocks of code but are often cumbersome or impossible with a large codebase and complex bugs.

Another option is automatic binary-level instrumentation by a tool. Binary instrumentation is a technique that inserts additional instructions into a compiled program to glean interesting information, such as the parameters passed to a specific function and the function's return value. After these kinds of additions, the modified binary is still a legitimate program and doesn't need any recompilation. However, there is often a considerable gap between what a tool can show and what is needed. Additionally, tools that can output a lot of state per program step (or at breakpoints) tend to cause a substantial slowdown in program execution.

Developers need the appropriate state of a buggy program based on a specific input. Often they rely on manual instrumentation of their code. On the one hand, doing so litters the business logic code with debugging code. On the other hand, writing code like this is inherently difficult because it’s hard to ensure that the debugging code itself doesn’t contain bugs. Bugs embedded in debugging code can point the developers in the wrong direction.

OmniTable brings improvements to all of the scenarios above. But before studying those improvements, there needs to be some metric to measure those improvements.

Two important metrics are programming complexity and performance overhead. For developer productivity, we need a tool that has low programming complexity and low performance overhead.

Current debugging tools can cause a slowdown of 2 to 1000 times when the amount of state observed increases. Often, a single tool is not enough, and each of them has its own learning curve.

OmniTable innovations#

There are three key ideas in the OT abstraction.

  • OmniTable decouples the debugging logic from the actual execution of the system. It records the execution once, compactly, and then can replay it as needed to generate the required state of the program at specific places.

  • OmniTable captures all the states at the instruction level and represents them as a giant relational table. Later, derived tables are made as needed to materialize high-level constructs, like those for variables and functions. At first glance, this idea seems like a nonstarter because, even on a single server, capturing all the states (machine registers, memory, etc.) would be like doing a core dump on each instruction. It would be impossible to store such a huge state due to its sheer size. The OT implementation uses a series of optimizations to make it viable, such as lazily materializing table data from the replay at only the places the developer asks for. Because OT conceptually has all the possible states, looking for a specific state means running different queries. OmniTable can use thousands of clustered machines to inspect the states at different points in time by running the queries in parallel. (A prototype of OmniTable used Spark for this purpose.)

  • Expressing the debugging intent of a developer in a SQL query is more succinct than an equivalent debugging script in a tool such as gdb. One important assumption is that developers are well versed in standard SQL. Hence, SQL queries are an effective way to transform a debugging hypothesis into a series of queries. On the system-implementation side, database developers have decades of experience in query planning, optimization, and execution for better performance. OmniTable taps into that knowledge base. The OT system executes the different subqueries of a query in parallel to increase speed.
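To make the second and third ideas concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module. The `ot` table, its three columns, and the `balance` variable are hypothetical stand-ins invented for illustration; the real OmniTable conceptually covers every register and memory location at every instruction and is not stored this way.

```python
import sqlite3

# Hypothetical, heavily simplified "OmniTable": one row per (step, location).
# The table name and columns are assumptions made for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ot (step INTEGER, location TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO ot VALUES (?, ?, ?)",
    [
        (1, "balance", 100),
        (2, "balance", 40),
        (3, "balance", -20),  # the buggy state we want to locate
        (4, "balance", -20),
    ],
)

# A debugging hypothesis expressed as a query:
# at which step did `balance` first become negative?
first_bad_step = conn.execute(
    "SELECT MIN(step) FROM ot WHERE location = 'balance' AND value < 0"
).fetchone()[0]
print(first_bad_step)
```

The point of the sketch is the shape of the workflow: the debugging question is a declarative query over recorded state, not a script that drives a live process.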

OmniTable system design#

The OT system can be viewed as having two major phases—data collection and querying. Let's look at each of them.

Phase 1: Data collection and table formations#

OmniTable needs to record the full execution state over a period of time. Deterministic recording and replay of a software system, in which years' worth of data are recorded on a commodity hard disk, is possible.
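The reason a recording can stay compact is that only the nondeterministic inputs need to be logged; everything else can be recomputed during replay. The following is a toy sketch of that idea, not the mechanism the OT prototype uses. The `run` function and the input log are hypothetical.

```python
import random

def run(next_input, steps=5):
    """Deterministic program logic; all nondeterminism comes from next_input()."""
    total = 0
    trace = []
    for _ in range(steps):
        total += next_input()
        trace.append(total)  # program state after each step
    return trace

# Record: run once and log every nondeterministic input.
log = []

def recording_input():
    value = random.randint(0, 100)
    log.append(value)
    return value

recorded_trace = run(recording_input)

# Replay: feed the logged inputs back in the same order.
# The full state trace is regenerated without having been stored.
replay_inputs = iter(log)
replayed_trace = run(lambda: next(replay_inputs))

print(replayed_trace == recorded_trace)
```

Only five integers are logged here, yet every intermediate state can be reproduced on demand.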

The following figure shows what a typical OmniTable might look like.

An OmniTable for a short execution

The following is a schema to materialize variables, Vars(ot), and functions, Funcs(ot). Here, ot is an alias for the OmniTable.

The schema of the Funcs(ot) and Vars(ot) views, each line in each table describes a column

Typical materialized views of variables and functions are as follows:

Example data from Vars(ot)

Example data from Funcs(ot) (omitting the callStack and thread columns)
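As a rough illustration of how a function-level view can be derived from lower-level rows, the sketch below pairs call and return events with a SQL join. The `events` table, the `call_id` column, and the view's column names are assumptions made for this example; they are not the actual Funcs(ot) schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical low-level events; call_id is a convenience column
# invented here to pair each call with its return.
conn.execute(
    "CREATE TABLE events "
    "(step INTEGER, kind TEXT, call_id INTEGER, name TEXT, value INTEGER)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
    [
        (10, "call", 1, "lookup", None),
        (12, "call", 2, "hash", None),
        (15, "ret", 2, "hash", 7),
        (20, "ret", 1, "lookup", 0),
    ],
)

# A derived, function-level view: one row per completed invocation.
conn.execute(
    """
    CREATE VIEW funcs AS
    SELECT c.name  AS name,
           c.step  AS call_step,
           r.step  AS return_step,
           r.value AS return_value
    FROM events AS c
    JOIN events AS r ON r.call_id = c.call_id AND r.kind = 'ret'
    WHERE c.kind = 'call'
    """
)

rows = conn.execute(
    "SELECT name, call_step, return_step, return_value "
    "FROM funcs ORDER BY call_step"
).fetchall()
print(rows)
```

Because the view is just a query, nothing is computed until a developer actually selects from it, which mirrors the lazy materialization described earlier.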

Phase 2: Querying the tables#

At this point, executing a developer's query on the program state is like any database query in that proper planning and optimization are needed for good performance. The following figure shows the steps each query goes through. Here, SteamDrill is the experimental prototype that implements the OT system.

SteamDrill steps for query resolution (purple steps reuse or customize existing approaches, and white steps are new designs)

The following figure is a query plan that makes another view over the memory named DefinedMemory(ot). This plan creates a view that contains a window of time during which a few Redis objects were created.

The relational tree for a query to see the specific state of Redis items

In the tree above, OmniTables are red ovals labeled with OT. Purple ovals represent user-defined functions. Instruction definitions are ID nodes, and function definitions are FD nodes. Relational operators are white rectangles in which where, join, and select are represented by σ, Join, and Π nodes. The logic for each derived view is encapsulated in a dotted rectangle.

Example use case: Finding a caching bug in Redis#

Let's assume that a developer wants to debug a performance problem in Redis. Over time, the user-perceived latency increases, and the developer observes that a lot of calls are going to the backend persistence store instead of being served by the Redis cache.

It’s a complex scenario, but the OT system allows the developer to reach the root cause in just five queries. The first query shown below is one in which a windowed average is taken over the working-set size of the Redis cache. A few observations are in order:

  • The amount of code needed to do similar work by a gdb script (the right-hand script in the following figure) is much larger and more complex than the SQL query.

  • The developer can write as many queries over the state as needed, and the OT system will execute them efficiently by replaying the execution.

  • The developers can easily move across time or any program state that they deem necessary without actually executing the program. This is possible because the first execution has been recorded, and OmniTable can extract and materialize tables on demand.

Comparison of number of lines of code using OT versus Python's gdb binding script

The results of the query above show that the size of the Redis cache was almost constant over the execution. That indicates that the cache replacement code probably isn't working properly, and the developer's subsequent queries verify that hypothesis. (We aren't including the remaining four queries in this blog for brevity, but more details can be found in the original paper.)
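The query itself appeared in a figure that isn't reproduced here, so the following is only a sketch of what a windowed average over a size measurement can look like in SQL. The `cache_size` table, its columns, the sample data, and the window length of 10 steps are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical samples of the cache's working-set size over time.
conn.execute("CREATE TABLE cache_size (step INTEGER, items INTEGER)")
samples = [(step, 100 + (step % 3)) for step in range(30)]
conn.executemany("INSERT INTO cache_size VALUES (?, ?)", samples)

# Tumbling windows of 10 steps each: integer division assigns every
# step to a bucket, and AVG summarizes the size within that bucket.
window_averages = conn.execute(
    """
    SELECT step / 10 AS bucket, AVG(items) AS avg_items
    FROM cache_size
    GROUP BY step / 10
    ORDER BY bucket
    """
).fetchall()

for bucket, avg_items in window_averages:
    print(bucket, round(avg_items, 1))
```

In this made-up data, the averages barely move from window to window, which is the kind of near-constant result that would prompt the follow-up hypothesis about the cache replacement code.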

Results#

An experimental study using a prototype shows that OT queries need up to 11.67 times fewer lines of code and up to 23.49 times less estimated writing time than equivalent debugging scripts written with gdb's Python bindings. The following table shows the results of using OmniTable on real-world workloads. It compares the OT queries with the gdb scripts on three metrics: the number of lines of code, the number of terms (nodes) in an abstract syntax tree of the code (or query), and the Halstead complexity, which estimates the amount of time it might take to correctly write the script or query. OmniTable performs much better than gdb on all three fronts.

Code Lines, Nodes, and Halstead Complexity for debugging questions expressed using OT queries and gdb python scripts
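For readers unfamiliar with Halstead's measures, the sketch below computes the standard time estimate from operator and operand counts. The formulas are the textbook definitions; whether the study used exactly this variant is an assumption, and the counts in the example are invented, not taken from the paper.

```python
import math

def halstead_time_seconds(distinct_operators, distinct_operands,
                          total_operators, total_operands):
    """Estimate the time to write a program using Halstead's standard formulas."""
    vocabulary = distinct_operators + distinct_operands
    length = total_operators + total_operands
    volume = length * math.log2(vocabulary)
    difficulty = (distinct_operators / 2) * (total_operands / distinct_operands)
    effort = difficulty * volume
    return effort / 18  # 18 is the conventional Stroud number

# Made-up counts for a tiny script:
# 4 distinct operators, 4 distinct operands, each kind used 8 times in total.
estimate = halstead_time_seconds(4, 4, 8, 8)
print(round(estimate, 2))
```

A shorter query with fewer distinct symbols lowers both volume and difficulty, so the estimated effort drops faster than the line count alone would suggest.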

The following graph shows the scalability of the OmniTable implementation when run on a single core versus 64 cores. The speedup is promising and indicates that adding more workers to the system can speed up the debugging session.

SteamDrill query latency on a single core and on 64 cores compared to gdb script latency, which is sequential (Y-axis is log-scale)
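One way to picture where the parallel speedup comes from is to split the recorded timeline into epochs and evaluate the same query on each epoch independently. The sketch below does this with a thread pool over an in-memory list. It only illustrates the partition-and-merge structure; the epoch splitting, the stand-in data, and the predicate are hypothetical, and Python threads will not show a real CPU speedup.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a recorded execution: the value observed at each step.
recorded_values = list(range(1, 101))

def split_into_epochs(total_steps, workers):
    """Divide [0, total_steps) into contiguous, roughly equal epochs."""
    size = -(-total_steps // workers)  # ceiling division
    return [(start, min(start + size, total_steps))
            for start in range(0, total_steps, size)]

def query_epoch(bounds):
    """Evaluate the same predicate over one epoch of the recording."""
    start, end = bounds
    return [(step, recorded_values[step])
            for step in range(start, end)
            if recorded_values[step] % 25 == 0]

epochs = split_into_epochs(len(recorded_values), workers=4)
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(query_epoch, epochs))  # keeps epoch order

# Merge the per-epoch results back into one answer.
matches = [row for part in partial_results for row in part]
print(matches)
```

Because each epoch can be replayed independently, adding workers shortens the wall-clock time of a query as long as the merge step stays cheap.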

So the results show that OmniTable offers developers both low programming complexity and a low performance penalty.

Takeaways#

The original paper on OmniTable presented a new abstraction for efficient debugging and backed it up with a prototype (typical of a research study). While this idea will probably morph into some concrete software product over time, now is a good time to think about how to embrace this idea for systems.

OmniTable looks like a promising system to efficiently debug complex software. Clearly, it’s much better than the usual ways of debugging, though coming up with the right view formation and queries can be challenging at first. The ability to compactly and efficiently materialize any required program state is fascinating and uses an exciting series of optimizations (new and old) from the database domain.

Often, developers are looking for exciting side projects. Here’s one! Build a system on the ideas of OmniTable! SteamDrill is a prototype implementation that has its code available under a BSD license, which is an open invitation for developers to build on it or to simply use it in software to see it in action!

Frequently Asked Questions

What is OmniTable debugging software?

OmniTable separates the debugging process from the initial execution. This approach significantly lowers the performance impact that is typically associated with debugging.