Metrics Sense: Designing a Metric II
An example interview question about designing a latency metric.
Question
You are the TPM for a service that has multiple customer clients who send many (often large) requests to your service’s API and receive responses. You want to set a latency SLA for this service. What SLA would you propose, and how would you implement measuring it?
Background
This question combines a metrics question and a technical question into one, making it excellent for practicing both skills needed for a TPM interview. Setting SLAs for platforms with many different clients is also a common requirement, and TPMs should at least be familiar with SLAs since they are important for managing cross-team and external relationships and expectations.
Solution approach
This is a two-part question; we need to define the SLA (the metrics portion) and then discuss an implementation option.
- For the metrics portion, we’ll start by listing several options and the pros and cons of each option. Then, we’ll provide a recommendation of which option we should go with and why.
- For the implementation portion, we’ll discuss a way to capture the necessary events to measure the SLA. We’ll apply some basic understanding of common data structures in this section.
Sample answer
Let’s start by enumerating some potential options for this SLA:
Option 1
We can have a strict per-request SLA that says each request will be handled within a certain time window (e.g., 1 ms).
- Pros: This is great for setting expectations with customers. They know that every request will be handled within a fixed time window and can plan accordingly.
- Cons: Unless our supported functionality is tightly scoped so that requests are homogeneous, this will be very difficult or expensive to actually guarantee. Machines can fail, requests vary in complexity and cost, and our own dependencies may not offer a strict SLA of their own, so this SLA will likely not be meaningful.
Option 2
We can have a percentile-based SLA that only guarantees that a certain percentage of requests (e.g., 50% or 95%) will be handled within a certain time bound. This is commonly called “pN latency” (e.g., p50 or p95). A sketch of the computation follows the pros and cons below.
- Pros: We set a reasonable expectation for our customers while keeping some leeway for failures or very expensive requests.
- Cons: We need to be careful with aggregates since they can mask problems. For example, if we set the percentile too low, we could be meeting our SLA while still providing bad service for a large number of important requests.
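To make the percentile option concrete, here is a minimal sketch of how a pN latency could be computed over a window of observed latencies. The sample values and the 2 ms p95 target are illustrative assumptions, not numbers from the question.

```python
# A minimal sketch of a nearest-rank percentile computation over a
# window of observed latencies. Sample data and the 2 ms target are
# illustrative assumptions.
import math

def percentile_latency(latencies_ms, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(p / 100 * len(ordered)) - 1  # index of the nearest rank
    return ordered[max(rank, 0)]

samples = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.1, 1.3, 2.5, 4.0]
p95 = percentile_latency(samples, 95)
print(f"p95 = {p95} ms; SLA (< 2 ms) met: {p95 < 2.0}")  # p95 = 4.0 ms; met: False
```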
Option 3
We can divide the requests into different types (e.g., by priority or by client) and set different SLAs for each type. For example, we can define a notion of “P0” requests and say those will be handled in < 1 ms while all other requests will be handled in < 2 ms. A sketch of the per-class bookkeeping follows the pros and cons below.
- Pros: This allows us to set expectations reasonably, serve high-priority requests/customers well, and gives us flexibility in managing requests.
- Cons: We need to create a new taxonomy (e.g., which requests are high priority? What are the specific criteria?) to measure this SLA. We will also need to pass, process, and store this additional metadata, increasing our compute and storage costs.
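The sketch below shows what the extra bookkeeping could look like, assuming each request record carries a hypothetical priority label; the class names and per-class targets are assumptions for illustration.

```python
# A sketch of per-class SLA bookkeeping under an assumed taxonomy of a
# "P0" class plus a default class; labels and targets are hypothetical.
from collections import defaultdict

SLA_TARGETS_MS = {"P0": 1.0, "default": 2.0}  # assumed per-class targets

def bucket_latencies(requests):
    """Group measured latencies by priority class, defaulting unknown labels."""
    buckets = defaultdict(list)
    for req in requests:
        label = req["priority"] if req["priority"] in SLA_TARGETS_MS else "default"
        buckets[label].append(req["latency_ms"])
    return buckets

requests = [
    {"priority": "P0", "latency_ms": 0.7},
    {"priority": "P1", "latency_ms": 1.6},
    {"priority": "P0", "latency_ms": 0.9},
]
for label, latencies in bucket_latencies(requests).items():
    print(f"{label}: worst = {max(latencies)} ms, target = {SLA_TARGETS_MS[label]} ms")
```

Note that every request now has to carry and persist the priority label, which is exactly the metadata cost called out in the cons above.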
Option 4
We can take a hybrid approach that combines the options above (e.g., every “P0” request will be handled in < 1 ms, while all other requests will be handled in < 2 ms at the 95th percentile). A sketch of the combined check follows the pros and cons below.
- Pros: This captures the benefits of the approaches above while mitigating some of the cons of each.
- Cons: Managing and measuring the SLAs becomes more complex and expensive since we need to track and store more variables.
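Here is a minimal sketch of the combined check, assuming the example targets above (a strict < 1 ms bound for P0 traffic and a < 2 ms p95 bound for everything else); thresholds and sample data are illustrative.

```python
# A sketch of the hybrid check: a strict per-request bound for P0
# traffic and a p95 bound for all other traffic. The thresholds are
# the assumed < 1 ms / < 2 ms targets from the example above.
import statistics

P0_LIMIT_MS = 1.0
OTHER_P95_LIMIT_MS = 2.0

def hybrid_sla_met(p0_latencies_ms, other_latencies_ms):
    # Strict portion: every P0 request must individually beat its bound.
    strict_ok = all(l < P0_LIMIT_MS for l in p0_latencies_ms)
    # Percentile portion: only the 95% cut point of other traffic matters.
    p95 = statistics.quantiles(other_latencies_ms, n=20, method="inclusive")[-1]
    return strict_ok and p95 < OTHER_P95_LIMIT_MS

print(hybrid_sla_met([0.6, 0.8], [1.1, 1.4, 1.7, 1.9, 3.5]))  # False: p95 > 2 ms
```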
Recommendation
If our service is quite mature, then Option 4 is the ideal SLA since it lets us differentiate among all the requests we receive and prioritize the most important ones. However, if our service is less mature or has very limited functionality, then we may want to consider a simpler option such as Option 1 or Option 2.
Instrumentation
In terms of instrumenting our system to measure the SLA, we can maintain a global key-value map. For each request we receive, we assign and log a request_id as the key; the corresponding value holds the request_time (when we receive the request from the client) and the reply_time (when we send the reply back to the client). All front-end servers write to this map: they record the request_time when a request arrives and the reply_time just before the reply is sent back. The request_id is propagated with each request and reply so the two timestamps can be joined, and per-request latency is simply reply_time minus request_time.

[System diagram: clients send requests tagged with a request_id; front-end servers write request_time and reply_time for that request_id to the global key-value map.]
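Here is a minimal in-process sketch of that bookkeeping; in a real deployment the map would be backed by shared infrastructure (e.g., a distributed key-value store), and the hook names below are illustrative, not a real API.

```python
# An in-process sketch of the instrumentation described above; a real
# deployment would back this with a shared/distributed key-value store.
import time
import uuid

request_log = {}  # request_id -> {"request_time": ..., "reply_time": ...}

def on_request_received():
    """Front-end server hook: mint a request_id and log the arrival time."""
    request_id = str(uuid.uuid4())
    request_log[request_id] = {"request_time": time.monotonic()}
    return request_id  # propagated with the request through the system

def on_reply_sent(request_id):
    """Front-end server hook: log the time just before the reply goes out."""
    request_log[request_id]["reply_time"] = time.monotonic()

def latency_ms(request_id):
    """Derive the request's latency from the two logged timestamps."""
    entry = request_log[request_id]
    return (entry["reply_time"] - entry["request_time"]) * 1000.0
```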