Process Design: Roll Out a Risky Change
An example interview on rolling out a risky change.
Question
You need to roll out a risky change, such as a system-wide configuration change or a new binary deployment. How would you do that?
Background
Managing change in an ambiguous environment is a core part of the TPM skillset, whether the change affects a process, a system, or a feature. This question exercises that skill, and the answer offers a framework for approaching it.
As with a previous question, we’ll use a role-playing approach to simulate a live interview.
Solution approach
We will use the following structured approach for this question:
- Clarify the scope of the change.
- Outline the potential risks.
- Propose mitigations for these risks.
Sample answer
Interviewee: First, I’d like to understand the scale and scope of the change so I can gauge its potential impact. Here are my questions:
- How will the change be deployed? How fast will the change be felt?
- How many machines will be impacted?
- What are the potential negative effects of this change? How many users would be impacted?
- How fast can we roll back if something goes wrong?
- Do we have access to a test environment to validate these changes?
Interviewer: Good questions. Here is some additional information:
- The change will originate from our deployment server. Once a machine is targeted, the change will take effect on the order of minutes.
- This change will need to go to thousands of servers.
- Please brainstorm things that could go wrong. The machines in scope serve millions of users.
- Rollback is as fast as rollout: on the order of minutes.
- Yes, you will have access to a test environment.
Interviewee: Great. Let me enumerate some of the things that could go wrong:
- The change could cause machines to crash, taking down our application. This could be due to a bug in the configuration or binary.
- The change could introduce a user-facing bug that affects some or all users. The consequences depend on the nature of the service, for example lost usage or lost revenue.
- The change could impact performance and introduce unacceptable latency due to more expensive computations.
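One way to make the three risks above concrete is to pair each with a signal that could be watched on machines that have already received the change. The following is a minimal sketch under stated assumptions, not part of the interview answer: the `HealthSnapshot` fields, the `Thresholds` defaults, and the `triggered_risks` helper are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthSnapshot:
    """Metrics sampled from machines that have the change (hypothetical fields)."""
    crashed_machines: int
    total_machines: int
    error_rate: float       # fraction of user requests that failed
    p99_latency_ms: float   # 99th-percentile request latency


@dataclass(frozen=True)
class Thresholds:
    """Illustrative limits only; real values depend on the service."""
    max_crash_fraction: float = 0.01
    max_error_rate: float = 0.005
    max_p99_latency_ms: float = 500.0


def triggered_risks(snapshot: HealthSnapshot,
                    limits: Thresholds = Thresholds()) -> list[str]:
    """Return the names of the enumerated risks whose signal exceeds its limit."""
    risks = []
    # Risk 1: the change crashes machines.
    if snapshot.total_machines > 0:
        crash_fraction = snapshot.crashed_machines / snapshot.total_machines
    else:
        crash_fraction = 0.0
    if crash_fraction > limits.max_crash_fraction:
        risks.append("machine crashes")
    # Risk 2: the change introduces a user-facing bug.
    if snapshot.error_rate > limits.max_error_rate:
        risks.append("user-facing bug")
    # Risk 3: the change makes requests unacceptably slow.
    if snapshot.p99_latency_ms > limits.max_p99_latency_ms:
        risks.append("latency regression")
    return risks


healthy = HealthSnapshot(crashed_machines=0, total_machines=100,
                         error_rate=0.001, p99_latency_ms=200.0)
degraded = HealthSnapshot(crashed_machines=5, total_machines=100,
                          error_rate=0.02, p99_latency_ms=900.0)
print(triggered_risks(healthy))
print(triggered_risks(degraded))
```

Naming a measurable signal for each risk keeps the list actionable: any non-empty result is a candidate trigger for the minutes-long rollback the interviewer described.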
Interviewer: Sounds like a good list. How would we go about mitigating these risks?
Interviewee: There are several initiatives we can take to mitigate these risks: