Operational Excellence on the Cloud

Learn the fundamental design principles for operating in the cloud. Operations teams need to understand their business and customer needs so that they can support business outcomes effectively and efficiently. They create and use procedures to respond to operational events, and they validate that those procedures are effective in supporting business needs. They also collect metrics that measure the achievement of desired business outcomes.

Operational Excellence

The operational excellence pillar includes the ability to run and monitor systems to deliver business value and to continually improve supporting processes and procedures.

The operational excellence pillar provides an overview of design principles, best practices, and questions.

Design Principles

There are six design principles for operational excellence in the cloud:

Perform operations as code:

In the cloud, you can apply the same engineering discipline that you use for application code to your entire environment. You can define your entire workload (applications, infrastructure, etc.) as code and update it with code. You can script your operations procedures and automate their execution by triggering them in response to events. By performing operations as code, you limit human error and enable consistent responses to events.
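As a rough illustration of scripting operations procedures and triggering them in response to events, the following minimal Python sketch registers runbooks against event types and dispatches incoming events to them. The event names, the `runbook` decorator, and the handler functions are all hypothetical; a real workload would receive events from its own monitoring or eventing service and the handlers would call real tooling.

```python
from typing import Callable, Dict

# Hypothetical registry mapping an event type to the runbook that handles it.
RUNBOOKS: Dict[str, Callable[[dict], str]] = {}


def runbook(event_type: str):
    """Register a function as the scripted response to one event type."""
    def register(func: Callable[[dict], str]) -> Callable[[dict], str]:
        RUNBOOKS[event_type] = func
        return func
    return register


@runbook("disk_usage_high")
def free_disk_space(event: dict) -> str:
    # Placeholder for the real remediation (for example, rotating logs).
    return f"rotated logs on {event['host']}"


@runbook("instance_unhealthy")
def replace_instance(event: dict) -> str:
    # Placeholder for the real remediation (for example, replacing a node).
    return f"replaced instance {event['instance_id']}"


def handle_event(event: dict) -> str:
    """Run the matching runbook, or escalate when none is defined."""
    handler = RUNBOOKS.get(event["type"])
    if handler is None:
        return f"no runbook for {event['type']}; escalate to on-call"
    return handler(event)


if __name__ == "__main__":
    print(handle_event({"type": "disk_usage_high", "host": "web-1"}))
    print(handle_event({"type": "certificate_expiring"}))
```

Because every event of a given type runs the same function, the response is consistent from one occurrence to the next, and events without a runbook are surfaced explicitly instead of being handled ad hoc.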

Annotate documentation:

In an on-premises environment, documentation is created by hand, used by people, and hard to keep in sync with the pace of change. In the cloud, you can automate the creation of documentation after every build (or automatically annotate hand-crafted documentation). Annotated documentation can be used by people and systems. Use annotations as an input to your operations code.
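One way to picture documentation that is generated on every build and is readable by both people and systems is sketched below, assuming operations procedures are Python functions whose docstrings and a hypothetical `annotations` attribute carry the metadata. The function names, the `annotate` decorator, and the annotation keys are illustrative assumptions, not part of any particular tool.

```python
import inspect
from typing import Callable, Dict, List


def annotate(**metadata: str):
    """Attach machine-readable annotations (hypothetical keys) to a procedure."""
    def wrap(func: Callable) -> Callable:
        func.annotations = metadata
        return func
    return wrap


@annotate(owner="platform-team", severity="low")
def restart_service(name: str) -> None:
    """Restart the named service and wait for it to report healthy."""


@annotate(owner="data-team", severity="high")
def restore_backup(snapshot_id: str) -> None:
    """Restore the database from the given snapshot."""


def generate_docs(procedures: List[Callable]) -> str:
    """Build human-readable documentation from signatures and docstrings."""
    lines = []
    for proc in procedures:
        signature = inspect.signature(proc)
        lines.append(f"{proc.__name__}{signature}: {inspect.getdoc(proc)}")
    return "\n".join(lines)


def procedures_owned_by(procedures: List[Callable], owner: str) -> List[str]:
    """Use the same annotations as input to operations code."""
    return [p.__name__ for p in procedures if p.annotations.get("owner") == owner]


if __name__ == "__main__":
    procs = [restart_service, restore_backup]
    print(generate_docs(procs))
    print(procedures_owned_by(procs, "data-team"))
```

Running `generate_docs` as a build step keeps the written documentation in step with the code, while `procedures_owned_by` shows the same annotations feeding an automated decision.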

Make frequent, small, reversible changes:

Design workloads to allow components to be updated regularly. Make changes in small increments that can be reversed if they fail (without affecting customers when possible).
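A small, reversible change can be sketched as follows, assuming the change is a set of configuration keys and that a caller-supplied health check decides whether it succeeded. The function name `apply_reversible_change`, the dictionary-based configuration, and the `pool_size` setting are hypothetical stand-ins for whatever deployment mechanism a workload actually uses.

```python
from typing import Any, Callable, Dict


def apply_reversible_change(
    config: Dict[str, Any],
    change: Dict[str, Any],
    health_check: Callable[[Dict[str, Any]], bool],
) -> bool:
    """Apply a small change in place and roll it back if the check fails.

    Returns True when the change is kept and False when it was reversed.
    """
    # Remember exactly what is needed to undo this change and nothing more.
    previous = {key: config[key] for key in change if key in config}
    added = [key for key in change if key not in config]

    config.update(change)
    if health_check(config):
        return True

    # Roll back: restore overwritten values and remove newly added keys.
    config.update(previous)
    for key in added:
        del config[key]
    return False


if __name__ == "__main__":
    settings = {"pool_size": 10}
    kept = apply_reversible_change(
        settings,
        {"pool_size": 500, "debug": True},
        health_check=lambda cfg: cfg["pool_size"] <= 100,
    )
    print(kept, settings)
```

Keeping each change small means the rollback state is small too, which is what makes reversing a failed change quick and low-risk.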

Refine operations procedures frequently:

As you use operations procedures, look for opportunities to improve them. As you evolve your workload, evolve your procedures appropriately. Set up regular game days to review and validate that all procedures are effective and that teams are familiar with them.

Anticipate failure:

Perform “premortem” exercises to identify potential sources of failure so that they can be removed or mitigated.

Test your failure scenarios and validate your understanding of their impact. Test your response procedures to ensure that they are effective and that teams are familiar with their execution. Set up regular game days to test workloads and team responses to simulated events.
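Testing a failure scenario and the response to it can be illustrated with a minimal failure-injection sketch. It assumes a workload that reads from a dependency and falls back to cached data when the dependency is unavailable; the `Dependency` class, `get_data`, and `run_game_day` are hypothetical names used only for this example.

```python
from typing import Dict


class Dependency:
    """Stand-in for a downstream service with a switch to inject failure."""

    def __init__(self, inject_failure: bool = False) -> None:
        self.inject_failure = inject_failure

    def fetch(self) -> str:
        if self.inject_failure:
            raise ConnectionError("injected failure")
        return "live data"


def get_data(dependency: Dependency, cached: str = "cached data") -> str:
    """Response procedure under test: fall back to the cache on failure."""
    try:
        return dependency.fetch()
    except ConnectionError:
        return cached


def run_game_day() -> Dict[str, bool]:
    """Exercise each scenario and record whether the response matched expectations."""
    return {
        "normal": get_data(Dependency()) == "live data",
        "dependency_down": get_data(Dependency(inject_failure=True)) == "cached data",
    }


if __name__ == "__main__":
    for scenario, passed in run_game_day().items():
        print(f"{scenario}: {'ok' if passed else 'FAILED'}")
```

Each scenario states the expected impact up front, so a failing entry points to a gap between the team's understanding of a failure and what the workload actually does.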

Learn from all operational failures:

Drive improvement through lessons learned from all operational events and failures. Share what is learned across teams and through the entire organization.

Definition

There ...