Measure Everything

An introduction to measurements and statistics in a Phoenix application.

Introduction

We now have the tools and knowledge to build a real-time application using Phoenix Channels. However, the application only proves its worth once it runs for actual users. It needs to operate efficiently so that requests do not time out, encounter errors, or otherwise fail.

This section looks at several common scaling challenges and best practices to help avoid performance issues as we develop and ship our application. We’re covering these topics before we build an application because it’s essential to consider them at the design stage of the development process and not after the application is written.

The following performance pitfalls are common problems that affect applications. We’ll encounter many other challenges when building and shipping an application, but we’ll focus on these three because they apply to all real-time applications.

Unknown application health

We need to know if our deployed application is healthy. When our application experiences a problem, we can identify the root cause by looking at all of our metrics. We’ll see how to add measurements to our Elixir applications using StatsD.

Limited channel throughput

Each Channel uses a single process on the server to handle incoming and outgoing messages. If we’re not careful, a long-running request can block the Channel from processing anything else. We’ll solve this problem with built-in Phoenix functions.

Building a data pipeline

We can build a pipeline that efficiently moves data from server to user. We should be intentional in our data pipeline design so that we know the capabilities and limitations of our solution. We’ll use GenStage to build a production-ready data pipeline.

We’ll walk through each pitfall in detail throughout this section and see solutions to each as we go. Let’s start by looking at how to measure our Elixir applications.

Measure everything

A software application comprises many interactions and events that power its features. A feature works correctly and quickly only when every event in its flow succeeds. If even a single step encounters an issue or slows down, the rest of that flow is affected. We need to be aware of everything in our application to prevent and identify problems.

It is impossible to effectively run a decent-sized piece of software without some form of measurement. Software becomes a black box once deployed, and having different views into the application lets us know how well things are working. This is so useful that a class of tools called Application Performance Monitoring (APM) has emerged. While they usually cost money, these tools are an excellent way to start measuring our applications. Even if we use an APM tool, the content in this section still applies because not everything can be measured automatically.

We will cover a few different measurements that we can use in our application. Many open-source tools can collect these measurements. We’ll work with one of these tools and see how to use it in our code, but first, we’ll cover a few types of measurements that are useful for most applications.

Types of measurements

The best way to know if our application behaves correctly is to place instrumentation on as many different events and system operations as possible. There are many things we can measure and ways that we could measure them. Here are a few of the simple but effective ways that we can measure things:

  • Count occurrences: This is the number of times an operation happens. We could count every time a message is pushed to our Channel or count every time a Socket fails to connect.

  • Count at a point in time: This is the value of a component of our system at a given moment. The number of connected Sockets and Channels could be counted every few seconds. This is commonly called a gauge in many measurement tools.

  • Timing of an operation: This is the amount of time it takes for an operation to complete. We could measure the time taken to push an event to a client after the event is generated.
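
The three measurement types above can be sketched in plain Elixir. This is a minimal illustration only: the Measurements module, its function names, and the metric names are hypothetical, and the values are kept in an in-memory Agent rather than sent to a real collection tool.

```elixir
defmodule Measurements do
  # Hypothetical in-memory store for illustration; not a StatsD client.

  def start_link do
    Agent.start_link(fn -> %{counters: %{}, gauges: %{}, timings: %{}} end, name: __MODULE__)
  end

  # Count occurrences: add 1 every time the operation happens.
  def increment(name) do
    Agent.update(__MODULE__, fn state ->
      %{state | counters: Map.update(state.counters, name, 1, &(&1 + 1))}
    end)
  end

  # Count at a point in time (gauge): overwrite with the latest observed value.
  def gauge(name, value) do
    Agent.update(__MODULE__, fn state ->
      %{state | gauges: Map.put(state.gauges, name, value)}
    end)
  end

  # Timing of an operation: run the function and record how long it took,
  # in microseconds, then return the function's own result.
  def measure(name, fun) when is_function(fun, 0) do
    {microseconds, result} = :timer.tc(fun)

    Agent.update(__MODULE__, fn state ->
      %{state | timings: Map.update(state.timings, name, [microseconds], &[microseconds | &1])}
    end)

    result
  end

  # Return everything recorded so far.
  def snapshot, do: Agent.get(__MODULE__, & &1)
end

{:ok, _pid} = Measurements.start_link()

# Two messages pushed to a Channel.
Measurements.increment("channel.message_pushed")
Measurements.increment("channel.message_pushed")

# The number of connected Sockets observed right now (made-up value).
Measurements.gauge("socket.connected", 42)

# Time a stand-in for pushing an event to a client.
Measurements.measure("event.push", fn ->
  Process.sleep(5)
  :done
end)
```

A counter only ever grows, a gauge is replaced by each new reading, and a timing keeps every observed duration so that a tool could later compute averages or percentiles.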

Each measurement type is helpful in different situations, and there isn’t a single type that’s superior to the others. Combining different measurements into a single view (in our choice of visualization tool) can help pinpoint an issue. For example, we may have a spike in new connection occurrences that lines up with an increase in memory consumption. All of this could contribute to an increase in message delivery timing. Each of these measurements on its own would tell us something, but not the complete picture. Combining all of them contributes to understanding how the system is stressed.

Measurements are usually collected with some identifying information. At a minimum, each measurement has a name and value, but some tools allow for more structured ways of specifying additional data, such as with tags. We can attach other metadata to our measurements to help tell our application’s story. For example, shared online applications often use the concept of a tenant to isolate a customer’s data. We could add a tenant_id=XX tag to all metrics to understand the current system health from the perspective of a single tenant.
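
As a rough sketch of what a name, value, and tags look like together, the helper below formats a measurement as a single text line. The TaggedMetric module is hypothetical. The name:value|type layout follows the StatsD line format, while the |#key:value tag suffix is an assumption borrowed from the DogStatsD extension, since plain StatsD has no tag syntax and other tools may encode tags differently.

```elixir
defmodule TaggedMetric do
  # Hypothetical helper for illustration; not part of any StatsD library.

  # StatsD type codes: counter, gauge, and timing in milliseconds.
  @type_codes %{counter: "c", gauge: "g", timing: "ms"}

  # Builds "name:value|type", plus "|#key:value,..." when tags are given.
  # The tag suffix is DogStatsD-style and is an assumption of this sketch.
  def format(name, value, type, tags \\ []) do
    base = "#{name}:#{value}|#{Map.fetch!(@type_codes, type)}"

    case tags do
      [] -> base
      _ -> base <> "|#" <> Enum.map_join(tags, ",", fn {key, val} -> "#{key}:#{val}" end)
    end
  end
end

# A counter tagged with the tenant it belongs to.
IO.puts(TaggedMetric.format("channel.message_pushed", 1, :counter, tenant_id: "XX"))

# A gauge with no tags.
IO.puts(TaggedMetric.format("socket.connected", 42, :gauge))
```

Because the tag travels with every measurement, a visualization tool can filter or group the same metric by tenant without needing a separate metric name per customer.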

Now, let’s see how to collect these measurements using StatsD.
