Build a Scalable Data Pipeline

Data pipeline

Our real-time application keeps our users up to date with the latest information possible. This means we have to get data from our server to our clients, potentially a lot of data, as quickly and efficiently as possible. Delays or missed messages leave users looking at stale information, which hurts their experience. Because this part of our system is so important, we must be intentional in designing how our application’s data flows. The mechanism that handles outgoing real-time data is a data pipeline.

A data pipeline should have certain traits to work quickly and reliably for our users. We’ll cover these traits before writing the code. We’ll then see how to use the Elixir library GenStage to build a completely in-memory data pipeline. We’ll learn about GenStage’s features that are important for a data pipeline but would be challenging to build ourselves.

We’ll measure our pipeline to know that it’s working correctly. Finally, we’ll see what makes GenStage such a robust base for a data pipeline. Let’s start by going over the traits of a production-grade data pipeline.

Traits of a data pipeline

Our data pipeline should have a few traits no matter what technology we choose. Our pipeline can scale from a performance and maintainability perspective when it exhibits these traits.

Deliver messages to all relevant clients

This means that a real-time event is broadcast to all connected Nodes in our data pipeline so that each Node can handle the event for its connected Channels. Phoenix PubSub handles this for us, but we must keep in mind that our data pipeline spans multiple servers. We should never send incorrect data to a client.

Fast data delivery

Our data pipeline should be as fast as possible. This allows a client to get the latest information immediately. Producers of data should also trigger a push without worrying about performance.

Durability

Our use case might require that push events have strong delivery guarantees, but our use case can also be more relaxed and allow for in-memory storage until the push occurs. In either case, we should be able to adjust the data pipeline for our needs, or even completely change it, in a way that doesn’t involve completely rewriting it.
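As an illustration of the relaxed, in-memory option, here is a minimal sketch that uses only Elixir’s standard library. The `PushQueue` module and its `enqueue/2` and `drain/1` functions are hypothetical names chosen for this example; they are not part of GenStage or Phoenix. Events are held in an Erlang `:queue` inside a GenServer, and `enqueue/2` uses `GenServer.cast/2`, so the caller hands off an event without waiting for it to be processed.

```elixir
defmodule PushQueue do
  # Hypothetical in-memory queue; its contents are lost if the process stops.
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  # Asynchronous: returns :ok immediately, without waiting for the server.
  def enqueue(server, event), do: GenServer.cast(server, {:enqueue, event})

  # Synchronous: returns all queued events in arrival order and empties the queue.
  def drain(server), do: GenServer.call(server, :drain)

  @impl true
  def init(:ok), do: {:ok, :queue.new()}

  @impl true
  def handle_cast({:enqueue, event}, queue) do
    {:noreply, :queue.in(event, queue)}
  end

  @impl true
  def handle_call(:drain, _from, queue) do
    {:reply, :queue.to_list(queue), :queue.new()}
  end
end

{:ok, pid} = PushQueue.start_link()
:ok = PushQueue.enqueue(pid, %{user_id: 1, data: "hello"})
:ok = PushQueue.enqueue(pid, %{user_id: 2, data: "world"})
IO.inspect(PushQueue.drain(pid))
```

Because callers only depend on `enqueue/2`, the storage behind it could later be swapped for something with stronger guarantees without changing the code that produces events.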

Concurrency

Our data pipeline should have limited concurrency so we don’t overwhelm our application. This is use-case dependent, since different applications are more likely to overwhelm different system components.
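One way to picture limited concurrency is the sketch below, which uses the standard library’s `Task.async_stream/3` with its `:max_concurrency` option. This is not how GenStage limits work (GenStage relies on demand from its consumers); it only illustrates the idea of capping how many pushes run at once. `BoundedPush`, `push_all/3`, and the function passed in as `push_fun` are hypothetical stand-ins for whatever actually sends an event to a client.

```elixir
defmodule BoundedPush do
  # Hypothetical helper: applies `push_fun` to each event with at most
  # `max_concurrency` calls in flight at any time.
  def push_all(events, push_fun, max_concurrency \\ 2) do
    events
    |> Task.async_stream(push_fun, max_concurrency: max_concurrency, timeout: 5_000)
    |> Enum.map(fn {:ok, result} -> result end)
  end
end

# Simulated push: sleeps briefly, then returns a value derived from the event.
fake_push = fn n ->
  Process.sleep(50)
  n * 10
end

IO.inspect(BoundedPush.push_all(1..6, fake_push))
```

With six events and a limit of two, the pushes run in three waves rather than all at once. Results come back in the same order as the input because `Task.async_stream/3` is ordered by default.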

Measurable

We must know how long it takes to send data to clients. If it takes one minute to send real-time data, that reduces the application’s usability.
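A minimal sketch of timing a single push, assuming nothing beyond the standard library: `PushTimer` and `measure/1` are hypothetical names, and a real system would likely report the number to a metrics service rather than print it. The sketch uses `System.monotonic_time/0`, which is not affected by changes to the wall clock.

```elixir
defmodule PushTimer do
  # Hypothetical helper: runs `push_fun` and returns {elapsed_ms, result}.
  def measure(push_fun) when is_function(push_fun, 0) do
    start = System.monotonic_time()
    result = push_fun.()
    elapsed = System.monotonic_time() - start
    {System.convert_time_unit(elapsed, :native, :millisecond), result}
  end
end

# Simulated push that takes roughly 20 ms.
{elapsed_ms, :ok} = PushTimer.measure(fn -> Process.sleep(20) end)
IO.puts("push took #{elapsed_ms}ms")
```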

These traits allow us to have more control over how our data pipeline operates, both for the happy path and failure scenarios. There has always been debate over the best technical solution for a data pipeline. A good solution for many use cases is a queue-based, GenStage-powered data pipeline. This pipeline exhibits the above traits while also being easy to configure.

Next, we’ll walk through writing a data pipeline powered by GenStage.
