Data Containers and Pipelines
Understanding the different data containers and the significance of data pipelines for efficient data management.
Big data
AI/ML products run on data. Where and how we store that data is a major decision that directly impacts our AI/ML performance, and in this section, we will go through some of the most popular storage vehicles for it. Figuring out the optimal way to store, access, and train on our data is a specialization in and of itself, but if we’re in the business of AI product management, we’re eventually going to need to understand the basic building blocks of what makes our AI product work. In a few words: data does.
Because AI requires big data, this is going to be a significant strategic decision for our product and business. If we don’t have a well-oiled machine, pun intended, we’re going to run into snags that will impair the performance of our models and, by extension, our product itself. Having a good grasp of the most cost-effective, performance-driven solution for our particular product, and of how to balance these competing facets, will contribute to our success as product managers. Yes, we’ll depend on our technical executives for many of these decisions, but we’ll be at the table helping make them, so some familiarity is needed here. Let’s look at some of the different options for storing data for AI/ML products.
Database
Depending on our organization’s goals and budget, we’ll be centralizing our data somehow between a data lake, a database, and a data warehouse, and we might even be considering a new option—a data lakehouse. If we’re just getting our feet wet, we’re likely just storing our data in a relational database so that we can access it and query it easily. Databases are a great way to do this if we have a relatively simple setup.
With a relational database, we operate under a fixed schema; if we later want to combine this data with data stored in another database, we’ll run into problems aligning the two schemas. If our primary use of the database is querying a certain subset of our company’s data for general trends, a relational database might be enough. If we’re looking to combine datasets from disparate areas of our business and want more advanced analytics, dashboards, or AI/ML functions, we’ll need to read on.
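To make the fixed-schema idea concrete, here is a minimal sketch using Python’s built-in `sqlite3` module. The table, columns, and sample rows are purely illustrative, not from any real product:

```python
import sqlite3

# In-memory relational database; schema is declared up front.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, segment TEXT)"
)
conn.executemany(
    "INSERT INTO customers (name, segment) VALUES (?, ?)",
    [("Acme", "enterprise"), ("Beta LLC", "smb"), ("Corp Co", "enterprise")],
)

# Because the schema is fixed, simple aggregate queries for general trends are easy.
rows = conn.execute(
    "SELECT segment, COUNT(*) FROM customers GROUP BY segment ORDER BY segment"
).fetchall()
print(rows)  # [('enterprise', 2), ('smb', 1)]
conn.close()
```

The upside is exactly what the query shows: fast, easy access to trends within one schema. The downside appears the moment another system stores `segment` under a different name or encoding.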
Data warehouse
If we’re looking to combine data into a location where we can centralize it somewhere and we’ve got lots of structured data coming in, we’re more likely going to use a data warehouse. This is really the first step toward maturity because it will allow us to leverage insights and trends across our various business units quickly. If we’re looking to leverage AI/ML in various ways rather than one specific specialized way, this will serve us well.
Let’s say, for example, that we want to add AI features to our existing product as well as within our function. We’d be leveraging our customer data to offer trends or predictions to our customers based on the performance of others in their peer group, as well as using AI/ML to make predictions or optimizations for our internal employees. Both these use cases would be well served with a data warehouse.
Data warehouses do, however, require some upfront investment to plan and design our data structures. They are also costly to run because they keep data available for analysis on demand, so we’re paying a premium for that readiness. Depending on how advanced our internal users are, we could opt for cheaper options, but a warehouse is optimal for organizations where most business users want easily digestible ways to analyze data. Either way, a data warehouse will allow us to create dashboards for our internal users and stakeholder teams.
Data lake and lakehouse
If we’re sitting on lots of raw, unstructured data, and we want to have a more cost-effective place to store it, we’d be looking at a data lake. Here, we can store unstructured, semi-structured, as well as structured data that can be easily accessed by our more tech-savvy internal users. For instance, data scientists and ML engineers would be able to work with this data because they would be creating their own data models to transform and analyze the data on the fly, but this isn’t the case at most companies.
Keeping our data in a data lake is cheap if we’ve got lots of data our business users don’t need immediately, but it won’t ever really replace a warehouse or a database; it’s more of a “nice to have.” If we’re sitting on a massive lake of historical data we want to use for analytics in the future, we’ll need to consider another way to store it to get those insights.
We might also come across the term lakehouse. There are many databases, data warehouses, and data lakes out there, but the only lakehouse we’re aware of has been popularized by a company called Databricks. It offers something like a data lake with some of the capabilities of a data warehouse, namely, the ability to showcase data, make it available and ingestible for non-technical internal users, and create dashboards with it. The biggest advantage here is that we pay to store the data upfront while retaining the ability to access and manipulate it downstream.
Data pipelines
Regardless of the tech we use to maintain and store our data, we’re still going to need to set up pipelines to make sure our data is moving, that our dashboards are refreshing as readily as our business requires, and that data is flowing the way it needs to. There are also multiple ways of processing and passing data. We might use batch processing, where large amounts of data are moved at set intervals, or real-time (streaming) pipelines that handle data as soon as it’s generated.
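The difference between the two modes can be sketched in a few lines of plain Python. The event records and field names here are hypothetical stand-ins for whatever our product actually emits:

```python
from typing import Iterable, Iterator, List

# Hypothetical click events; in practice these would come from an app or a queue.
events = [{"user": u, "clicks": c} for u, c in [("a", 3), ("b", 5), ("a", 2), ("c", 7)]]

def batch_process(records: List[dict], batch_size: int) -> Iterator[List[dict]]:
    """Batch processing: move chunks of accumulated data at set intervals."""
    for i in range(0, len(records), batch_size):
        yield records[i : i + batch_size]

def stream_process(records: Iterable[dict]) -> Iterator[dict]:
    """Real-time (streaming) processing: handle each record as it arrives."""
    for record in records:
        yield {**record, "processed": True}

batches = list(batch_process(events, batch_size=2))
streamed = list(stream_process(events))
print(len(batches), len(streamed))  # 2 batches vs. 4 individually handled records
```

Batch is simpler and cheaper when freshness can wait; streaming keeps dashboards current at the cost of more infrastructure.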
If we’re looking to leverage predictive analytics, enable reporting, or have a system in place to move, process, and store data, a data pipeline will likely be enough. However, depending on what our data is doing and how much transformation is required, we’ll likely be using both data pipelines and, perhaps, more specifically, ETL pipelines.
ETL
ETL stands for extract, transform, and load. Our data engineers will create these specific pipelines for more advanced needs, such as centralizing all our data in one place, enriching data, connecting our data with customer relationship management (CRM) tools, or transforming data and adding structure to it as it moves between systems.
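The three steps can be sketched end to end with the standard library. Everything here is illustrative: an in-memory CSV stands in for a real source system, the `tier` label stands in for a hypothetical enrichment rule, and `sqlite3` stands in for the warehouse:

```python
import csv
import io
import sqlite3

# Extract: pull raw rows out of a source system (an in-memory CSV here).
raw = io.StringIO("name,revenue\nAcme,1200\nBeta LLC,450\n")
records = list(csv.DictReader(raw))

# Transform: add structure and enrich with a hypothetical tier label.
for r in records:
    r["revenue"] = int(r["revenue"])
    r["tier"] = "enterprise" if r["revenue"] >= 1000 else "smb"

# Load: write the now-structured rows into the destination table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, revenue INTEGER, tier TEXT)")
conn.executemany("INSERT INTO accounts VALUES (:name, :revenue, :tier)", records)

loaded = conn.execute("SELECT name, tier FROM accounts ORDER BY name").fetchall()
print(loaded)  # [('Acme', 'enterprise'), ('Beta LLC', 'smb')]
conn.close()
```

Real ETL jobs run on dedicated tooling rather than hand-rolled scripts, but the extract-transform-load shape is the same.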
The reason for this is that ETL is a necessary step when loading data into a database or data warehouse, which expect structured data. If we’re exclusively using a data lake, we can store the data raw along with the metadata we need, and analyze it to get our insights as we like. In most cases, if we’re working on an AI/ML product, we’ll be working with a data engineer who powers the data flow needed to make our product a success, because we’re likely using a relational database as well as a data warehouse. The analytics required to enable AI/ML features will most likely be powered by a data engineer focused on the ETL pipeline.
Managing and maintaining this system will also be the work of our data engineer, and we encourage every product manager to have a close relationship with the data engineer(s) who support their products.
Note: One key difference between the two is that ETL pipelines are generally updated in batches and not in real time. If we’re using an ETL pipeline, for instance, to update historical daily information about how our customers are using our product to offer client-facing insights in our platform, it might be optimal to keep this batch updating twice daily.
If we need insights to come in real time for a dashboard that’s being used by our internal business users and they rely on that data to make daily decisions, however, we likely will need to resort to a data pipeline that’s updated continuously.
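A twice-daily batch schedule like the one described above can be sketched as a small helper; the 06:00/18:00 run times are invented for illustration:

```python
from datetime import datetime, time, timedelta

def next_batch_run(now: datetime, run_times=(time(6, 0), time(18, 0))) -> datetime:
    """Return the next run of a twice-daily batch job (illustrative run times)."""
    for t in sorted(run_times):
        candidate = datetime.combine(now.date(), t)
        if candidate > now:
            return candidate
    # All of today's runs have passed; the next run is tomorrow's first slot.
    return datetime.combine(now.date() + timedelta(days=1), sorted(run_times)[0])

nxt = next_batch_run(datetime(2024, 1, 1, 12, 0))
print(nxt)  # 2024-01-01 18:00:00
```

A continuously updated pipeline has no such schedule at all; it processes each record on arrival, which is the trade-off the note above describes.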
Match the answers
Match the following scenarios with their correct options.
Company A has structured data, has a good investment to add AI features in existing products, and wants data analytics and trends insights. What is a suitable choice for company A?
Data lake
Company B has lots of unstructured and semi-structured data, does not have much of a budget, and occasionally requires data transformation and analysis. What is a suitable choice for company B?
Relational database
Company C has a minimal budget and structured data and requires frequent data access and general trends within the company. What is a suitable choice for company C?
Data warehouse