Scalable Data Lake
Learn about data lakes and how they can be architected in AWS.
A data lake is a centralized repository for data ingested from many different sources. The term was coined around 2011 to distinguish it from other forms of centralized data storage. The term “data swamp” followed, describing a badly managed data lake.
In this lesson, we consider the AWS approach for setting up a data lake and how a data lake differs from data warehouses and production data stores.
AWS services for a scalable data lake
The AWS team suggests the following two services for setting up a scalable data lake: Simple Storage Service (S3) and Lake Formation.
Amazon S3
Amazon’s Data Lake on AWS architecture recommends S3 as the centralized location to store data of all formats.
Amazon S3 is a scalable and cost-effective way to store a variety of objects and is widely used by AWS customers of all sizes and industries.
Since its launch in 2006, S3 has grown to store over 100 trillion objects and can handle tens of millions of requests per second.
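Amazon S3 resembles a cloud-based file system, but it is an object store. It consists of buckets, and each bucket holds objects identified by a unique key. Folders are not separate entities: a key such as raw/sales/2024-01.csv simply shares the prefix raw/sales/ with its neighbors, and tools present that prefix as a folder.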
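To make the folder-like view of a bucket concrete, here is a minimal sketch in pure Python. It is an in-memory imitation of how a listing can group flat object keys by prefix and delimiter, not a call to the real S3 API. The function name list_prefix, the sample keys, and the dictionary standing in for a bucket are all hypothetical and chosen only for illustration.

```python
def list_prefix(keys, prefix="", delimiter="/"):
    """Return (objects, folders) found directly under `prefix`.

    Illustrative only: this imitates a prefix/delimiter listing over flat
    keys in memory. It does not talk to S3.
    """
    objects = []
    folders = set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Anything past the next delimiter is rolled up into one "folder".
            folders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return objects, sorted(folders)


# Hypothetical bucket: a flat mapping from key to object bytes.
bucket = {
    "raw/sales/2024-01.csv": b"",
    "raw/sales/2024-02.csv": b"",
    "raw/clicks/events.json": b"",
    "README.txt": b"",
}

top_objects, top_folders = list_prefix(bucket)
raw_objects, raw_folders = list_prefix(bucket, prefix="raw/")
sales_objects, sales_folders = list_prefix(bucket, prefix="raw/sales/")

print(top_objects, top_folders)
print(raw_objects, raw_folders)
print(sales_objects, sales_folders)
```

The bucket itself holds only four flat keys. The apparent folders raw/, raw/clicks/, and raw/sales/ appear only because the listing groups keys that share a prefix up to the next delimiter.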