Data Modeling Process
Learn the underlying goals and steps of the Apache Cassandra data modeling process, and how it compares to the conventional approach used in relational databases.
Data modeling in RDBMS vs. Apache Cassandra
In a traditional RDBMS, data modeling is entity-driven and table-centric. Normalized tables hold data, with foreign keys referencing related data in other tables. Query performance depends on how tables are organized and structured and on the table joins a query performs. Referential integrity is enforced by the database.
In contrast, Cassandra’s data modeling is query-driven and query-centric. A table is designed to fulfill a query or a set of queries. Cassandra does not support table joins, and a query must access only a single table, resulting in very fast reads. Thus, Cassandra tables are denormalized and contain all the data (one or more entities) that a query requires. When a single entity is served by multiple queries, each backed by a separate table, the entity’s data is duplicated across those tables.
| Relational Databases | Apache Cassandra |
| --- | --- |
| Relational data modeling methodology | Cassandra data modeling methodology |
| Entity-driven | Query-driven |
| Table-centric | Query-centric |
| Table joins & referential integrity (RI) | Denormalization: no joins, no RI |
| PK for uniqueness | PK for partitioning, uniqueness & ordering |
| Often a SPOF (single point of failure) | Distributed architecture: no SPOF |
| ACID compliant | CAP theorem |
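To make the contrast concrete, the sketch below models one hypothetical user entity that must serve two queries: lookup by username and lookup by email. All table and column names are illustrative, not part of any standard schema; in CQL, each query gets its own denormalized table:

```cql
-- Hypothetical entity: a user queried two ways, so two tables.

-- Q1: fetch a user by username.
CREATE TABLE users_by_username (
    username text,
    email    text,
    name     text,
    PRIMARY KEY ((username))   -- partition key: determines placement and uniqueness
);

-- Q2: fetch a user by email. The same user data is duplicated here,
-- keyed to match the query instead of being joined at read time.
CREATE TABLE users_by_email (
    email    text,
    username text,
    name     text,
    PRIMARY KEY ((email))
);
```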
Cassandra excels at high write throughput, with nearly uniform efficiency across all write operations. Additionally, disk space is a far cheaper resource than CPU, memory, or network capacity. Apache Cassandra therefore uses denormalization and data duplication, spending extra writes to make reads efficient, since reads are typically the costlier operations and present greater optimization challenges.
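Writing the same entity to every table that serves it is the application’s responsibility. As a minimal sketch, reusing the hypothetical tables above, a logged batch can keep the duplicated rows consistent with each other:

```cql
-- One logical write becomes one INSERT per query table.
-- A logged batch guarantees all statements eventually apply together.
BEGIN BATCH
    INSERT INTO users_by_username (username, email, name)
    VALUES ('jdoe', 'jdoe@example.com', 'Jane Doe');
    INSERT INTO users_by_email (email, username, name)
    VALUES ('jdoe@example.com', 'jdoe', 'Jane Doe');
APPLY BATCH;
```

Logged batches add coordinator overhead, so they trade some write latency for atomicity across the duplicated tables.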
Apache Cassandra data modeling goals
To design a successful schema in Apache Cassandra, the following high-level goals must be kept in mind:
Even distribution of data across the cluster
Rows of Cassandra tables are partitioned and distributed around nodes in the cluster based on the hash of the partition key. By spreading data evenly, each node in the cluster is responsible for an equal portion of the data, resulting in load balancing. This ensures optimal performance and prevents some nodes from becoming overwhelmed with a disproportionately large amount of data. Additionally, even data distribution allows even workload distribution, resulting in faster response times and increased throughput.
Even data distribution also enables the system to scale seamlessly as the cluster grows or shrinks, since data and workload are rebalanced proportionally across the nodes.
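The hash-based placement can be observed directly: the built-in token() function returns the partitioner’s token for a row’s partition key. A small sketch against the hypothetical users_by_username table from above:

```cql
-- The partitioner (Murmur3 by default) hashes the partition key to a
-- token; the token decides which nodes own the row's replicas.
SELECT token(username), username, email
FROM users_by_username
WHERE username = 'jdoe';
```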
A table’s partition key plays a crucial role in achieving even data distribution across the cluster. Choosing a suitable partition key requires careful consideration of the data access patterns, query requirements, and cardinality of the data. A good practice is to select a partition key that provides a good distribution of values and avoids data skew, where certain partitions receive significantly more data than others.
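As an illustration, consider a hypothetical time-series table of sensor readings. Keying on the sensor alone concentrates a busy sensor’s data in one unbounded partition; adding a time bucket to the partition key spreads it across the cluster. A sketch, with all names assumed:

```cql
-- Skew-prone: every reading for a sensor lands in a single,
-- ever-growing partition owned by one replica set.
CREATE TABLE readings_by_sensor (
    sensor_id  uuid,
    reading_ts timestamp,
    value      double,
    PRIMARY KEY ((sensor_id), reading_ts)
);

-- Better: a (sensor_id, day) composite partition key bounds partition
-- size and distributes one sensor's data across many nodes.
CREATE TABLE readings_by_sensor_day (
    sensor_id  uuid,
    day        date,               -- time bucket
    reading_ts timestamp,
    value      double,
    PRIMARY KEY ((sensor_id, day), reading_ts)
) WITH CLUSTERING ORDER BY (reading_ts DESC);
```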