...

Data Modeling Process

Learn the underlying goals and steps of the Apache Cassandra data modeling process, and how it compares to the conventional approach used in relational databases.

Data modeling in RDBMS vs. Apache Cassandra

In a traditional RDBMS, data modeling is entity-driven and table-centric. Normalized tables hold data, with foreign keys referencing related data in other tables. Query performance depends on how tables are organized and structured and on the use of table joins. Referential integrity is enforced by the database.

In contrast, Cassandra’s data modeling is query-driven and query-centric. A table is designed to satisfy a query or a set of queries. Cassandra does not support table joins, so a query accesses only a single table, resulting in very fast reads. Cassandra tables are therefore denormalized and contain all the data (from one or more entities) that a query requires. When multiple queries retrieve the same entity and each query is backed by its own table, the entity’s data is duplicated across those tables.
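As a sketch of what this looks like in practice, consider a hypothetical user entity that must be looked up both by id and by email. The table and column names below are illustrative assumptions, not taken from this document; each table is denormalized to answer exactly one query:

```sql
-- Query 1: fetch a user by user_id
CREATE TABLE users_by_id (
    user_id uuid,
    email   text,
    name    text,
    country text,
    PRIMARY KEY (user_id)
);

-- Query 2: fetch a user by email; the same user data is duplicated here
CREATE TABLE users_by_email (
    email   text,
    user_id uuid,
    name    text,
    country text,
    PRIMARY KEY (email)
);
```

The application writes every user to both tables, trading extra writes and disk space for single-table, single-partition reads.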

Relational Databases                      | Apache Cassandra
------------------------------------------|---------------------------------------------
Relational data modeling methodology      | Cassandra data modeling methodology
Entity driven                             | Query driven
Table-centric                             | Query-centric
Table joins & RI (Referential Integrity)  | Denormalization - no joins, no RI
PK for uniqueness                         | PK for partitioning, uniqueness & ordering
Often a SPOF (single point of failure)    | Distributed architecture - no SPOF
ACID compliant                            | CAP theorem
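The "PK for partitioning, uniqueness & ordering" row can be illustrated with a small sketch. In the hypothetical time-series table below (all names are assumptions for this example), the partition key determines where rows are stored, while the clustering column orders rows within each partition and, together with the partition key, makes each row unique:

```sql
CREATE TABLE readings_by_sensor (
    sensor_id    uuid,        -- partition key: hashed to place the partition on nodes
    reading_time timestamp,   -- clustering column: orders rows within the partition
    value        double,
    PRIMARY KEY ((sensor_id), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);
```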

Cassandra excels at high write throughput, providing nearly uniform efficiency for all write operations. Disk space is also a far cheaper resource than CPU, memory, or network bandwidth. Apache Cassandra therefore uses denormalization and data duplication, performing additional writes in order to make reads more efficient, since reads are typically more costly and harder to optimize.

Apache Cassandra data modeling goals

To design a successful schema in Apache Cassandra, the following high-level goals must be kept in mind:

Even distribution of data across the cluster

Rows of Cassandra tables are partitioned and distributed across the nodes in the cluster based on the hash of the partition key. When data is spread evenly, each node in the cluster is responsible for a roughly equal portion of the data, which balances the load. This ensures optimal performance and prevents some nodes from becoming overwhelmed with a disproportionately large amount of data. Additionally, even data distribution allows even workload distribution, resulting in faster response times and increased throughput.
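A minimal way to see this hashing, assuming the readings_by_sensor table sketched earlier, is the built-in token() function. With the default Murmur3Partitioner the token is a 64-bit value, and rows whose partition keys hash into the same token range are stored on the same replicas:

```sql
-- Inspect the token computed from the partition key (illustrative query)
SELECT sensor_id, token(sensor_id), reading_time, value
FROM readings_by_sensor
WHERE sensor_id = 123e4567-e89b-12d3-a456-426614174000;
```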

Even data distribution also enables the system to scale seamlessly through horizontal scaling, that is, the ability of the database system to handle increased load by adding more machines to the cluster rather than upgrading existing hardware to handle more traffic. When adding new nodes to the cluster, Cassandra’s automatic data distribution mechanism ensures that data is spread evenly across the new nodes, maintaining the desired balance.

A table’s partition key plays a crucial role in achieving even data distribution across the cluster. Choosing a suitable partition key requires careful consideration of the data access patterns, query requirements, and cardinality of the data. A good practice is to select a partition key that provides a good distribution of values and avoids data skew, where certain partitions receive significantly more data than others.
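For instance, a single-column partition key can concentrate all of a busy sensor’s data in one ever-growing partition. A common mitigation, shown below with hypothetical names, is a composite partition key that buckets the data by day, so that each partition stays bounded and the load spreads across more partitions:

```sql
CREATE TABLE readings_by_sensor_date (
    sensor_id    uuid,
    reading_date date,        -- bucketing column added to the partition key
    reading_time timestamp,
    value        double,
    PRIMARY KEY ((sensor_id, reading_date), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);
```

The trade-off is that a query spanning several days must now read several partitions, which leads directly to the next goal.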

Minimum number of partitions accessed by a query

...