Data Modeling Process
Learn the underlying goals and steps of the Apache Cassandra data modeling process, and how it compares to the conventional approach used in relational databases.
Data modeling in RDBMS vs. Apache Cassandra
In a traditional RDBMS, data modeling is entity-driven and table-centric. Normalized tables hold data, with foreign keys referencing related data in other tables. Query performance depends on how tables are organized and structured and on the table joins a query performs. Referential integrity is enforced by the database.
In contrast, Cassandra’s data modeling is query-driven and query-centric. A table is designed to fulfill a query or a set of queries. Cassandra does not support table joins, and a query must access only a single table, resulting in very fast reads. Thus, Cassandra tables are denormalized and contain all the data (one or more entities) that a query requires. When a single entity is served by multiple queries, each backed by a separate table, the entity’s data is duplicated across those tables.
| Relational Databases | Apache Cassandra |
| --- | --- |
| Relational data modeling methodology | Cassandra data modeling methodology |
| Entity-driven | Query-driven |
| Table-centric | Query-centric |
| Table joins & referential integrity (RI) | Denormalization: no joins, no RI |
| PK for uniqueness | PK for partitioning, uniqueness & ordering |
| Often a SPOF (single point of failure) | Distributed architecture: no SPOF |
| ACID compliant | CAP theorem |
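To make the contrast concrete, the sketch below models one hypothetical user entity that must serve two queries: lookup by username and lookup by email. All table and column names are illustrative, not part of any standard schema; in CQL, each query gets its own denormalized table:

```cql
-- Hypothetical entity: a user queried two ways, so two tables.

-- Q1: fetch a user by username.
CREATE TABLE users_by_username (
    username text,
    email    text,
    name     text,
    PRIMARY KEY ((username))   -- partition key: determines placement and uniqueness
);

-- Q2: fetch a user by email. The same user data is duplicated here,
-- keyed to match the query instead of being joined at read time.
CREATE TABLE users_by_email (
    email    text,
    username text,
    name     text,
    PRIMARY KEY ((email))
);
```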
Cassandra excels at high write throughput, with nearly uniform efficiency across all write operations. Additionally, disk space is a far cheaper resource than CPU, memory, or network capacity. Apache Cassandra therefore uses denormalization and data duplication, spending extra writes to make reads efficient, since reads are typically the costlier operations and present greater optimization challenges.
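Writing the same entity to every table that serves it is the application’s responsibility. As a minimal sketch, reusing the hypothetical tables above, a logged batch can keep the duplicated rows consistent with each other:

```cql
-- One logical write becomes one INSERT per query table.
-- A logged batch guarantees all statements eventually apply together.
BEGIN BATCH
    INSERT INTO users_by_username (username, email, name)
    VALUES ('jdoe', 'jdoe@example.com', 'Jane Doe');
    INSERT INTO users_by_email (email, username, name)
    VALUES ('jdoe@example.com', 'jdoe', 'Jane Doe');
APPLY BATCH;
```

Logged batches add coordinator overhead, so they trade some write latency for atomicity across the duplicated tables.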
Apache Cassandra data modeling goals
To design a successful schema in Apache Cassandra, the following high-level goals must be kept in mind:
Even distribution of data across the cluster
Rows of Cassandra tables are partitioned and distributed around nodes in the cluster based on the hash of the partition key. By spreading data evenly, each node in the cluster is responsible for an equal portion of the data, resulting in load balancing. This ensures optimal performance and prevents some nodes from becoming overwhelmed with a disproportionately large amount of data. Additionally, even data distribution allows even workload distribution, resulting in faster response times and increased throughput.
Even data distribution also enables the system to scale seamlessly as the cluster grows or shrinks, since data and workload are rebalanced proportionally across the nodes.
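The hash-based placement can be observed directly: the built-in token() function returns the partitioner’s token for a row’s partition key. A small sketch against the hypothetical users_by_username table from above:

```cql
-- The partitioner (Murmur3 by default) hashes the partition key to a
-- token; the token decides which nodes own the row's replicas.
SELECT token(username), username, email
FROM users_by_username
WHERE username = 'jdoe';
```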
A table’s partition key plays a crucial role in achieving even data distribution across the cluster. Choosing a suitable partition key requires careful consideration of the data access patterns, query requirements, and cardinality of the data. A good practice is to select a partition key that provides a good distribution of values and avoids data skew, where certain partitions receive significantly more data than others.
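As an illustration, consider a hypothetical time-series table of sensor readings. Keying on the sensor alone concentrates a busy sensor’s data in one unbounded partition; adding a time bucket to the partition key spreads it across the cluster. A sketch, with all names assumed:

```cql
-- Skew-prone: every reading for a sensor lands in a single,
-- ever-growing partition owned by one replica set.
CREATE TABLE readings_by_sensor (
    sensor_id  uuid,
    reading_ts timestamp,
    value      double,
    PRIMARY KEY ((sensor_id), reading_ts)
);

-- Better: a (sensor_id, day) composite partition key bounds partition
-- size and distributes one sensor's data across many nodes.
CREATE TABLE readings_by_sensor_day (
    sensor_id  uuid,
    day        date,               -- time bucket
    reading_ts timestamp,
    value      double,
    PRIMARY KEY ((sensor_id, day), reading_ts)
) WITH CLUSTERING ORDER BY (reading_ts DESC);
```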