Search Restrictions Solution: Denormalization

Revisit the permitted search conditions for querying an Apache Cassandra table and explore denormalization as a viable solution to the search restrictions.

We'll cover the following

In Apache Cassandra, the table's primary key imposes restrictions on search conditions, impacting design of the relational model. In this lesson we will focus on these limitations and denormalization as a potential solution.

To reiterate, in a SELECT query:

  • Only the columns that are part of the primary key can be used in the WHERE clause as filtering conditions.

  • Only equality searches can be performed on partition key columns. Range queries are allowed on clustering columns.

  • If the partition key is composed of multiple columns (composite partition key), all partition key columns must be included in the WHERE clause. 

  • Clustering columns may not be included in the search condition. To include a clustering column in the WHERE clause, all clustering columns defined before it in the primary key must also be included in the search condition.

Denormalization

In relational databases, data modeling revolves around entities and the relationships between them. In Cassandra data modeling, access patterns/queries are considered at the beginning of the process and tables are designed to fulfill these queries.

Denormalization is a key practice in Cassandra table design. As Cassandra does not support joins and a SELECT query targets a single table only, all the requested data must be present in that table.

For example, let's assume user feedback on courses and instructors is collected in the form of rating and comments. In a relational database, separate tables would be created for each entity—users, courses, instructors, and ratings, etc. Queries would join these tables at runtime to deliver the required data. In Cassandra, a query retrieves data from one table and the table’s partition key value must be specified to retrieve the data. Thus, to maintain feedback for instructors a table ratings_by_instructor would be created with partition key instructor. Similarly, to maintain feedback for courses, a table ratings_by_course would be created with partition key course title. Both tables have the same columns, and duplicate data. The only difference between the tables is the partition key.

Denormalization involves the creation of tables with duplicate data to address specific queries, which would have required a join operation in a relational database. Denormalization can be considered as performing a join on write. Cassandra offers fast writes so writing to multiple tables does not have a huge performance impact. Denormalization ensures data is readily available for a read, as and when required. The result is an efficient data model with predictable query performance.

In the previous section, we explored the courses_by_category table partitioned and distributed around the cluster on the category column. The selection of category as a partition key forces all queries to include the partition key in query filtering conditions. The courses_by_category table cannot be queried specifying the instructor only.

To fulfill queries regarding courses with the instructor’s value only, a new table courses_by_instructor needs to be created, with a partition key: instructor.

Get hands-on with 1300+ tech skills courses.