Search Restrictions Solution: Secondary Index

Review the search conditions for querying an Apache Cassandra table and examine Secondary Index as an effective solution to address these search limitations.

In Apache Cassandra, the primary key of a table restricts search conditions affecting the design of the relational model. This lesson will address these limitations and explore the secondary index as a solution.

To reiterate, in a SELECT query:

  • Only the columns that are part of the primary key can be used in the WHERE clause as filtering conditions.

  • Only equality searches can be performed on partition key columns. Range queries are allowed on clustering columns.

  • If the partition key is composed of multiple columns (composite partition key), all partition key columns must be included in the WHERE clause. 

  • Clustering columns may not be included in the search condition. To include a clustering column in the WHERE clause, all clustering columns defined before it in the primary key must also be included in the search condition.

Secondary index (2i)

A secondary index, denoted by the abbreviation 2i, may be created to query a table based on a non-primary key column, a clustering column, or part of a composite partition key. A secondary index is created on a column of an existing table.

When an index is created, Cassandra creates a table in the background to store the index data. This table has the index column as the partition key and the primary key column(s) of the base table as clustering columns.

Upon record insertion in a table with attached secondary index, Cassandra writes to both the index, and the base table. When the table is queried based on the secondary index, Cassandra first retrieves the primary keys for all matching rows from the index table. These primary keys are then used to fetch data from the base table.

It is important to note that Cassandra does not partition and distribute the index table around the cluster like regular tables. The index table exists on each node, alongside the data it is indexing. 

To fulfill a query based on a secondary index, Cassandra reads the index table on each node and collects the required data, resulting in high latency and performance degradation.

The following are anti-patterns for secondary indexes:

  • Columns with very low cardinality (very few distinct values). e.g., a secondary index on gender or course_status results in a small number of very wide partitions.

  • Columns with very high cardinality (too many distinct values), e.g., a secondary index on customer_ID or product_SKU, resulting in the entire cluster being searched for one record.

  • Columns whose value is updated or deleted frequently. 

To demonstrate, we will create two secondary indexes for the courses_by_category table defined below.

Get hands-on with 1300+ tech skills courses.