Features

Here are some key features and characteristics of Cassandra:

Distributed and decentralized

Cassandra is distributed and decentralized by design. It runs on multiple machines while appearing as a unified whole. It distributes and replicates data on a number of nodes across multiple data centers in the cluster. 

Cassandra has a peer-to-peer architecture. Each node in the cluster is identical and has the same role. Nodes use a gossip protocol to exchange state information, so each node maintains a list of the nodes that are alive. Thus, any node can service any read/write request, which avoids a single point of failure and provides high availability.
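The way gossip spreads membership information can be illustrated with a toy simulation. This is only a minimal sketch under simplifying assumptions: the function name `gossip_until_converged`, the node names, the one-peer-per-round rule, and the heartbeat counters are all hypothetical, and the code does not reproduce Cassandra's actual gossip messages.

```python
import random


def gossip_until_converged(node_ids, seed=42, max_rounds=100):
    """Toy gossip: return (rounds, views) once every node knows every member."""
    rng = random.Random(seed)
    # Each node starts out knowing only its own heartbeat counter.
    views = {node: {node: 0} for node in node_ids}
    for round_no in range(1, max_rounds + 1):
        for node in node_ids:
            views[node][node] += 1  # the node bumps its own heartbeat
            peer = rng.choice([p for p in node_ids if p != node])
            # Push-pull exchange: both sides keep the highest heartbeat seen.
            merged = dict(views[peer])
            for member, beat in views[node].items():
                merged[member] = max(beat, merged.get(member, 0))
            views[node] = dict(merged)
            views[peer] = dict(merged)
        if all(len(view) == len(node_ids) for view in views.values()):
            return round_no, views
    raise RuntimeError("views did not converge within max_rounds")


nodes = [f"node{i}" for i in range(1, 9)]
rounds, views = gossip_until_converged(nodes)
print(f"every node knows all {len(nodes)} members after {rounds} round(s)")
```

Because every node talks to a randomly chosen peer and no node plays a special role, the membership list spreads without any central coordinator.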

Figure: Cassandra's distributed and decentralized architecture

Elastic scalability

Cassandra is designed for linear scalability: read and write throughput increase roughly in proportion to the number of nodes in the cluster, while response times stay low. This is in contrast to traditional relational databases, which typically do not scale linearly and often require downtime to scale. Nodes can be added or removed seamlessly, without any disturbance to the applications.
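One reason nodes can be added with little disruption is that Cassandra places data on a token ring using consistent hashing, so a new node takes over only a share of the existing keys. The sketch below illustrates that idea under stated assumptions: the `Ring` class, the MD5-based `token` function, the node names, and the number of virtual nodes are hypothetical choices for illustration, not Cassandra's actual partitioner or configuration.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a ring position (MD5 is used here only for illustration)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class Ring:
    """Toy token ring in which each node owns several virtual-node tokens."""

    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self.tokens = []  # sorted ring positions
        self.owners = {}  # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes):
            position = token(f"{node}#{i}")
            bisect.insort(self.tokens, position)
            self.owners[position] = node

    def owner(self, key):
        # The key belongs to the first token at or after its position,
        # wrapping around to the start of the ring if necessary.
        index = bisect.bisect_left(self.tokens, token(key)) % len(self.tokens)
        return self.owners[self.tokens[index]]


keys = [f"user-{i}" for i in range(10_000)]
ring = Ring(["n1", "n2", "n3", "n4"])
before = {key: ring.owner(key) for key in keys}
ring.add_node("n5")
after = {key: ring.owner(key) for key in keys}
moved = [key for key in keys if before[key] != after[key]]
print(f"{len(moved) / len(keys):.1%} of keys moved, all of them to n5")
```

Only the keys that now fall into the new node's token ranges change owner; the other nodes keep serving the rest of their data unchanged.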

Figure: Cassandra's architecture scales linearly, with added nodes boosting performance and throughput

Netflix has been using Cassandra in production. The chart below from Netflix demonstrates Cassandra’s linear scalability as per “Benchmarking Cassandra Scalability on AWS — Over a million writes per second”.

Figure: Cassandra’s scale-up linearity at Netflix

High availability and fault tolerance

Cassandra is architected as a distributed, peer-to-peer system in which any node can handle any read/write request, making it highly available. It offers configurable replication strategies. The automatic replication of data across multiple nodes provides fault tolerance and can improve read performance. Failed nodes can be replaced without downtime, which supports disaster recovery. Cassandra is tailored for multiple datacenter deployments and supports replication across multiple datacenters. This makes it suitable for applications that can’t afford to lose data.

Failure detection is achieved by maintaining and monitoring a sliding window of gossip message arrival times at each node. If a node is down, the operations it has missed are saved as hints. Messages are periodically sent to re-establish connection. Once the node comes back online, repair mechanisms (hinted handoffs or manual repair) are in place to synchronize and exchange the operations it missed.
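The sliding-window idea can be sketched as a simplified accrual failure detector. The code below is a minimal illustration, not Cassandra's implementation: the class name `PhiFailureDetector`, the window size, the threshold of 8, and the assumption that gaps between gossip messages follow an exponential distribution are all simplifications chosen for this example.

```python
import math
from collections import deque


class PhiFailureDetector:
    """Simplified accrual failure detector over a sliding window of arrival gaps."""

    def __init__(self, window_size=1000, threshold=8.0):
        self.intervals = deque(maxlen=window_size)  # most recent arrival gaps
        self.last_arrival = None
        self.threshold = threshold

    def heartbeat(self, now):
        """Record the arrival time of a gossip message from the monitored node."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level: grows the longer the node stays silent."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_arrival
        # With exponentially distributed gaps, P(gap > elapsed) = exp(-elapsed / mean),
        # and phi is defined as -log10 of that probability.
        return elapsed / mean / math.log(10)

    def is_alive(self, now):
        return self.phi(now) < self.threshold


detector = PhiFailureDetector()
for second in range(11):  # one message per second for ten seconds
    detector.heartbeat(float(second))
print(detector.is_alive(11.0))  # a short silence keeps phi low
print(detector.is_alive(40.0))  # a long silence pushes phi past the threshold
```

Instead of a fixed timeout, the detector adapts to the observed message rhythm: a node on a slow link needs a longer silence before it is suspected.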

Tunable consistency

In the tradeoff between availability and consistency, Cassandra favors availability by default. However, the consistency level that the client sets for each read or write makes consistency tunable. Through read repair (out-of-date replicas detected during a read are updated with the latest data) and hinted handoff (data that was meant for a currently down replica is re-sent as soon as that replica returns to the ring), Cassandra ensures eventual consistency.
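A common rule of thumb for tuning consistency is that a read is guaranteed to see the latest acknowledged write when the number of replicas contacted on read plus the number contacted on write exceeds the replication factor. The following sketch works through that arithmetic; the helper names are hypothetical, while ONE, TWO, THREE, QUORUM, and ALL are Cassandra consistency levels.

```python
def quorum(replication_factor):
    """A quorum is a majority of the replicas."""
    return replication_factor // 2 + 1


def replicas_required(level, replication_factor):
    """Number of replicas that must respond for a given consistency level."""
    required = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return required[level]


def read_overlaps_write(read_level, write_level, replication_factor):
    """True when every read set must share at least one replica with every write set."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor


for read_level, write_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ONE", "ALL")]:
    overlap = read_overlaps_write(read_level, write_level, replication_factor=3)
    print(f"read={read_level}, write={write_level}: overlap guaranteed = {overlap}")
```

With a replication factor of 3, reading and writing at ONE favors availability and latency, while QUORUM for both trades some availability for reads that always reach a replica holding the latest write.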

CAP theorem and Cassandra

The CAP theorem, also known as Brewer’s theorem, states that, in any distributed system, only two of the three desirable characteristics can coexist simultaneously: consistency (C), availability (A), and partition tolerance (P).

  • Consistency means every read obtains the most recent write. 

  • Availability means every request receives a non-error response. The data read may be stale.

  • Partition tolerance means the system is operational despite node and network failures. Partition here refers to a communication break between nodes.

The CAP theorem implies that a distributed system, with the inherent risk of partition failures, needs to choose between consistency and availability. Hence, a distributed system must always be partition tolerant and can either be CP or AP.

A CP system works to provide consistent data at all times and will trade off availability by canceling requests when faced with partition failures. Meanwhile, an AP system will safeguard availability and compromise consistency by providing stale data when partition failures occur.
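The difference between the two choices can be made concrete with a toy model. The sketch below is deliberately simplistic and entirely hypothetical (the `Replica` class, the `UnavailableError` exception, and the two-replica setup are invented for illustration); it only shows what a client observes when it reads from a replica that a partition has cut off from a newer write.

```python
class UnavailableError(Exception):
    """Raised by a CP replica that cannot confirm it holds the latest value."""


class Replica:
    def __init__(self, mode, value):
        self.mode = mode  # "CP" or "AP"
        self.value = value
        self.partitioned = False  # True while cut off from the other replica

    def read(self):
        if self.partitioned and self.mode == "CP":
            # CP: refuse to answer rather than risk returning stale data.
            raise UnavailableError("cannot reach the other replica")
        # AP: always answer, even though the value may be stale.
        return self.value


def read_during_partition(mode):
    """Write 'v2' to one replica while the other is cut off, then read the cut-off one."""
    a, b = Replica(mode, "v1"), Replica(mode, "v1")
    b.partitioned = True
    a.value = "v2"  # the write reaches only replica a
    try:
        return b.read()
    except UnavailableError:
        return "error"


print("AP read returns:", read_during_partition("AP"))
print("CP read returns:", read_during_partition("CP"))
```

The AP replica stays available but serves the old value, whereas the CP replica stays consistent by rejecting the request.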

Figure: CAP theorem

Cassandra is an AP database by design and has been optimized for availability and partition tolerance. Its peer-to-peer architecture helps deliver constant availability and partition tolerance, resulting in a highly performant system. It offers tunable consistency, whereby some availability may be traded for stronger consistency when the application needs it. It also features repair functionalities that reconcile inconsistencies so that replicas eventually become consistent, often within milliseconds, which makes the tradeoff unnoticeable and worthwhile in most cases.

Platform agnostic

Cassandra is vendor agnostic, granting maximum flexibility. A single database may be deployed on-premises, on any single cloud provider, on multiple cloud providers, or any combination of these.

Figure: Apache Cassandra supports hybrid and multi-cloud deployments

Open-source

Cassandra is an open-source distributed NoSQL database maintained by the Apache Software Foundation. The open-source nature of Apache Cassandra provides the following advantages:

  • It allows organizations to use Cassandra without incurring licensing costs, making it a cost-effective and vendor-independent solution. 

  • The open nature of the project encourages community involvement, resulting in a wealth of community-driven resources, documentation, and support. 

  • Being open-source ensures transparency, as users can review the source code and understand how the database works, promoting trust and security.

  • The open-source model enables organizations to customize and extend Cassandra to meet their specific requirements. This extensibility empowers users to tailor the database to their needs and leverage its flexibility for evolving data models and scaling demands.

  • Organizations that adopt open-source software can often innovate faster, because they build on shared community work instead of starting from scratch.

Overall, Apache Cassandra’s open-source nature provides affordability, community support, transparency, and flexibility, making it an attractive choice for organizations seeking a robust and scalable distributed database solution.

Flexible schema

Apache Cassandra offers a flexible schema, one of its distinguishing features compared to traditional relational databases. Cassandra supports the addition of new columns on the fly, even to tables that already hold data. This means that as the data evolves, new columns can be introduced without requiring downtime or schema migrations.

Additionally, Cassandra offers three types of collections: lists, sets, and maps. These allow for storing multiple values within a single column, providing an efficient way to handle complex and varied data structures.

Furthermore, Cassandra supports sparse columns, with each row only containing columns populated with data, providing efficient storage.

Figure: Flexible schema

Overall, Cassandra’s flexible schema provides the agility and scalability required for handling large amounts of evolving and diverse data. It allows for easy adaptability, dynamic data modeling, and seamless schema evolution without compromising the performance and availability of the database.

High performance

Cassandra is designed to provide high performance under heavy workloads. Its distributed, peer-to-peer architecture allows linear scalability while enabling parallel processing and efficient data retrieval. 

Cassandra uses a combination of in-memory and on-disk structures to store and retrieve data. Writes typically complete within microseconds to milliseconds, while reads typically complete within milliseconds, largely regardless of whether the database spans three nodes or thousands of nodes.
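The interplay of in-memory and on-disk structures can be sketched with a tiny log-structured store. In Cassandra the corresponding pieces are the commit log, the memtable, and SSTables; everything else here (the `TinyLSMStore` class, the flush limit of three entries, plain Python lists and dicts standing in for files) is a hypothetical simplification that leaves out compaction, Bloom filters, and real disk I/O.

```python
class TinyLSMStore:
    """Toy write path: commit log + memtable in memory, flushed to sorted tables."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit
        self.commit_log = []  # append-only record of writes not yet flushed
        self.memtable = {}    # in-memory table holding the most recent writes
        self.sstables = []    # immutable sorted tables, oldest first

    def write(self, key, value):
        # A write is just a sequential append plus an in-memory update,
        # which is why writes are cheap.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Persist the memtable as a sorted, immutable table and start afresh.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log = []

    def read(self, key):
        # Check the newest data first: the memtable, then tables newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            if key in sstable:
                return sstable[key]
        return None


store = TinyLSMStore()
for key, value in [("a", 1), ("b", 2), ("c", 3), ("a", 10)]:
    store.write(key, value)
print(store.read("a"), store.read("b"), len(store.sstables))
```

Existing tables are never modified in place; an update simply lands in a newer structure that reads consult first.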

Apache Cassandra’s scalability, fault tolerance, and low-latency data operations make it suitable for large-scale, data-intensive, demanding applications.

CQL

CQL (Cassandra Query Language), introduced in release 0.8, is a query language specifically designed for Cassandra. It provides a structured and user-friendly way to interact with the database.  

Figure: CQL features

CQL has an SQL-like syntax, which makes it easy to pick up for developers familiar with relational databases. CQL provides a powerful and intuitive interface for interacting with Apache Cassandra, allowing developers to efficiently create, modify, and query the database while leveraging Cassandra’s distributed architecture, scalability, and tunable consistency.
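To give a feel for the SQL-like syntax, the sketch below holds a few representative CQL statements as Python strings. The keyspace `shop`, the table `users`, and all column names and values are hypothetical. The statements are only printed here; running them would require a Cassandra cluster and a client driver.

```python
# Hypothetical keyspace, table, and column names, used only for illustration.
CQL_STATEMENTS = [
    # A keyspace that keeps three replicas of every row.
    "CREATE KEYSPACE IF NOT EXISTS shop "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};",
    # A table with a set collection column.
    "CREATE TABLE IF NOT EXISTS shop.users ("
    "user_id uuid PRIMARY KEY, name text, emails set<text>);",
    # A column added later, without rewriting existing rows.
    "ALTER TABLE shop.users ADD last_login timestamp;",
    # Insert a row; columns that are not listed are simply left unset.
    "INSERT INTO shop.users (user_id, name, emails) "
    "VALUES (123e4567-e89b-12d3-a456-426614174000, 'Ada', {'ada@example.com'});",
    # Query by primary key.
    "SELECT name, emails FROM shop.users "
    "WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;",
]

for statement in CQL_STATEMENTS:
    print(statement)
```

The statements read much like SQL, but queries are expected to filter on the primary key, which reflects how Cassandra distributes rows across nodes.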

Use cases

Based on the features described above, Cassandra offers high scalability, high availability, and low-latency data access, along with robust data management capabilities. Its distributed architecture, fault tolerance, and ability to handle massive amounts of data make it a popular choice for applications dealing with big data, real-time analytics, and mission-critical systems. Some notable use cases for Cassandra are listed below:

Big data and analytics

Cassandra is often employed in big data analytics environments, where it can efficiently handle high-volume, high-velocity, and high-variety data. It enables real-time analysis of large datasets, making it suitable for applications such as fraud detection, recommendation engines, and Internet of Things (IoT) data processing.

Time series data applications

Cassandra’s ability to handle time-series data makes it valuable for applications that involve collecting and analyzing data over time, such as financial systems, IoT sensor data, monitoring and logging systems, and real-time event tracking.

High throughput workloads

Cassandra’s distributed architecture and linear scalability allow it to handle high-throughput workloads. A sufficiently large cluster can support millions of read and write operations per second, making it suitable for applications that must handle massive numbers of concurrent requests, such as social media platforms, e-commerce websites, and content management systems.

Geographically distributed deployments

Cassandra’s built-in support for multi-datacenter replication enables a single Cassandra database to spread across multiple datacenters in different regions of the world. For example, data written to a node in a datacenter located in America is automatically replicated to datacenters in Europe, Asia, and Australia without any manual intervention, so that it soon becomes available for local reads in Australia. This capability makes Cassandra well suited for applications that require global scalability and low-latency data access, such as global financial systems, multi-region e-commerce platforms, and content delivery networks (CDNs).

High write workloads

Cassandra’s architecture is optimized for write-heavy workloads. It employs a log-structured storage design that enables efficient and fast write operations. This makes it suitable for applications that involve continuous data ingestion, such as data streaming platforms, real-time analytics, and transactional systems.

Elastic scalable applications

Cassandra’s ability to scale horizontally across commodity hardware allows it to handle growing datasets and increasing traffic loads. Nodes can easily be added or removed to accommodate changes in demand. This built-in scalability and elasticity make it a valuable choice for applications that experience unpredictable growth patterns or need to scale rapidly, such as online gaming platforms, social networks, and ad-serving systems.