History

Explore the different technological advancements that enabled and contributed to Cassandra's development.

What is Apache Cassandra?

Apache Cassandra is a free, open-source, distributed NoSQL database management system designed for high availability and scalability. It combines Amazon Dynamo’s distributed storage and replication with Google Bigtable’s data and storage engine model. It is significantly different from traditional relational database management systems (RDBMS) and is queried through the Cassandra Query Language (CQL). Cassandra has a database infrastructure primed for the big data demands of the 2020s and beyond.

Cassandra is used by approximately 90 percent of the Fortune 100. Some of the biggest companies on the web, including Netflix, X (formerly Twitter), and eBay, have adopted Cassandra’s distributed NoSQL architecture to power heavy web applications. It can scale rapidly and run on cheap commodity hardware while being globally distributed and always-on.

Chronology

First, let’s take a look at how database technology progressed with time, and how different technological advancements led to the development of Cassandra.

1970s – Relational data model

  • In June 1970, IBM researcher Dr. E. F. (Ted) Codd published the research paper “A Relational Model of Data for Large Shared Data Banks,” which introduced a new and effective way to model data as a set of cross-linked tables. The relational model stored each piece of data just once, saving expensive disk space. It also captured the relationships between different data records. Unlike navigational databases, where records were reached by following links, relational databases could be searched by content.

  • In 1976, Peter Chen proposed the entity-relationship (ER) model.

  • Research at IBM from 1974 to 1978 resulted in the creation of System R, a prototype relational database. It was the first database to implement Structured Query Language (SQL).

1980s – RDBMS growth and OODBMS

  • During the 1980s, relational databases gained dominance as earlier navigational models declined, and SQL became the standard database language. The mid-1980s also marked the rise of successful commercial object-oriented database management systems (OODBMS).

  • January 1, 1983, is widely regarded as the birth of the internet. On that date, ARPANET adopted the Transmission Control Protocol/Internet Protocol (TCP/IP), which enabled various computer networks to communicate with each other in a universal standard language.

  • In May 1987, Sybase released its first relational database system, marketed as the first high-performance relational database for online applications. Instead of using a mainframe computer as a central bank for data storage, it implemented a client/server architecture.

  • In 1989, Tim Berners-Lee invented the World Wide Web (WWW) at CERN, with HTML as its publishing language, for information sharing between scientists in universities and institutes around the world.

1990s – Web 1.0

  • In the 1990s, Informix marketed Illustra, the first commercial object-relational database management system (ORDBMS). ORDBMS extended existing relational database concepts by adding objects.

  • Web 1.0, or read-only web, refers to the first stage of the World Wide Web’s evolution, from roughly 1991 to 2004. During this decade, the use of the Internet grew rapidly for emails, mailing lists, e-commerce, online shopping using Amazon and eBay, online forums, blogs, personal websites, etc.

  • In 1993, the World Wide Web was placed in the public domain, opening it up to anyone who had a computer. Interest in online businesses fuelled demand for client-server database systems, resulting in exponential growth of the database industry.

  • In 1998, the term NoSQL (not only structured query language) was coined by developer Carlo Strozzi, who created the Strozzi NoSQL open-source relational database. It was a relational database without an SQL interface.

2000s – Web 2.0, big data, and NoSQL databases

  • In 2004, the term “Web 2.0” was popularized by Tim O’Reilly and Dale Dougherty. Web 2.0 refers to the era of the participative web or the social web, where the emphasis is on user-generated content, ease of use, and compatibility with an array of end-user devices.

  • On February 4, 2004, Facebook was launched. Facebook created Cassandra in 2007 to solve its inbox search problem.

  • In 2005, the term “big data” was coined by Roger Mougalas from O’Reilly Media. Big data is characterized by the three V’s: the data is large in volume, comes in great variety, and arrives with great velocity. Big data is too large, unstructured, and complex to be managed and processed by traditional data processing software. Big data can help address business problems that could not be resolved earlier.

The advent of Web 2.0 and big data posed a scalability problem for relational databases, which were architected to run on a single machine and could not scale up to handle the large volumes of data created by billions of networked humans. The solution was to shift from one database server to a cluster of database nodes working together. The research in this regard gave rise to NoSQL databases. The term NoSQL refers to databases that use a query language other than SQL to store and retrieve data and that run as a distributed database system.

  • On February 14, 2005, YouTube was launched by three former PayPal employees, with the intention of giving ordinary people an online platform to share their homemade videos. Within a year, YouTube was serving more than 100 million video views per day.

  • On April 1, 2006, Hadoop, an open-source software framework, was created by Doug Cutting and Mike Cafarella at Yahoo!. Hadoop is a top-level Apache project for efficiently storing and processing big data in a distributed manner using large clusters of inexpensive commodity hardware.

  • In 2006, Google presented the revolutionary research paper “Bigtable: A Distributed Storage System for Structured Data,” which kicked off the NoSQL industry.

The paper described the successful implementation of Google’s internal database, Bigtable. Google had started building it in 2004 to provide real-time access to petabytes of data, because its search engine’s web indexes had become massive and rebuilding them took a long time. Bigtable powers Google Analytics, web indexing, Gmail, YouTube, Google Maps, Google Books search, MapReduce, and more. Cassandra and Apache HBase were modeled after Bigtable.

Bigtable’s distributed storage had an innovative data model that could support structured data and provide linear scalability, reliability, low latency, and high throughput. Bigtable stores data in sparsely populated tables, with each row indexed by a single row key. The row/column intersection can contain multiple cells; each cell represents a timestamped version of the row/column data value. In case a column is not used for a particular row, it does not consume space. Compression algorithms are in place to achieve high capacity.
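The data model described above can be sketched in a few lines of Python. This is only an illustrative toy, not Bigtable's actual implementation: the class name `SparseTable`, its methods, and the sample row and column names are all hypothetical, and compression and distribution are left out entirely.

```python
from collections import defaultdict


class SparseTable:
    """Toy sparse, versioned table: row key -> column -> {timestamp: value}.

    Hypothetical illustration only; names and structure are assumptions.
    """

    def __init__(self):
        # Nested dictionaries: a cell exists only if it was written,
        # so unused columns in a row take up no space.
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column, value, timestamp):
        """Store one timestamped version of a cell."""
        self._rows[row_key][column][timestamp] = value

    def get(self, row_key, column, timestamp=None):
        """Return the newest version at or before `timestamp`.

        With timestamp=None the latest version is returned.
        Returns None if the cell has no matching version.
        """
        versions = self._rows.get(row_key, {}).get(column, {})
        candidates = [ts for ts in versions
                      if timestamp is None or ts <= timestamp]
        if not candidates:
            return None
        return versions[max(candidates)]

    def cell_count(self):
        """Number of stored cell versions across the whole table."""
        return sum(len(versions)
                   for columns in self._rows.values()
                   for versions in columns.values())


table = SparseTable()
table.put("com.example.www", "contents:html", "<html>v1</html>", timestamp=1)
table.put("com.example.www", "contents:html", "<html>v2</html>", timestamp=2)
table.put("org.example.www", "anchor:home", "Example", timestamp=1)

print(table.get("com.example.www", "contents:html"))               # latest version
print(table.get("com.example.www", "contents:html", timestamp=1))  # older version
print(table.get("org.example.www", "contents:html"))               # never written: None
```

Each row here is looked up by a single row key, each row/column intersection holds several timestamped versions, and a column that a row never uses simply has no entry.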

  • In 2007, Apple Inc. released the first iPhone. It was marketed as the first smartphone to offer the full internet rather than a reduced mobile version. The mobile revolution greatly accelerated Web 2.0 by providing internet access to much of human society. People use smartphones on the go to communicate, take and share pictures and videos, shop, and seek information, all of which generates big data.

  • In 2007, Amazon released the influential research paper “Dynamo: Amazon’s Highly Available Key-value Store,” which inspired a number of NoSQL databases, including Apache Cassandra.

This paper described the lessons Amazon learned from its experimental database, Amazon Dynamo. In 2004, Amazon.com decided to build its own database because its growth was starting to hit the upper scaling limits of its Oracle database.

As most queries accessed data only through the primary key, Amazon Dynamo used a key-value store instead of a relational data model. The data was replicated and distributed across multiple database nodes, resulting in a highly available NoSQL database. It also saved the cost of high-end hardware and skilled personnel required to manage an RDBMS.
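As a rough sketch of the idea, the following Python toy stores each key on several nodes so that reads still succeed when one node is down. It is a simplified illustration rather than Dynamo's real design: the class `ReplicatedKVStore`, its methods, and the node names are hypothetical, the hash ring has no virtual nodes, and versioning, quorums, and repair are omitted.

```python
import bisect
import hashlib


def _token(value: str) -> int:
    """Map a string to a position on the hash ring (MD5 used only for spread)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class ReplicatedKVStore:
    """Toy replicated key-value store; all names here are assumptions."""

    def __init__(self, node_names, replication_factor=3):
        if not 1 <= replication_factor <= len(node_names):
            raise ValueError("replication_factor must be between 1 and the node count")
        self.replication_factor = replication_factor
        # Place every node on the ring, sorted by token.
        self._ring = sorted((_token(name), name) for name in node_names)
        self._tokens = [token for token, _ in self._ring]
        self._data = {name: {} for name in node_names}  # per-node storage
        self._down = set()                              # simulated failures

    def replicas_for(self, key):
        """Walk clockwise from the key's token and pick the next N nodes."""
        start = bisect.bisect_right(self._tokens, _token(key)) % len(self._ring)
        return [self._ring[(start + i) % len(self._ring)][1]
                for i in range(self.replication_factor)]

    def set_down(self, node, down=True):
        """Mark a node as failed (or recovered) for the simulation."""
        if down:
            self._down.add(node)
        else:
            self._down.discard(node)

    def put(self, key, value):
        """Write to every reachable replica; return how many accepted it."""
        written = 0
        for node in self.replicas_for(key):
            if node not in self._down:
                self._data[node][key] = value
                written += 1
        return written

    def get(self, key):
        """Read from the first reachable replica that holds the key."""
        for node in self.replicas_for(key):
            if node not in self._down and key in self._data[node]:
                return self._data[node][key]
        return None


store = ReplicatedKVStore(["node-a", "node-b", "node-c", "node-d"])
store.put("cart:42", ["book", "pen"])
store.set_down(store.replicas_for("cart:42")[0])  # lose one replica
print(store.get("cart:42"))                       # still served by another replica
```

Because access goes only through the key, no joins or secondary lookups are needed, which is what makes this kind of partitioning and replication straightforward.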

Figure: Major events leading to Cassandra (2004-2009)

The advent of Cassandra

In October 2009, Facebook presented a research paper, “Cassandra - A Decentralized Structured Storage System,” at the Large-Scale Distributed Systems and Middleware (LADIS) Conference.

Avinash Lakshman, one of the authors of Amazon Dynamo, and Prashant Malik developed Cassandra at Facebook in 2007 to power its inbox search feature. The aim was to develop a storage infrastructure that could provide search results for conversations and other content with low latency and high availability for incrementally growing data in a cost-effective manner.

Cassandra achieved the above requirements by combining the distribution model of Amazon’s Dynamo database with the storage engine of Google’s Bigtable database.

Figure: Cassandra: A combination of Amazon Dynamo and Google Bigtable

The architecture allowed horizontal scaling, high availability, and durability by partitioning and replicating the data across multiple nodes. For failure detection, each node maintained gossip message timestamps to determine locally whether another node was up or down. A bootstrapping algorithm allowed nodes to be added to the system to alleviate a heavily loaded node, with all replicas participating in the data transfer to speed up the process. For recoverability and durability, local persistence was employed: each write operation was first written to a commit log and, upon success, applied to an in-memory data structure. When the in-memory data structure reached a threshold size, it was dumped into a file on disk. A background process periodically merged the on-disk files. The result was a highly scalable NoSQL database that could address the most data-rich and performance-intensive use cases.
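The local persistence steps described above (commit log, in-memory structure, dump to disk, background merge) can be sketched as a small Python toy. It is a minimal illustration under stated assumptions, not Cassandra's storage engine: the class `TinyStorageEngine` and its method names are hypothetical, Python lists stand in for files on disk, the flush threshold counts keys rather than bytes, and clearing the log after a flush is a simplification.

```python
class TinyStorageEngine:
    """Toy write path: commit log -> in-memory table -> on-disk files -> merge."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # assumption: counted in keys, not bytes
        self.commit_log = []   # stand-in for an append-only log file
        self.memtable = {}     # in-memory data structure
        self.disk_files = []   # stand-in for immutable on-disk files, oldest first

    def write(self, key, value):
        # 1. Record the write in the commit log first, for recoverability.
        self.commit_log.append((key, value))
        # 2. Only then update the in-memory structure.
        self.memtable[key] = value
        # 3. Dump to "disk" once the threshold is reached.
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the in-memory table out as one sorted, immutable file."""
        if not self.memtable:
            return
        self.disk_files.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        # Simplification: flushed writes no longer need the log.
        self.commit_log.clear()

    def read(self, key):
        """Check memory first, then files from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for disk_file in reversed(self.disk_files):
            for stored_key, value in disk_file:
                if stored_key == key:
                    return value
        return None

    def compact(self):
        """Background-style merge: combine all files, newest value wins."""
        merged = {}
        for disk_file in self.disk_files:  # oldest first, so newer entries overwrite
            merged.update(disk_file)
        self.disk_files = [tuple(sorted(merged.items()))] if merged else []


engine = TinyStorageEngine(flush_threshold=2)
engine.write("user:1", "alice")
engine.write("user:2", "bob")      # threshold reached: first file written
engine.write("user:1", "alicia")
engine.write("user:3", "carol")    # second file written
engine.compact()                   # both files merged into one
print(engine.read("user:1"))
print(len(engine.disk_files))
```

Writes stay cheap because they only append to the log and update memory; the cost of organizing data on disk is deferred to the flush and the periodic merge.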

Figure: Cassandra development milestones