In this article, I will share my experience with Cassandra and show how you can manage big data effectively. Apache Cassandra is a high-performance, highly scalable, fault-tolerant (i.e., no single point of failure), distributed non-relational database. But Cassandra differs from SQL databases in some important aspects. If, like me, you come from the world of SQL databases, Cassandra’s data concept is hard to understand at first. It took me several weeks to do so. So let’s see what the difference is.
Data Management in RDBMS
In a relational database management system (RDBMS), data is organized in databases. A database is a big store on a server to which your data is written and from which it is read. You organize your data in tables and select it with the query language SQL. The size of your data matters little: you can append millions of rows to a single table, and with suitable indexes selecting specific data remains efficient.
I want to take the example of a document archive to show what this looks like in an RDBMS and in Cassandra. Let’s assume we have an archive of documents created by different authors on different days. In SQL we can create a table like this:
CREATE TABLE Archive ( DocId BIGINT NOT NULL, Author VARCHAR(255), Created DATE, Document BLOB, PRIMARY KEY (DocId) );
With this simple data schema we can store documents in the archive together with the meta information about the ‘Author’ and the date the document was ‘Created’. The documents themselves are stored as binary data in the column ‘Document’.
With SQL it is very easy to select data. For example, you can select a specific document…
SELECT * from Archive WHERE DocId=1;
… or you can select all documents created by Bob:
SELECT * FROM Archive WHERE Author='Bob';
In this setup, all the data is stored in one database on one server. Now let’s see how data is organized in Cassandra.
Data Management in Cassandra
Data management in Apache Cassandra is completely different from an RDBMS. Your data is no longer stored in one database on one server. It can be distributed across different data nodes on different servers, and it can be replicated to several nodes running in different locations. You don’t need to think much about this, as Cassandra manages the distribution of your data transparently.
The real difference is how the data is organized. In contrast to an RDBMS, data in Cassandra is not stored in one big database, but in a large number of separate units called ‘partitions’. These partitions are distributed over your cluster automatically by Cassandra. So it may happen that one part of your data is stored on a data node running in one data center and another part is stored on a data node in another data center at a different geographic location.
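To make the idea of automatic distribution more tangible, here is a minimal Python sketch of how a partition key could be mapped to a node. It is a simplification and not Cassandra’s actual implementation: Cassandra assigns token ranges on a ring to nodes, while this sketch just takes a stable hash modulo the number of nodes. The node names and the functions `node_for` and `place` are made up for illustration.

```python
import hashlib

# Hypothetical node names, for illustration only.
NODES = ["node-dc1-a", "node-dc1-b", "node-dc2-a"]


def node_for(partition_key, nodes=NODES):
    """Pick a node for a partition key using a stable hash.

    Simplification: a real cluster maps token ranges on a ring to nodes;
    here the hash is simply taken modulo the number of nodes.
    """
    digest = hashlib.md5(str(partition_key).encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return nodes[token % len(nodes)]


def place(partition_keys, nodes=NODES):
    """Group partition keys by the node that would store them."""
    placement = {node: [] for node in nodes}
    for key in partition_keys:
        placement[node_for(key, nodes)].append(key)
    return placement


if __name__ == "__main__":
    # Show where the partitions for DocId 1..10 would end up.
    for node, keys in place(range(1, 11)).items():
        print(node, keys)
```

The point of the sketch is only that the same key always lands on the same node, so a query that names the partition key can go straight to the right place.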
When you design your data schema you must define the so-called ‘Partition Key’, which determines the partition each row belongs to:
CREATE TABLE IF NOT EXISTS Archive ( DocId bigint, Author text, Created date, Document blob, PRIMARY KEY (DocId));
In this example it is the DocId which identifies your partition. So far this is very similar to SQL. To select a specific document you can use the following select statement in Cassandra:
SELECT * FROM Archive WHERE DocId=1;
So everything looks quite easy and similar to SQL. The problem with our table schema occurs when we try to select all documents created by the author ‘Bob’:
SELECT * FROM Archive WHERE Author='Bob'; => ERROR - Author is not a partition key!
This results in an error because Cassandra, by default, does not let you select data without knowing the partition key, and the column ‘Author’ is not a partition key. (Such a query would have to scan every partition on every node; Cassandra only runs it if you explicitly add ALLOW FILTERING, which is rarely a good idea on large tables.) To solve this we could, of course, change our partition key in the following way:
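CREATE TABLE IF NOT EXISTS Archive ( DocId bigint, Author text, Created date, Document blob, PRIMARY KEY (Author, DocId));
Now we can select all documents from ‘Bob’. The partition key is now Author; DocId stays in the primary key as a clustering column, so that each author’s partition can hold many rows, one per document. But here we run into a more serious problem with partitions in Cassandra.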
The Concept of Partitions
As explained before, in Cassandra your data is physically stored in partitions. Cassandra handles a partition as a unit, and a large partition puts pressure on memory whenever Cassandra operates on it. For that reason, as a rule of thumb, a partition should not be much larger than about 100 MB. If we assume that each author creates on average 100 documents or more and that a document has on average about 20 MB, then our partitions grow to 2 GB or more. This will result in a performance issue that we should avoid. Also, with the changed partition key we are no longer able to select one document by its DocId alone, because the DocId is not part of our partition key.
To get rid of this problem, think of Cassandra’s tables and their partitions in the way you think of indexes in SQL. If you want to select all documents created by an author, you need to split your data into two tables. See the following new data schema:
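The arithmetic behind this estimate is simple enough to write down. The following Python sketch uses only the numbers from the paragraph above (100 documents per author, about 20 MB per document, 100 MB as the rule of thumb); the function name `partition_size_mb` is made up, and real partitions carry some additional overhead that is ignored here.

```python
PARTITION_GUIDELINE_MB = 100  # rule of thumb quoted in the text


def partition_size_mb(rows_per_partition, avg_row_mb):
    """Rough partition size: number of rows times average row size."""
    return rows_per_partition * avg_row_mb


# Partition key = Author: about 100 documents of about 20 MB each.
by_author_mb = partition_size_mb(100, 20)

# Partition key = DocId: exactly one document per partition.
by_doc_id_mb = partition_size_mb(1, 20)

if __name__ == "__main__":
    print("by author:", by_author_mb, "MB")
    print("by DocId:", by_doc_id_mb, "MB")
    print("by author exceeds guideline:", by_author_mb > PARTITION_GUIDELINE_MB)
```

With the author as partition key, a partition comes out at 2000 MB, twenty times the guideline. With the DocId as partition key, it stays at 20 MB.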
CREATE TABLE IF NOT EXISTS Archive ( DocId bigint, Created date, Document blob, PRIMARY KEY (DocId));
CREATE TABLE IF NOT EXISTS DocumentsByAuthor ( Author text, DocId bigint, PRIMARY KEY (Author, DocId));
The first table holds our binary data, one document per partition. The second table is more like an index and can be used to select the DocIds for a specific author. Its partition key is Author, and DocId is a clustering column, so that one author can have many rows. For both tables the partition size is now small enough that we no longer run into memory or performance issues. An author can create a very large number of documents before this partition becomes too large, because each row holds only the author name and a DocId.
The only thing that is unusual compared to SQL is the fact that we now need more than one select statement if we want to collect all documents created by an author:
// first select all ids created by Author Bob
SELECT * FROM DocumentsByAuthor WHERE Author='Bob';
// next iterate over all ids and select the data....
SELECT * FROM Archive WHERE DocId=1;
SELECT * FROM Archive WHERE DocId=5;
...
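Several statements are needed because the concept of JOINs does not exist in Cassandra. The reason is the same as for the limit on the size of a single partition: every query should be answered from one small partition, while a join would have to combine data that is spread over many partitions and nodes.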
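In application code, this two-step lookup could look roughly like the following Python sketch. It does not talk to Cassandra at all: two plain dictionaries stand in for the tables Archive and DocumentsByAuthor, and the helper names `store_document` and `documents_for` as well as the sample data are made up for illustration. The sketch also shows the other side of the design: without JOINs, the application itself has to write to both tables.

```python
# In-memory stand-ins for the two tables; all sample data is made up.
archive = {}              # DocId -> row (plays the role of table Archive)
documents_by_author = {}  # Author -> list of DocIds (table DocumentsByAuthor)


def store_document(doc_id, author, created, document):
    """Write to both tables, since the application keeps them in sync."""
    archive[doc_id] = {"doc_id": doc_id, "created": created, "document": document}
    documents_by_author.setdefault(author, []).append(doc_id)


def documents_for(author):
    """Step 1: look up the ids for the author. Step 2: fetch each document."""
    doc_ids = documents_by_author.get(author, [])
    return [archive[doc_id] for doc_id in doc_ids]


store_document(1, "Bob", "2021-03-01", b"first")
store_document(5, "Bob", "2021-03-02", b"second")
store_document(7, "Alice", "2021-03-02", b"third")

if __name__ == "__main__":
    for row in documents_for("Bob"):
        print(row["doc_id"], row["created"])
```

Each access in `documents_for` touches exactly one key, which mirrors the fact that each Cassandra query above names exactly one partition key.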
There are various tricks and best practices for designing your table schema in Cassandra. And the concept is not so hard to learn if you keep in mind that the partition is the technical core concept which makes the database so scalable. With the data schema above we can easily store millions or even billions of documents in a Cassandra cluster. A cluster with several nodes has no single point of failure, and the data is distributed automatically by Cassandra. Reading data by partition key is also extremely fast. The same things are hard to achieve with an RDBMS running on a single data node. See also my blog post “Cassandra – How to Handle Large Media Files”.
If you want to learn more about the concepts of handling big data efficiently, I can recommend the book “Designing Data-Intensive Applications” by Martin Kleppmann. This book explains very well why we need to rethink how we deal with big data in a world where effective data management becomes a core concept of modern applications.