In my previous two posts – Simple, Secure and Speedy part one and part two – I pointed out that one of the most limiting factors in scaling web applications is the database. So while we could keep it simple and reasonably secure using commonly used components (PHP, MySQL, lighttpd, Nginx), the scalability story is where the issues begin. In part two I pointed out that both the application servers and the reverse proxy servers could scale fairly well. But what about the database?
Architecturally there are three topics worth considering when looking to scale a database. First is performance, second is local availability, and third is global availability. Clustering a relational database, such as MySQL, does provide performance and local availability – but it does not provide global availability. In other words, it does not span data centers and geography.
There are solutions for spanning data centers with relational databases – SQL Server has some replication features to help, as do a few other projects. From a cloud perspective Amazon Relational Database Service is an interesting solution to the problem.
While I’m glad there are possible solutions out there, my main issue is that relational databases really are not designed to be distributed. I think we can do better. As a veteran of nosql (aka post-relational) database development I think that ‘nosql’ style databases will be easier to develop on and deploy/manage in a distributed (multi-data center / global) world. Especially if you want to maintain provider neutrality or if you want on-premises (or hybrid) deployments.
BTW, if you think nosql hasn’t been around long, think again: MUMPS was an excellent example of nosql concepts and is alive and well in the form of GT.M and at the roots of InterSystems Caché®. There is even multi-data center support. As much as I appreciate these projects, and the many other nosql projects out there, in the end I think Apache Cassandra has some of the best ideas in terms of its balance of speed and accuracy and its overall good design.
As such, you’ll be seeing many more blog posts about Cassandra. Do I see any drawbacks to Cassandra? Well, yes. The biggest is that secondary index support is all manual, but that is par for the course with nosql. There is also the lack of stored procedures, though Hadoop and Pig kind of address that. Other than that, my biggest gripe is the poor naming of the data model: the words ‘columns’, ‘rows’ and ‘super’. If you feel this way too and are still trying to get Cassandra’s data model, then consider this revised naming scheme. Once you get it, moving back to the columns/super columns terminology may be easier.
Here is the short and sweet intro to Cassandra with the revised names:
Cassandra can hold one or more data stores per node. Each data store is named. Each data store has one or more families of data. Each family contains zero or more family keys. A family key is always a string value. When present, a family key contains one or more field keys, each of which holds a value. The data stores and the families they contain must be given a name in the storage-conf.xml file. Furthermore, the data type of the field key must be declared.
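Why does the field key's type need declaring? As I understand it, Cassandra keeps the field keys under a family key in sorted order, and the declared type decides how two field keys compare. The Python sketch below only illustrates that idea with made-up field keys; it does not use Cassandra or any of its type names.

```python
# Hypothetical field keys stored under a single family key.
field_keys = ["10", "9", "100"]

# If the field key type is textual, keys compare character by character.
as_text = sorted(field_keys)

# If the field key type is numeric, the same keys compare as integers.
as_numbers = sorted(field_keys, key=int)

print("text order:   ", as_text)
print("numeric order:", as_numbers)
```

The same three keys come back in a different order depending on the declared type, which is why the choice has to be made up front in the config.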
There are two kinds of families: Standard and Extra. Using the Cassandra CLI, you can write to each kind of family as follows:
Standard Family
set <store_name>.<family name>['<family key>']['<field key>'] = '<value>'
Extra Family
set <store_name>.<family name>['<family key>']['<extra key>']['<field key>'] = '<value>'
Again – the ‘<store_name>.<family name>’ part needs to be declared in the config. The data type of the family key is always string. The data type of the field key (and the extra key, if present) must also be declared.
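To make the two shapes concrete, here is a toy in-memory sketch in Python. It is not a Cassandra client: the Store class, its set method, and the store and family names are all invented for illustration, and only the nesting mirrors the CLI commands above.

```python
class Store:
    """Toy model of one data store (hypothetical; nothing here talks to Cassandra)."""

    def __init__(self, name, families):
        # families maps a family name to "Standard" or "Extra",
        # standing in for the declarations required in the config.
        self.name = name
        self.kinds = dict(families)
        self.data = {family: {} for family in families}

    def set(self, family, family_key, *keys_and_value):
        kind = self.kinds[family]  # KeyError if the family was never declared
        if not isinstance(family_key, str):
            raise TypeError("a family key is always a string")
        *path, value = keys_and_value
        # Standard: [field key]; Extra: [extra key][field key]
        expected = 1 if kind == "Standard" else 2
        if len(path) != expected:
            raise ValueError(f"{kind} family takes {expected} key(s) after the family key")
        node = self.data[family].setdefault(family_key, {})
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value


store = Store("blog", {"posts": "Standard", "comments": "Extra"})
store.set("posts", "post-1", "title", "Hello")
store.set("comments", "post-1", "comment-1", "author", "alice")
print(store.data)
```

A Standard family is two levels of lookup below the family name, and an Extra family is three; that one extra level of nesting is the only structural difference between them.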
The definitions look something like this:
<Store Name="<your store name>">
  <Family Name="<your family name>" CompareWith="<field key type>" />
  <Family Name="<your family name>" Type="Extra" CompareWith="<extra key type>" CompareSubcolumnsWith="<field key type>" />
</Store>
That’s it. There are no other family types, no other configuration to worry about – very simple once you take the words ‘column’, ‘super’ and ‘row’ out of the mix. Look back at the various descriptions of the Cassandra data model and see if there is more clarity now.
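For that look back, a small lookup table from the revised names to Cassandra's own terms may be handy. The mapping below is my reading of how the two vocabularies line up, written as a Python sketch; the dictionary and helper names are just for illustration.

```python
# Revised name -> the term used in the usual Cassandra descriptions.
REVISED_TO_CASSANDRA = {
    "data store": "keyspace",
    "family": "column family",
    "Extra family": "super column family",
    "family key": "row key",
    "extra key": "super column name",
    "field key": "column name",
}


def describe(revised_term):
    """Return a one-line 'revised -> original' description."""
    return f"{revised_term} -> {REVISED_TO_CASSANDRA[revised_term]}"


for term in REVISED_TO_CASSANDRA:
    print(describe(term))
```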
Also, these diagrams might help with the terminology – thanks to Chaker Nakhli for that helpful article.
On this blog we will provide some real world examples of smallish applications modeling data in Cassandra.