The primary theme in the NoSQL camp is “big data” while the primary theme in the SQL camp is very strict adherence to ACID. I think both themes fall short of the true challenge we face in modern which is distributed computing, specifically multi-data center active/active applications where a failure in one center will not prevent other data centers from operation.
Many NoSQL databases have some support for database replication so at first glance NoSQL may appear to be the stronger candidate for distributed computing. This is likely further reinforced by the very strict ideals SQL developers embrace which feel limiting to distributed developers. None-the-less, I have been thinking lately about adopting RDBMS design into distributed applications.
The first challenge for a strict SQL developer is accepting that committed data will not always be the data read by an application. Imagine a distributed network of database nodes, each node with a complete copy of the database and each node is live. An application writes a new value to database node A. Soon after, before the data has replicated from node A to node B, another application updates the same record. Once the replication occurs that original update (on A) will (probably) be lost completely.
There is a desire in the SQL developer to believe that once data is committed that it will never be lost, because subsequent updates would first read the new value or because subsequent updates will update only the relevant fields.
Let’s try to get a real world example of what this problem is. Imagine a medical records database with a Patient table. We’ll look at just a few values in the database for the patient Katie Ronin.
Id Name Primary Phone Last Vitals Blood Pressure
6388 Katie Ronin 111.222.6666 May 18, 2010 122/79
Now lets say that a few weeks ago Katie filled in a form for her insurance company and on that form she updated her phone number to 111.555.3333. Today (Dec 12, 2010) a data entry technician is entering her new phone number from the form into the database. Also today Katie saw her doctor and her vitals where taken. Her blood pressure is being entered by the nurse as 119/78.
Unfortunately both the nurse and technician enter the data at the exact same time. With a single database node, and when using transactions, it is reasonable to suppose that there will be no problem. Hopefully the technician’s update will look like this:
UPDATE Patient SET Phone='111.555.3333' WHERE Id = 6388
And the nurses update will look like this:
UPDATE Patient SET LV='Dec 12, 2010', PB='119/78' WHERE Id = 6388
So neither update will be lost, nor will the updates collide because of Atomicity and Isolation (from ACID).
But what is this happens in a distributed database? Is there are problem? Probably there is, and here is why.
- If we want database nodes to be live/active while isolated (e.g. due to network errors) then we will probably end up a requirement to replay transactions in any order. More on this in a future blog.
- If we need to replay transactions in any order then our replication mechanism will copy snapshots of entire records, not just the changed fields
- This means that updates can be lost, see below.
So again with our example, only this time the update from the technician is on database node 204 while the update from the nurse is on node 71. After the update, but before the replication, here is what the database for the two nodes look like:
Id TxStamp Name Primary Phone Last Vitals Blood Pressure
6388 20101212T104452204 Katie Ronin 111.555.3333 May 18, 2010 122/79
And
Id TxStamp Name Primary Phone Last Vitals Blood Pressure
6388 20101212T104452071 Katie Ronin 111.222.6666 Dec 12, 2010 119/78
After the replication, which again copies the entire record, the final record will be the one with the phone number update because that is the higher TxStamp. The vitals data has been lost.
I’ll go into more detail on replication issues in coming blogs, but what I’ll focus on even more are strategies to avoid these issues. Among these will be strategies for distributed database normalization and distributed aggregate data. Ultimately with some good planning we can find a workable solution to the above problem while still using RDBMS technologies.
NoSQL, such as Cassandra, would give us an advantage in the problem demonstrated above. However, it may not help with other problems we will discuss. As such I’ll be making references to NoSQL concepts when showing some solutions for SQL developers.