
Migrate a Relational Database Structure into a NoSQL Cassandra Structure (Part I)


This article begins to explore how to migrate a relational database structure (tables linked by foreign keys) into a NoSQL database structure that can be used in Cassandra.

If you do not know what Cassandra is, why NoSQL and Cassandra are important technologies or what JSON is and why you should know it, please click the links in this sentence to learn more about each topic before proceeding.

The Original Relational Database Structure

We are going to start with a very simple 1:N relational database structure. Our first two tables are “forests” and “famoustrees”. Here is our data in tabular format:

forests:

forestID    name            trees          bushes
forest003   Black Forest    two million    three million
forest045   100 Acre Woods  four thousand  five thousand
forest127   Lonely Grove    none           one hundred

famoustrees:

treeID      forestID    name              species
tree12345   forest003   Der Tree          Red Oak
tree12399   forest045   Happy Hunny Tree  Willow
tree32345   forest003   Das Ubertree      Blue Spruce

“famoustrees” is linked to “forests” using the “forestID” foreign key. Notice that there are no famous trees in the “Lonely Grove” forest, one famous tree in the “100 Acre Woods”, and two famous trees in the “Black Forest”.
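
(If it helps to see the same rows in code, here is a minimal Python sketch of the two tables as lists of dictionaries. The “treeID” column name is my own guess at the famoustrees primary key column; everything else comes straight from the data above.)

# The two relational tables as plain Python rows.
# "treeID" is an assumed name for the famoustrees primary key column.
forests = [
    {"forestID": "forest003", "name": "Black Forest",   "trees": "two million",   "bushes": "three million"},
    {"forestID": "forest045", "name": "100 Acre Woods", "trees": "four thousand", "bushes": "five thousand"},
    {"forestID": "forest127", "name": "Lonely Grove",   "trees": "none",          "bushes": "one hundred"},
]

famoustrees = [
    {"treeID": "tree12345", "forestID": "forest003", "name": "Der Tree",         "species": "Red Oak"},
    {"treeID": "tree12399", "forestID": "forest045", "name": "Happy Hunny Tree", "species": "Willow"},
    {"treeID": "tree32345", "forestID": "forest003", "name": "Das Ubertree",     "species": "Blue Spruce"},
]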

If we were to represent the data in our database – call it our “biologicalfeatures” database – in JSON, it would look like this:

{
  "biologicalfeatures":
    {
    "forests" :
      {
      "forest003" :
        {
          "name" : "Black Forest",
          "trees" : "two million",
          "bushes" : "three million"
        },
      "forest045" :
        {
          "name" : "100 Acre Woods",
          "trees" : "four thousand",
          "bushes" : "five thousand"
        },
      "forest127" :
        {
          "name" : "Lonely Grove",
          "trees" : "none",
          "bushes" : "one hundred"
        }
      },
    "famoustrees" :
      {
      "tree12345" :
        {
          "forestID" : "forest003",
          "name" : "Der Tree",
          "species" : "Red Oak"
        },
      "tree12399" :
        {
          "forestID" : "forest045",
          "name" : "Happy Hunny Tree",
          "species" : "Willow"
        },
      "tree32345" :
        {
          "forestID" : "forest003",
          "name" : "Das Ubertree",
          "species" : "Blue Spruce"
        }
      }
    }
}

Denormalizing the Tables

To collapse the famoustrees table into our forests table, we need to move each famoustree entry underneath its forest entry. We can also remove the “forestID” foreign key from each famoustree entry – we don’t need it anymore.

However, we should retain the type of each famoustree entry we moved into the forest entry.  We can do this by adding an extra “type” value to each entry.

Finally, we could break out the original non-ID information in each forest entry into a typed section too.  We’ll tag each of these sections with a new ID of “generalinfo”.  (This is a Cassandra-friendly convention – we’ll get into this more below.)
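
To make those three steps concrete, here is a rough Python sketch (my own throwaway helper, not one of the DivConq utilities mentioned later) that takes the forests and famoustrees rows from the earlier sketch and applies the steps: nest each famoustree entry under its forest, drop the “forestID” foreign key, tag each moved entry with a “type” value, and tuck the forest’s own columns under “generalinfo”. Printing the result should reproduce the JSON shown just below.

import json

def denormalize(forests, famoustrees):
    # Start with the outer biologicalfeatures -> forests wrapper.
    result = {"biologicalfeatures": {"forests": {}}}
    by_forest = result["biologicalfeatures"]["forests"]

    # Each forest's own columns go under a "generalinfo" section.
    for forest in forests:
        info = {k: v for k, v in forest.items() if k != "forestID"}
        by_forest[forest["forestID"]] = {"generalinfo": info}

    # Each famous tree is nested under its forest; the foreign key is
    # dropped and replaced with a "type" marker.
    for tree in famoustrees:
        by_forest[tree["forestID"]][tree["treeID"]] = {
            "type": "famoustree",
            "name": tree["name"],
            "species": tree["species"],
        }

    return result

print(json.dumps(denormalize(forests, famoustrees), indent=2))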

Represented in JSON, our data now looks like this:

{
  "biologicalfeatures":
    {
    "forests" :
      {
      "forest003" :
        {
        "generalinfo" :
          {
          "name" : "Black Forest",
          "trees" : "two million",
          "bushes" : "three million"
          },
        "tree12345" :
          {
            "type" : "famoustree",
            "name" : "Der Tree",
            "species" : "Red Oak"
          },
        "tree32345" :
          {
            "type" : "famoustree",
            "name" : "Das Ubertree",
            "species" : "Blue Spruce"
          }
        },
      "forest045" :
        {
        "generalinfo" :
          {
          "name" : "100 Acre Woods",
          "trees" : "four thousand",
          "bushes" : "five thousand"
          },
        "tree12399" :
          {
            "type" : "famoustree",
            "name" : "Happy Hunny Tree",
            "species" : "Willow"
          }
        },
      "forest127" :
        {
        "generalinfo" :
          {
          "name" : "Lonely Grove",
          "trees" : "none",
          "bushes" : "one hundred"
          }
        }
      }
    }
}

Ready for Cassandra?

There are really only two types of JSON data structures that can be imported directly into Cassandra. One is the
keyspace->columnfamily->rowkey->column
data structure shown below:

{
  "keystore":
    {
    "columnfamily" :
      {
      "rowkey" :
        {
          "column name" : "column value"
        }
      }
    }
}

Add another layer and you get the other supported data structure
keyspace->columnfamily (a.k.a. “super column family”)->rowkey->supercolumn (whose nested columns are sometimes called “subcolumns”)->column
shown below:

{
  "keystore":
    {
    "columnfamily" :
      {
      "rowkey" :
        {
        "supercolumn" :
          {
          "column name" : "column value"
          }
        }
      }
    }
}

That’s it: if you can get your data to fit into one of those two JSON structures, your data is ready to be loaded into Cassandra.

You probably suspect that I wouldn’t have taken you this far if our forests data wasn’t ready for Cassandra, but please take a moment to scroll up and see if you can figure out whether our denormalized forests data uses supercolumns or not.

Let’s break it down:
biologicalfeatures -> forests
…matches the keyspace->columnfamily structure used by both supported JSON structures.

As for the rest:
forest003 -> generalinfo -> (name=”Black Forest”)
…matches the rowkey->supercolumn->column structure used by the supercolumn-based JSON structure.

So, yes, we had to use supercolumns to denormalize the forests and famoustrees tables properly.
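
If you would rather check this mechanically than by eyeballing the JSON, one crude trick is to count nesting levels: the plain structure is four dictionary levels deep (keyspace, columnfamily, rowkey, column) and the supercolumn structure is five. Here is a small Python sketch of that idea, reusing the denormalize() helper and rows from the earlier sketches (again my own helper, not part of any Cassandra tooling); it assumes every branch of the JSON is nested to the same depth:

def nesting_depth(node):
    # Number of dictionary levels down to the "column name": "column value" pairs.
    if not isinstance(node, dict) or not node:
        return 0
    return 1 + max(nesting_depth(value) for value in node.values())

def structure_type(doc):
    depth = nesting_depth(doc)
    if depth == 4:
        return "keyspace->columnfamily->rowkey->column"
    if depth == 5:
        return "keyspace->columnfamily->rowkey->supercolumn->column"
    return "not directly importable"

print(structure_type(denormalize(forests, famoustrees)))
# -> keyspace->columnfamily->rowkey->supercolumn->column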

Next Steps

Next we’ll perform this type of analysis on the Northwind JSON structure exported in a previous article.

Doing this type of denormalization by hand would be a large PITA, so DivConq created a utility to do it automatically. The article after that shows how to use that DivConq utility and a few more like it to complete the conversion of the Northwind JSON into Cassandra-ready format.

Soon after that I will also cover how to import the final JSON data directly into Cassandra – stay tuned!

About Jonathan Lampe


I have about 12 years of experience designing secure, partially distributed systems (e.g., web farm with some extra load in one remote data center), often in industries such as finance and defense. My solutions and software are currently deployed in mission-critical roles at about 1000 enterprises worldwide. In the last 2-3 years I have turned my technical attention toward more geographically distributed systems and heterogeneous environments (e.g., a mix of operating systems or on-premises and cloud deployments) while maintaining my focus on good UX and great security.

 
