RSS

Migrate a Relational Database into Cassandra (Part II – Northwind Planning)

July 29

This article describes how to fit a relational database structure containing an N:N relationship and several N:1 relationships into a Cassandra-friendly structure.

You should be familiar with the concepts in my “Migrate a Relational Database Structure into a NoSQL Cassandra Structure (Part I)” article before proceeding.

In this article we are going to examine the “Northwind” database shipped as an example with Microsoft Access.  Having a local copy of this database not needed to understand this article, but if you are interesting in obtaining a copy (and/or want to see how to extract it to JSON), visit this article and then return.

Northwind Database Schema

The relationships of the various tables that make up the Northwind Access database can be summed up like this:

Customer }                            { Supplier
Employee } 1:N Order N:N Products N:1 { Category

In other words:

  • Every order is tied to a single customer and a single employee
  • Every product is tied to a single supplier and a single category
  • Every order can have a variety of different products
  • Every product can be part of a variety of different orders

Notice that I’m ignoring the “N” part of the customer, employee, supplier and category relationships for now.  I’m also ignoring the “Order Details” table that provides the links necessary to implement the “N:N” relationship depicted above.

Instead, I’m focusing on the important “N:N” relationship between order and product.  In a relational database setting I can often simply normalize away and worry about which table I need to focus my indexing efforts on later when I’m working in the application.  However, in NoSQL, non-relational database design, we often need to decide up front which entity most queries will be interested in and build everything else around that entity.

So…will it be “order” or “product”?  Today I’ll decide that the key entity in this database is “order” – customers will be hitting this on a daily, per transaction basis whereas I can probably run my product reports offline.

Sketching Out a Viable Cassandra Schema for the Northwind Database

OK, so “order” is our key element.  Let’s see how we can squeeze that into a Cassandra-friendly data structure.  First, let’s review Cassandra’s keystore->column family->row->super column->column data structure.  (If you need more of a refresher, see this.)

{
  "keystore" : {
    "column family" : {
      "row key" : {
        "supercolumn key" : {
          "column name" : "value"
        }
      }
    }
  }
}

Remember that the first two levels of this structure (keystore->column family) will essentially be burned up with the name of the database and the name of table.  That leaves us one level (row) for each “order” entry and one level (supercolumn) for each subcomponent of each “order” entry (such as a block of employee information).  The remaining level (column) has to be used for the name/value pairs that contain our information.

That’s a tight fit, but not an impossible one.  Let’s get started on the side of the relational diagram that only contains N:1 relationships.

Customer }
Employee } 1:N Order

Again, we’re going to ignore the fact that a single customer can be tied to multiple orders and that a single employee can also be tied to multiple orders.  Instead, we’re going to focus on the fact that a single order has one customer and one employee – period.

Here’s a data structure that accommodates that line of thinking.

{
  "Northwind" : {
    "Orders" : {
      "[OrderID]" : {
        "OrderInformation" : {
          "column name" : "value"
        }
        "Employee" : {
          "column name" : "value"
        }
        "Customer" : {
          "column name" : "value"
        }
      }
    }
  }
}

So far, so good.  But where did that extra “OrderInformation” entry come from?  The answer is that we demoted the original information from the original order record so that column name/value pairs are always at the same level in our Cassandra database.  (More on that.)

Tackling the N:N Relationship

Now what should we do about the “N:N” relationship from orders to products?

                       { Supplier
Order N:N Products N:1 { Category

We could call each N:N Order:Product relationship an “ItemEntry” and consider doing something like this:

{
  "Northwind" : {
    "Orders" : {
      "[OrderID]" : {
        "[ItemEntry_ID]" : {
          "ItemInformation" : {
            "column name" : "value"
          }
          "Product" : {
            "column name" : "value"
            "Category" : {
               "column name" : "value"
             }
            "Supplier" : {
               "column name" : "value"
             }
          }
        }
      }
    }
  }
}

…but that would be illegal because it requires two more levels than Cassandra allows (e.g., “Product > Category”).

Instead we need a structure that looks more like this:

{
  "Northwind" : {
    "Orders" : {
      "[OrderID]" : {
        "[ItemEntryID]_ItemInformation" : {
            "column name" : "value"
          }
          "[ItemEntryID]_Product" : {
            "column name" : "value"
          }
          "[ItemEntryID]_ProductSupplier" : {
            "column name" : "value"
          }
          "[ItemEntryID]_ProductCategory" : {
            "column name" : "value"
          }
        }
      }
    }
  }
}

…or this:

{
  "Northwind" : {
    "Orders" : {
      "[OrderID]" : {
        "[ItemEntryID]" : {
            "ItemInformation_column1" : "value",
            "ItemInformation_column2" : "value",
            "Product_column1" : "value",
            "Product_column2" : "value",
            "ProductSupplier_column1" : "value",
            "ProductSupplier_column2" : "value",
            "ProductCategory_column1" : "value",
            "ProductCategory_column2" : "value"
          }
        }
      }
    }
  }
}

It looks good on paper but there’s an extra problem with the “Order Details” database we’re using to provide the N:N relationship in the Northwind database.  The extra problem is that there are no unique IDs on the Order Detail entries.  With that in mind, we’ll need to take special care during export/consolidations of legacy Order Detail data and with new entries in the future to automatically populate unique entries for each Order Detail as we add it.  (This isn’t a big deal – we could either go with unique counts – e.g., 1, 2, 3… – within each Order entry or commit to the use of GUIDs for the same purpose)  This exercise uses unique IDs added to the original Northwind JSON document through the use of the “ID4Tables” option in the ExportAccessDBToJSON utility.

With all that in place, here is our final, Cassandra-ready data structure with an “order”-centric focus and all the information from our original Northwind relational structure.

{
  "Northwind" : {
    "Orders" : {
      "[OrderID]" : {
        "OrderInformation" : {
          "column name" : "value"
        }
        "Employee" : {
          "column name" : "value"
        }
        "Customer" : {
          "column name" : "value"
        }
        "ItemEntry_[ItemEntryID]" : {
            "ItemInformation_column1" : "value",
            "ItemInformation_column2" : "value",
            "Product_column1" : "value",
            "Product_column2" : "value",
            "ProductSupplier_column1" : "value",
            "ProductSupplier_column2" : "value",
            "ProductCategory_column1" : "value",
            "ProductCategory_column2" : "value"
        }
      }
    }
  }
}

Next Steps

Next I will show exactly how to use DivConq utilities to implement our plan and convert our Northwind JSON to a Cassandra-ready format.

About Jonathan Lampe

Author of 44 articles on this blog.

I have about 12 years of experience designing secure, partially distributed systems (e.g., web farm with some extra load in one remote data center), often in industries such as finance and defense. My solutions and software are currently deployed in mission-critical roles at about 1000 enterprises worldwide. In the last 2-3 years I have turned my technical attention toward more geographically distributed systems and heterogeneous environments (e.g., a mix of operating systems or on-premises and cloud deployments) while maintaining my focus on good UX and great security.

 

Posted by on 2010-Jul-07 in Cassandra, Intermediate, Northwind

Comments Off

Comments are closed.