RSS
 

Migrate a Relational Database into Cassandra (Part III – Northwind Conversion)

16 Aug

This article describes how to convert a JSON document containing the complete contents of the Northwind Access database into Cassandra-friendly JSON.

Before proceeding, you should understand my previous “Part 2″ article on “Northwind Planning” – this article implements that plan.

You should also have downloaded and installed either the .NET version or Mono version of the DivConq JSON command-line utilities, and should also have a complete JSON document from a conversion of the Northwind database export.  (You can also start with the “NorthwindJSON.txt” document from this archive.)

What To Do?

Copy the following batch file off into the folder that contains your DivConq command-line utilities and your Northwind JSON document.  Then open it using your favorite text editor and skip down to the “Understanding the Flow” section below.

@echo off
rem xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
rem X
rem X This batch file uses several divconq.com utilities. First it exports an Access DB
rem X called "Northwind" into a single large JSON structure.  Then it manipulates the
rem X JSON structure until it conforms to a denormalized form suitable for insertion
rem X into a Cassandra DB.  Finally, it imports the JSON structure into a Cassandra DB.
rem X
rem X August 11, 2010 - Jonathan Lampe - jonathan.lampe@divconq.com
rem X
rem xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
rem .
rem Here is the basic schema of the original DB
rem .
rem Categories
rem Customers
rem Employees
rem Order Details
rem   ProductID
rem   OrderID
rem Orders
rem   CustomerID
rem   EmployeeID
rem Products
rem   SupplierID
rem   CategoryID
rem Suppliers
rem .
rem xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
rem .
rem Here is the data schema represented with 1:N and N:1 relationships
rem
rem    Customer }                             { Supplier
rem    Employee } 1:N Orders N:N Products N:1 { Category
rem
rem ...where the N:N relationship is expressed in the "Order Details" table
rem
rem xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
rem .
rem Here is the denormalized, Cassandra-friendly schema that is constructed:
rem .
rem Keyspace=Northwind
rem   ColumnFamily=Orders
rem     RowKey=OrderID
rem       1 SuperColumn="OrderInformation"
rem         Columns=(contents from one Orders entry)
rem       1 SuperColumn="Customer"
rem         Columns=(contents from one Customers entry)
rem       1 SuperColumn="Employee"
rem         Columns=(contents from one Customers entry)
rem       0-N SuperColumns="ItemEntry_"+[NumericCount]+"
rem         Columns=(contents from one Order Details entry) +
rem                 (contents from one Product entry)
rem                 (contents from one Supplier entry)
rem                 (contents from one Category entry)
rem .
rem xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
rem .
rem Bring in 0-N item entries per order, each as a new supercolumn
consolidatejsonobjects NorthwindJSON.txt JSONout.txt Orders "Order Details" ItemEntry_[ID] "OrderID" yes yes 4
rem .
rem Bring in exactly one employee per order as a new supercolumn
distributejsonobjects JSONout.txt JSONout2.txt Orders "Employees" Employee "EmployeeID" yes yes 4
rem .
rem Bring in exactly one customer per order as a new supercolumn
distributejsonobjects JSONout2.txt JSONout3.txt Orders "Customers" Customer "CustomerID" yes yes 4
rem .
rem Bring in exactly one product entry per item entry as new columns in existing supercolumn
distributejsonobjects2 JSONout3.txt JSONout4.txt Orders ItemEntry_* Products Product_ ProductID yes yes 4
rem .
rem Bring in one supplier entry and one category entry per item entry as new columns in existing supercolumn
rem   (note these depend on previously imported Product entries)
distributejsonobjects2 JSONout4.txt JSONout5.txt Orders ItemEntry_* Suppliers Supplier_ Product_SupplierID yes yes 4
distributejsonobjects2 JSONout5.txt JSONout6.txt Orders ItemEntry_* Categories Category_ Product_CategoryID yes yes 4
rem .
rem Demote remaining rowkey-columns into a new supercolumn
rem .
demotejsonpairsintonewobject JSONout6.txt JSONout7.txt Orders OrderInformation yes yes 4

Understanding the Flow

The top part of the batch file simply reinterates what we planned in the previous article.  The end is where all the processing actually takes place.  Within the “rem”ed lines there are 7 JSON conversions, each performing a single piece of the consolidation.

Step 1: We open the list of “Order Details” and pull in zero, one or multiple entries as new “ItemEntry_[ID]” nodes.

Steps 2 and 3: We reach out and import exactly one Customer node and one Employee node into our master Order record.

(Note that Steps 1, 2 and 3 can really occur in any order because these three entities are completely independent.  However, “Order Details” must have been imported before Step 4 can take place.)

Step 4: We reach out and import the contents of exactly one Product node into each existing ItemEntry node by looking up ProductIDs from existing ItemEntry name/value pairs.

(Note that Step 4 – the Product import – must take place before Steps 5-6 can take place.)

Steps 5-6: We reach out and import the contents of exactly one Category node and exactly one Supplier node into each existing ItemEntry node by looking up CategoryIDs and SupplierIDs from recently imported ItemEntry > Product information.

Step 7: Consolidate the remaining entries under the main Order entry into a new OrderInformation node.

(Step 7 – the consolidation of hanging entries – should always go last in case previous steps depend on keys in their original positions.)

To see complete sample output from each step of the conversion (as generated by the .NET utilities), unpack the archive here.

Running the Batch File

You may need to change the name of the original “NorthwindJSON.txt”  and/or may want to change the name of the final output file (from “JSONout7.txt”), but other than that the batch file should be ready to run.  To capture log and error output so it’s easier to read, use this command:

D:\divconq\dotnet>ConsolidateNorthwindJSON.bat > log.txt 2> err.txt

Next Steps

The next article in this series shows how to import the complete JSON document into Cassandra.

More About The Utilities

ConsolidateJSONObjects is used to dive into a foreign table used to express an N:N relationship from the point of view of one entity in the relationship.  It generates 0-N supercolumn-level JSON nodes for each foreign node discovered.  This utility is used once during the Northwind conversion to bring in 0-N item entries per order.

consolidatejsonobjects NorthwindJSON.txt JSONout.txt Orders "Order Details" ItemEntry_[ID] "OrderID" yes yes 4

DistributeJSONObjects and DistributeJSONObjects2 are used to select foreign keys in the main table and look them up against a foreign table: it works N:1 relationships and brings in exactly one node for each foreign key scanned (assuming DB consistency, of course).   The difference between these utilities is that DistributeJSONObjects imports discovered nodes as supercolumn-level nodes while DistributeJSONObjects2 imports the contents of discovered nodes as column-level name-value pairs under an existing supercolumn-level node.  These utilities are used five times during the Northwind conversion: two invocations of DistributeJSONObjects to import 1-per-order Employee and Customer nodes as supercolumns and three invocations of DistributeJSONObjects2 to import the contents of 1-per-itementry Product, Supplier and Category nodes.

rem Bring in exactly one employee per order as a new supercolumn
distributejsonobjects JSONout.txt JSONout2.txt Orders "Employees" Employee "EmployeeID" yes yes 4
rem .
rem Bring in exactly one customer per order as a new supercolumn
distributejsonobjects JSONout2.txt JSONout3.txt Orders "Customers" Customer "CustomerID" yes yes 4
rem .
rem Bring in exactly one product entry per item entry as new columns in existing supercolumn
distributejsonobjects2 JSONout3.txt JSONout4.txt Orders ItemEntry_* Products Product_ ProductID yes yes 4
rem .
rem Bring in one supplier entry and one category entry per item entry as new columns in existing supercolumn
rem   (note these depend on previously imported Product entries)
distributejsonobjects2 JSONout4.txt JSONout5.txt Orders ItemEntry_* Suppliers Supplier_ Product_SupplierID yes yes 4
distributejsonobjects2 JSONout5.txt JSONout6.txt Orders ItemEntry_* Categories Category_ Product_CategoryID yes yes 4

DemoteJSONPairsIntoNewObject is simply used to clean up lingering columns (name/value pairs) at the “row” level and put them into a new supercolumn-level object.   One invocation of this utility is used during the Northwind conversion.

demotejsonpairsintonewobject JSONout6.txt JSONout7.txt Orders OrderInformation yes yes 4

All utilities will display additional help, including complete documentation for each parameter at the command-line if invoked without any command-line parameters.  (e.g., simply execute “consolidatejsonobjects.exe” to find out what the eighth parameter – “yes” in the example above – means)

Troubleshooting

If you are experimenting with these utilities and data sets of any size it is highly recommended that you put all your commands in a batch file.  Then, from the command prompt, run your batch file like this:

D:\divconq\dotnet>divconq_test23.bat > log.txt 2> err.txt

…to capture logging output and error output in two separate files.

About Jonathan Lampe

Author of 44 articles on this blog.

I have about 12 years of experience designing secure, partially distributed systems (e.g., web farm with some extra load in one remote data center), often in industries such as finance and defense. My solutions and software are currently deployed in mission-critical roles at about 1000 enterprises worldwide. In the last 2-3 years I have turned my technical attention toward more geographically distributed systems and heterogeneous environments (e.g., a mix of operating systems or on-premises and cloud deployments) while maintaining my focus on good UX and great security.

Published on Monday August 16, 2010 at 08:53am

 
Comments Off

Posted in Cassandra, Intermediate, JSON, Northwind

 

Comments are closed.