
Category Archives: Intermediate

Migrate a Relational Database into Cassandra (Part IV – Northwind Import)

This article shows how to prepare and import a dataset expressed in Cassandra-friendly JSON into a Cassandra datastore using Cassandra’s “json2sstable” utility.

Before proceeding, you should understand my previous “Part III” article on “Northwind Conversion” – this article imports the JSON dataset created in that article.

You should also have downloaded and installed either the .NET version or Mono version of the DivConq JSON command-line utilities, and should also have a complete JSON document from a conversion of the Northwind database export. (You can also start with the “JSONout7.txt” document from this archive.)

Cleaning the Data

So far everything we’ve done has simply moved data around.  This has led to a JSON structure that contains everything and then some from the original relational database.  We could import that, but from here on out we’ll treat this data more like a traditional data warehouse by only working with a subset of the original data.

To do the stripping, we can use a DivConq utility called “StripNodesFromJSON”.   The following batch snippet cuts out extra nodes (like “Shippers”) and tags (like “Phone”).

rem Let us turn this structure into something we can use in a data warehouse
rem Strip a lot of the extra tags out
rem x Get rid of the Shippers node
StripNodesFromJSON JSONout7.txt JSONout8.txt Shippers
rem x Get rid of extra nodes from the Employee node
StripNodesFromJSON JSONout8.txt JSONout9.txt PostalCode Photo Address ReportsTo HireDate HomePhone Notes BirthDate Extension
rem x Get rid of extra nodes from the Customer node
StripNodesFromJSON JSONout9.txt JSONout10.txt City Phone Region ContactTitle Address PostalCode Fax ContactName
rem x Get rid of extra nodes from the ItemEntry nodes
StripNodesFromJSON JSONout10.txt JSONout11.txt OrderID Product_UnitPrice Product_UnitsInStock Product_QuantityPerUnit Product_ReorderLevel Supplier_City Supplier_Region "Order Details_AutoID" Product_CategoryID Supplier_ContactTitle Supplier_ContactName Product_Discontinued Supplier_HomePage Supplier_PostalCode Supplier_Address Category_CategoryID  Category_Picture Category_Description Supplier_Fax Supplier_Phone
rem x Get rid of extra nodes from OrderInformation
StripNodesFromJSON JSONout11.txt JSONout12.txt OrderID ShipPostalCode ShipCountry CustomerID EmployeeID

If you do a directory listing on the intermediate files created in this batch file you should see that each one is smaller than the one before it.
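If you want to see the idea behind this step in code form, here is a minimal Python sketch of the same operation – recursively deleting named keys from a JSON tree.  The DivConq utility is a compiled .NET program, so this is only an illustration of the operation, not its implementation; the file names are the ones from the batch snippet above.

import json

def strip_nodes(node, names):
    # Recursively drop any dictionary key whose name is in `names`.
    if isinstance(node, dict):
        return {k: strip_nodes(v, names) for k, v in node.items() if k not in names}
    if isinstance(node, list):
        return [strip_nodes(v, names) for v in node]
    return node

with open("JSONout7.txt") as f:
    data = json.load(f)

with open("JSONout8.txt", "w") as f:
    json.dump(strip_nodes(data, {"Shippers"}), f, indent=2)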

Cassandra’s JSON2SStable Format

If you’ve worked with Cassandra’s SStable2JSON utility, you’ve seen that the format Cassandra uses for its JSON datasets is technically JSON but far from human-readable.

Cassandra’s SStable2JSON utility will export plain (no supercolumn) Column Families like this:

{
  "HotWheelsCar": [
    ["5072696365", "312e3439", 1278132336497000, false],
    ["53656374696f6e", "56656869636c6573", 1278132515996000, false]
  ],
  "GumDrop": [
    ["5072696365", "302e3235", 1278132306875000, false],
    ["53656374696f6e", "43616e6479", 1278132493790000, false]
  ]
}

…and will export supercolumn-filled Column Families like this:

{
  "ultralights": {
    "756c3233": {
      "deletedAt": -9223372036854775808,
      "subColumns": [
        ["7365617432", "392070656f706c65", 1283394499763000, false]
      ]
    }
  },
  "planes": {
    "706c616e65313436": {
      "deletedAt": -9223372036854775808,
      "subColumns": [
        ["726f773138", "372070656f706c65", 1283394371843000, false],
        ["726f773237", "322070656f706c65", 1283394387348000, false]
      ]
    },
    "706c616e65353436": {
      "deletedAt": -9223372036854775808,
      "subColumns": [
        ["726f773232", "332070656f706c65", 1283394349929000, false]
      ]
    }
  }
}

Several things are different from our JSON sets to date:

  • Column names, values, and supercolumn keys are hex-encoded rather than plain strings.  For example, instead of “Price”, you see “5072696365” in the JSON above. (See the quick check below to try it yourself.)
  • There are extra fields, such as “deletedAt” and the trailing “false” flags.  Fortunately, it appears that these can be filled in with placeholder values.
  • Columns are filed under a “subColumns” node within each supercolumn entry.
  • JSON array structures are used in place of hierarchies.

…but it’s not impossible to generate; it’s just different.
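The hex strings are simply the UTF-8 bytes of each name or value written out in hexadecimal.  Here is a quick check in Python (not part of the DivConq toolchain, just a way to convince yourself):

# "Price" encodes to the hex string shown in the sstable2json output above,
# and the hex string "53656374696f6e" decodes back to "Section".
print("Price".encode("utf-8").hex())                    # 5072696365
print(bytes.fromhex("53656374696f6e").decode("utf-8"))  # Section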

Creating JSON2SStable Format Files

Before we continue, we need to wrap our JSON datasets in one more node to represent the datastore – so far our top level has been column families, and the only remaining column family is now “Orders”.

Fortunately we can do this without a special utility: just a few lines of a batch file are needed to add a top-level “Northwind” node.

rem Add an extra wrapper for the name of the datastore
echo { "Northwind" : > JSONout12a.txt
type JSONout12.txt >> JSONout12a.txt
echo } >> JSONout12a.txt
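After this step the wrapped file should look roughly like this (individual order rows abridged to “…”).  The extra outer braces come from the echo commands; the JSON is still valid, just not prettily indented.

{ "Northwind" :
{
  "Orders" : {
    "[OrderID]" : { ... }
  }
}
}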

Now we’re finally ready to use a DivConq utility to convert our human-readable JSON into the format needed by Cassandra’s JSON2SStable utility.  This part is easy.

rem Now convert wrapped dataset to json2sstable-ready
rem Cassandra array import format
PrepJSONForSSTableImport JSONout12a.txt JSONout13.txt

Now you should have a new, larger file filled with all the information Cassandra will need for its native import utility.
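The real work is done by PrepJSONForSSTableImport, but if you are curious what it has to produce, here is a rough Python sketch of the transformation for a single row.  It assumes one made-up microsecond timestamp for every column and the placeholder “deletedAt” value seen earlier; the actual utility may differ in its details.

import json, time

FAKE_DELETED_AT = -9223372036854775808   # placeholder meaning "never deleted"

def to_sstable_row(supercolumns, timestamp=None):
    # Turn {"SuperColumn": {"name": "value", ...}} into the hex-encoded,
    # array-based supercolumn structure shown in the sstable2json examples.
    ts = timestamp or int(time.time() * 1000000)
    row = {}
    for sc_name, columns in supercolumns.items():
        row[sc_name.encode("utf-8").hex()] = {
            "deletedAt": FAKE_DELETED_AT,
            "subColumns": [
                [name.encode("utf-8").hex(), str(value).encode("utf-8").hex(), ts, False]
                for name, value in columns.items()
            ],
        }
    return row

print(json.dumps({"10778": to_sstable_row({"OrderInformation": {"Freight": "6.7900"}})}, indent=2))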

The Whole Export, Convert and Prep for Import Process

You may have noticed that the DivConq utilities ship with an “exportandimport.bat” file that performs all the steps covered so far.  Running this batch file should generate output like this.

C:\divconq\dotnet>exportandimport
22:38:39 Found expected organization in the "Orders" object.
22:38:39 Found expected organization in the "Order Details" object.
22:38:40 Completed OK.  Moved 2155 children and found 0 orphans.
22:38:40 WARNING: MergeAsName does not contain an [ID] or other macro.  This could lead to invalid JSON through duplicate keys in merged children!
22:38:40 Found expected organization in the "Orders" object.
22:38:40 Found expected organization in the "Employees" object.
22:38:41 Completed OK.  Moved 830 children and found 0 orphans.
22:38:41 WARNING: MergeAsName does not contain an [ID] or other macro.  This could lead to invalid JSON through duplicate keys in merged children!
22:38:42 Found expected organization in the "Orders" object.
22:38:42 Found expected organization in the "Customers" object.
22:38:43 Completed OK.  Moved 830 children and found 0 orphans.
22:38:43 WARNING: MergeAsName does not contain an [ID] or other macro.  This could lead to invalid JSON through duplicate keys in merged children!
22:38:43 Found expected organization in the "Orders" object.
22:38:43 Found expected organization in the "Products" object.
22:38:45 Completed OK.  Moved 2155 children and found 0 orphans.
22:38:45 WARNING: MergeAsName does not contain an [ID] or other macro.  This could lead to invalid JSON through duplicate keys in merged children!
22:38:46 Found expected organization in the "Orders" object.
22:38:46 Found expected organization in the "Suppliers" object.
22:38:48 Completed OK.  Moved 2155 children and found 0 orphans.
22:38:48 WARNING: MergeAsName does not contain an [ID] or other macro.  This could lead to invalid JSON through duplicate keys in merged children!
22:38:50 Found expected organization in the "Orders" object.
22:38:50 Found expected organization in the "Categories" object.
22:38:52 Completed OK.  Moved 2155 children and found 0 orphans.
22:38:53 Found expected organization in the "Orders" object.
22:38:55 Completed OK.  Moved 11620 children and found 0 orphans.
22:38:57 Completed OK.  Deleted 1 nodes.
22:38:59 Completed OK.  Deleted 9130 nodes.
22:39:01 Completed OK.  Deleted 6640 nodes.
22:39:02 Completed OK.  Deleted 43930 nodes.
22:39:03 Completed OK.  Deleted 4980 nodes.
22:39:08 Completed OK.  Did 39140 nodes.

This batch file and any of its commands can, of course, be modified to taste or to work with other datasets.

Importing Into Cassandra

If you simply run Cassandra’s JSON2SStable command you’ll see some short usage information.

C:\work\apache-cassandra-0.6.3>bin\json2sstable.bat
Missing required options: Kc
Usage: org.apache.cassandra.tools.SSTableImport -K keyspace -c column_family <json> <sstable>

…but please use the following procedure to properly import your JSON dataset.

First, shut down your Cassandra client and server (if started).  Then go into your Cassandra folder and open up your “conf\storage-conf.xml” file.  Add the following entry to this file and save it.  (You can substitute your own name for “NorthwindOne” as long as you use it consistently below.)

<Keyspace Name="NorthwindOne">
  <ColumnFamily Name="Orders"
     CompareWith="UTF8Type"
     ColumnType="Super" CompareSubcolumnsWith="UTF8Type"
     />
   <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
   <ReplicationFactor>1</ReplicationFactor>
   <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
</Keyspace>

Once you’ve saved this file, start the Cassandra server again.  If your configuration changes were accepted, this will create a new, empty “NorthwindOne” directory under your Cassandra server’s data folder.

You should also fire up the Cassandra client to check that your new datastore is live.

C:\work\apache-cassandra-0.6.3>bin\cassandra-cli --host localhost
Starting Cassandra Client
Connected to: "Test Cluster" on localhost/9160
Welcome to cassandra CLI.

Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
cassandra> show keyspaces;
NorthwindOne
Keyspace1
system

Now, stop the Cassandra server again and shut down the Cassandra client again.  (The Cassandra client doesn’t respond well to the server going up and down.)

To properly invoke the JSON2SStable utility, use the following syntax, substituting the appropriate values and paths as necessary.

In the example below, “NorthwindOne” is the name of our keyspace and must match the value we saved into the “conf\storage-conf.xml” file above.  “Orders” is the name of the new column family we will be creating and inserting our native-formatted JSON into.  The path to the “JSONout13.txt” file is, of course, the file we’re importing.  Finally, the path to the “Orders-1-Data.db” file indicates which Cassandra data file we will create.  Note that this file does not yet exist, but the rest of the path (the folder structure) must already be in place.

C:\work\apache-cassandra-0.6.3>bin\json2sstable.bat -K NorthwindOne -c Orders C:\divconq\dotnet\JSONout13.txt C:\var\lib\cassandra\data\NorthwindOne\Orders-1-Data.db

If this works correctly, it will take a few seconds to silently import the data and will then silently return you to the command prompt.  If you see any other output from this command, you encountered an error.

Another way to quickly confirm that data was imported successfully is to eyeball the Cassandra data directory.  This should now contain three new files: Orders-1-Data.db, Orders-1-Filter.db and Orders-1-Index.db.

If you see these files, go ahead and fire up your Cassandra server and client again.

Working With Your Imported Data

Finally, it’s time to view the live data on the live Cassandra server. Try these commands first.

C:\work\apache-cassandra-0.6.3>bin\cassandra-cli --host localhost
Starting Cassandra Client
Connected to: "Test Cluster" on localhost/9160
Welcome to cassandra CLI.

Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
cassandra> show keyspaces;
NorthwindOne
Keyspace1
system
cassandra> get NorthwindOne.Orders['10778']['OrderInformation']
=> (column=ShippedDate, value=12/24/1997 12:00:00 AM, timestamp=1269842588093)
=> (column=ShipVia, value=1, timestamp=1269842588093)
=> (column=ShipRegion, value=, timestamp=1269842588093)
=> (column=ShipName, value=Berglunds snabbk?, timestamp=1269842588093)
=> (column=ShipCity, value=Lule?, timestamp=1269842588093)
=> (column=ShipAddress, value=Berguvsv?gen  8, timestamp=1269842588093)
=> (column=RequiredDate, value=1/13/1998 12:00:00 AM, timestamp=1269842588093)
=> (column=OrderDate, value=12/16/1997 12:00:00 AM, timestamp=1269842588093)
=> (column=Freight, value=6.7900, timestamp=1269842588093)
Returned 9 results.
cassandra> get NorthwindOne.Orders['10778']
=> (super_column=OrderInformation,
     (column=Freight, value=6.7900, timestamp=1269842588093)
     (column=OrderDate, value=12/16/1997 12:00:00 AM, timestamp=1269842588093)
     (column=RequiredDate, value=1/13/1998 12:00:00 AM, timestamp=1269842588093)
     (column=ShipAddress, value=Berguvsv?gen  8, timestamp=1269842588093)
     (column=ShipCity, value=Lule?, timestamp=1269842588093)
     (column=ShipName, value=Berglunds snabbk?, timestamp=1269842588093)
     (column=ShipRegion, value=, timestamp=1269842588093)
     (column=ShipVia, value=1, timestamp=1269842588093)
     (column=ShippedDate, value=12/24/1997 12:00:00 AM, timestamp=1269842588093)
)
=> (super_column=ItemEntry_1393,
     (column=Category_CategoryName, value=Seafood, timestamp=1269842588093)
     (column=Discount, value=0, timestamp=1269842588093)
     (column=ProductID, value=41, timestamp=1269842588093)
     (column=Product_ProductID, value=41, timestamp=1269842588093)
     (column=Product_ProductName, value=Jack's New England Clam Chowder, timestamp=1269842588093)
     (column=Product_SupplierID, value=19, timestamp=1269842588093)
     (column=Product_UnitsOnOrder, value=0, timestamp=1269842588093)
     (column=Quantity, value=10, timestamp=1269842588093)
     (column=Supplier_CompanyName, value=New England Seafood Cannery, timestamp=1269842588093)
     (column=Supplier_Country, value=USA, timestamp=1269842588093)
     (column=Supplier_SupplierID, value=19, timestamp=1269842588093)
     (column=UnitPrice, value=9.6500, timestamp=1269842588093))
=> (super_column=Employee,
     (column=Country, value=USA, timestamp=1269842588093)
     (column=FirstName, value=Janet, timestamp=1269842588093)
     (column=LastName, value=Leverling, timestamp=1269842588093)
     (column=Title, value=Sales Representative, timestamp=1269842588093)
     (column=TitleOfCourtesy, value=Ms., timestamp=1269842588093))
=> (super_column=Customer,
     (column=CompanyName, value=Berglunds snabbk?, timestamp=1269842588093)
     (column=Country, value=Sweden, timestamp=1269842588093))
Returned 4 results.

You can pick other Order IDs and supercolumn values (e.g., “Customer”, “Employee”, various “ItemEntry_” values) to view those values too.
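For example, still working with Order 10778, these follow the same syntax demonstrated above:

cassandra> get NorthwindOne.Orders['10778']['Customer']
cassandra> get NorthwindOne.Orders['10778']['ItemEntry_1393']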

Next Steps

At this point you have the tools and documentation to import not only the Microsoft Northwind Access database into Cassandra, but similar databases as well. This concludes the “Migrate a Relational Database into Cassandra” series of articles.

The next set of articles will describe how to build a working application on top of Cassandra.

Troubleshooting

If you encounter errors during import, feel free to shut down the server, wipe the data files from the keyspace’s data folder, and try the import again.

Also look for entries like this in the “C:\var\log\cassandra\system.log” file.  This specific instance came from a problem importing extra data through this process, but errors like it are really telling you that the “Trains-2-Data.*” files are useless and that the older “Trains-1-Data.*” files are still good.

 INFO [main] 2010-09-02 21:48:20,172 SSTableReader.java (line 120) Sampling index for C:\var\lib\cassandra\data\TransSchedTwo\Trains-1-Data.db
 INFO [main] 2010-09-02 21:48:20,177 SSTableReader.java (line 120) Sampling index for C:\var\lib\cassandra\data\TransSchedTwo\Trains-2-Data.db
ERROR [main] 2010-09-02 21:48:20,183 ColumnFamilyStore.java (line 182) Corrupt file C:\var\lib\cassandra\data\TransSchedTwo\Trains-2-Data.db; skipped
 

Posted 2010-Sep-09 in Cassandra, Intermediate, JSON, Northwind


Migrate a Relational Database into Cassandra (Part III – Northwind Conversion)

This article describes how to convert a JSON document containing the complete contents of the Northwind Access database into Cassandra-friendly JSON.

Before proceeding, you should understand my previous “Part II” article on “Northwind Planning” – this article implements that plan.

You should also have downloaded and installed either the .NET version or Mono version of the DivConq JSON command-line utilities, and should also have a complete JSON document from a conversion of the Northwind database export.  (You can also start with the “NorthwindJSON.txt” document from this archive.)

What To Do?

Copy the following batch file off into the folder that contains your DivConq command-line utilities and your Northwind JSON document.  Then open it using your favorite text editor and skip down to the “Understanding the Flow” section below.

@echo off
rem xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
rem X
rem X This batch file uses several divconq.com utilities. First it exports an Access DB
rem X called "Northwind" into a single large JSON structure.  Then it manipulates the
rem X JSON structure until it conforms to a denormalized form suitable for insertion
rem X into a Cassandra DB.  Finally, it imports the JSON structure into a Cassandra DB.
rem X
rem X August 11, 2010 - Jonathan Lampe - jonathan.lampe@divconq.com
rem X
rem xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
rem .
rem Here is the basic schema of the original DB
rem .
rem Categories
rem Customers
rem Employees
rem Order Details
rem   ProductID
rem   OrderID
rem Orders
rem   CustomerID
rem   EmployeeID
rem Products
rem   SupplierID
rem   CategoryID
rem Suppliers
rem .
rem xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
rem .
rem Here is the data schema represented with 1:N and N:1 relationships
rem
rem    Customer }                             { Supplier
rem    Employee } 1:N Orders N:N Products N:1 { Category
rem
rem ...where the N:N relationship is expressed in the "Order Details" table
rem
rem xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
rem .
rem Here is the denormalized, Cassandra-friendly schema that is constructed:
rem .
rem Keyspace=Northwind
rem   ColumnFamily=Orders
rem     RowKey=OrderID
rem       1 SuperColumn="OrderInformation"
rem         Columns=(contents from one Orders entry)
rem       1 SuperColumn="Customer"
rem         Columns=(contents from one Customers entry)
rem       1 SuperColumn="Employee"
rem         Columns=(contents from one Employees entry)
rem       0-N SuperColumns="ItemEntry_"+[NumericCount]
rem         Columns=(contents from one Order Details entry) +
rem                 (contents from one Product entry) +
rem                 (contents from one Supplier entry) +
rem                 (contents from one Category entry)
rem .
rem xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
rem .
rem Bring in 0-N item entries per order, each as a new supercolumn
consolidatejsonobjects NorthwindJSON.txt JSONout.txt Orders "Order Details" ItemEntry_[ID] "OrderID" yes yes 4
rem .
rem Bring in exactly one employee per order as a new supercolumn
distributejsonobjects JSONout.txt JSONout2.txt Orders "Employees" Employee "EmployeeID" yes yes 4
rem .
rem Bring in exactly one customer per order as a new supercolumn
distributejsonobjects JSONout2.txt JSONout3.txt Orders "Customers" Customer "CustomerID" yes yes 4
rem .
rem Bring in exactly one product entry per item entry as new columns in existing supercolumn
distributejsonobjects2 JSONout3.txt JSONout4.txt Orders ItemEntry_* Products Product_ ProductID yes yes 4
rem .
rem Bring in one supplier entry and one category entry per item entry as new columns in existing supercolumn
rem   (note these depend on previously imported Product entries)
distributejsonobjects2 JSONout4.txt JSONout5.txt Orders ItemEntry_* Suppliers Supplier_ Product_SupplierID yes yes 4
distributejsonobjects2 JSONout5.txt JSONout6.txt Orders ItemEntry_* Categories Category_ Product_CategoryID yes yes 4
rem .
rem Demote remaining rowkey-columns into a new supercolumn
rem .
demotejsonpairsintonewobject JSONout6.txt JSONout7.txt Orders OrderInformation yes yes 4

Understanding the Flow

The top part of the batch file simply reiterates what we planned in the previous article.  The end is where all the processing actually takes place.  Between the “rem” comment lines there are 7 JSON conversions, each performing a single piece of the consolidation.

Step 1: We open the list of “Order Details” and pull in zero, one or multiple entries as new “ItemEntry_[ID]” nodes.
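To make Step 1 concrete, here is a rough before-and-after sketch for a single order.  The field lists are heavily abridged and the key values are only illustrative; the real item-entry IDs come from the “ID4Tables” option used during the original Access export.

Before (separate column families):

{
  "Orders" : {
    "10778" : { "Freight" : "6.7900", "CustomerID" : "BERGS" }
  },
  "Order Details" : {
    "1393" : { "OrderID" : "10778", "ProductID" : "41", "Quantity" : "10" }
  }
}

After Step 1 (each matching "Order Details" child becomes an "ItemEntry_[ID]" node under its order):

{
  "Orders" : {
    "10778" : {
      "Freight" : "6.7900",
      "CustomerID" : "BERGS",
      "ItemEntry_1393" : { "OrderID" : "10778", "ProductID" : "41", "Quantity" : "10" }
    }
  }
}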

Steps 2 and 3: We reach out and import exactly one Customer node and one Employee node into our master Order record.

(Note that Steps 1, 2 and 3 can really occur in any order because these three entities are completely independent.  However, “Order Details” must have been imported before Step 4 can take place.)

Step 4: We reach out and import the contents of exactly one Product node into each existing ItemEntry node by looking up ProductIDs from existing ItemEntry name/value pairs.

(Note that Step 4 – the Product import – must take place before Steps 5-6 can take place.)

Steps 5-6: We reach out and import the contents of exactly one Category node and exactly one Supplier node into each existing ItemEntry node by looking up CategoryIDs and SupplierIDs from recently imported ItemEntry > Product information.

Step 7: Consolidate the remaining entries under the main Order entry into a new OrderInformation node.

(Step 7 – the consolidation of hanging entries – should always go last in case previous steps depend on keys in their original positions.)

To see complete sample output from each step of the conversion (as generated by the .NET utilities), unpack the archive here.

Running the Batch File

You may need to change the name of the original “NorthwindJSON.txt”  and/or may want to change the name of the final output file (from “JSONout7.txt”), but other than that the batch file should be ready to run.  To capture log and error output so it’s easier to read, use this command:

D:\divconq\dotnet>ConsolidateNorthwindJSON.bat > log.txt 2> err.txt

Next Steps

The next article in this series shows how to import the complete JSON document into Cassandra.

More About The Utilities

ConsolidateJSONObjects is used to dive into a foreign table used to express an N:N relationship from the point of view of one entity in the relationship.  It generates 0-N supercolumn-level JSON nodes for each foreign node discovered.  This utility is used once during the Northwind conversion to bring in 0-N item entries per order.

consolidatejsonobjects NorthwindJSON.txt JSONout.txt Orders "Order Details" ItemEntry_[ID] "OrderID" yes yes 4

DistributeJSONObjects and DistributeJSONObjects2 are used to select foreign keys in the main table and look them up against a foreign table: they handle N:1 relationships and bring in exactly one node for each foreign key scanned (assuming DB consistency, of course).   The difference between the two is that DistributeJSONObjects imports discovered nodes as supercolumn-level nodes, while DistributeJSONObjects2 imports the contents of discovered nodes as column-level name/value pairs under an existing supercolumn-level node.  These utilities are used five times during the Northwind conversion: two invocations of DistributeJSONObjects to import the 1-per-order Employee and Customer nodes as supercolumns, and three invocations of DistributeJSONObjects2 to import the contents of the 1-per-itementry Product, Supplier and Category nodes.

rem Bring in exactly one employee per order as a new supercolumn
distributejsonobjects JSONout.txt JSONout2.txt Orders "Employees" Employee "EmployeeID" yes yes 4
rem .
rem Bring in exactly one customer per order as a new supercolumn
distributejsonobjects JSONout2.txt JSONout3.txt Orders "Customers" Customer "CustomerID" yes yes 4
rem .
rem Bring in exactly one product entry per item entry as new columns in existing supercolumn
distributejsonobjects2 JSONout3.txt JSONout4.txt Orders ItemEntry_* Products Product_ ProductID yes yes 4
rem .
rem Bring in one supplier entry and one category entry per item entry as new columns in existing supercolumn
rem   (note these depend on previously imported Product entries)
distributejsonobjects2 JSONout4.txt JSONout5.txt Orders ItemEntry_* Suppliers Supplier_ Product_SupplierID yes yes 4
distributejsonobjects2 JSONout5.txt JSONout6.txt Orders ItemEntry_* Categories Category_ Product_CategoryID yes yes 4

DemoteJSONPairsIntoNewObject is simply used to clean up lingering columns (name/value pairs) at the “row” level and put them into a new supercolumn-level object.   One invocation of this utility is used during the Northwind conversion.

demotejsonpairsintonewobject JSONout6.txt JSONout7.txt Orders OrderInformation yes yes 4

All utilities will display additional help, including complete documentation for each parameter, if invoked without any command-line parameters.  (For example, simply execute “consolidatejsonobjects.exe” to find out what the eighth parameter – “yes” in the example above – means.)

Troubleshooting

If you are experimenting with these utilities and data sets of any size, it is highly recommended that you put all your commands in a batch file.  Then, from the command prompt, run your batch file like this:

D:\divconq\dotnet>divconq_test23.bat > log.txt 2> err.txt

…to capture logging output and error output in two separate files.

 

Posted 2010-Aug-08 in Cassandra, Intermediate, JSON, Northwind


Migrate a Relational Database into Cassandra (Part II – Northwind Planning)

This article describes how to fit a relational database structure containing an N:N relationship and several N:1 relationships into a Cassandra-friendly structure.

You should be familiar with the concepts in my “Migrate a Relational Database Structure into a NoSQL Cassandra Structure (Part I)” article before proceeding.

In this article we are going to examine the “Northwind” database shipped as an example with Microsoft Access.  Having a local copy of this database is not needed to understand this article, but if you are interested in obtaining a copy (and/or want to see how to extract it to JSON), visit this article and then return.

Northwind Database Schema

The relationships of the various tables that make up the Northwind Access database can be summed up like this:

Customer }                            { Supplier
Employee } 1:N Order N:N Products N:1 { Category

In other words:

  • Every order is tied to a single customer and a single employee
  • Every product is tied to a single supplier and a single category
  • Every order can have a variety of different products
  • Every product can be part of a variety of different orders

Notice that I’m ignoring the “N” part of the customer, employee, supplier and category relationships for now.  I’m also ignoring the “Order Details” table that provides the links necessary to implement the “N:N” relationship depicted above.

Instead, I’m focusing on the important “N:N” relationship between order and product.  In a relational database setting I can often simply normalize away and worry about which table I need to focus my indexing efforts on later when I’m working in the application.  However, in NoSQL, non-relational database design, we often need to decide up front which entity most queries will be interested in and build everything else around that entity.

So…will it be “order” or “product”?  Today I’ll decide that the key entity in this database is “order” – customers will be hitting this on a daily, per transaction basis whereas I can probably run my product reports offline.

Sketching Out a Viable Cassandra Schema for the Northwind Database

OK, so “order” is our key element.  Let’s see how we can squeeze that into a Cassandra-friendly data structure.  First, let’s review Cassandra’s keyspace->column family->row->super column->column data structure.  (If you need more of a refresher, see this.)

{
  "keystore" : {
    "column family" : {
      "row key" : {
        "supercolumn key" : {
          "column name" : "value"
        }
      }
    }
  }
}

Remember that the first two levels of this structure (keyspace->column family) will essentially be burned up with the name of the database and the name of the table.  That leaves us one level (row) for each “order” entry and one level (supercolumn) for each subcomponent of each “order” entry (such as a block of employee information).  The remaining level (column) has to be used for the name/value pairs that contain our information.

That’s a tight fit, but not an impossible one.  Let’s get started on the side of the relational diagram that only contains N:1 relationships.

Customer }
Employee } 1:N Order

Again, we’re going to ignore the fact that a single customer can be tied to multiple orders and that a single employee can also be tied to multiple orders.  Instead, we’re going to focus on the fact that a single order has one customer and one employee – period.

Here’s a data structure that accommodates that line of thinking.

{
  "Northwind" : {
    "Orders" : {
      "[OrderID]" : {
        "OrderInformation" : {
          "column name" : "value"
        },
        "Employee" : {
          "column name" : "value"
        },
        "Customer" : {
          "column name" : "value"
        }
      }
    }
  }
}

So far, so good.  But where did that extra “OrderInformation” entry come from?  The answer is that we demoted the original information from the original order record so that column name/value pairs are always at the same level in our Cassandra database.  (More on that.)
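As a rough sketch (field names abridged), the demotion turns a row that still has loose name/value pairs like this:

"[OrderID]" : {
  "OrderDate" : "value",
  "Freight" : "value",
  "Employee" : { "column name" : "value" },
  "Customer" : { "column name" : "value" }
}

…into a row whose name/value pairs all live one level down:

"[OrderID]" : {
  "OrderInformation" : {
    "OrderDate" : "value",
    "Freight" : "value"
  },
  "Employee" : { "column name" : "value" },
  "Customer" : { "column name" : "value" }
}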

Tackling the N:N Relationship

Now what should we do about the “N:N” relationship from orders to products?

                       { Supplier
Order N:N Products N:1 { Category

We could call each N:N Order:Product relationship an “ItemEntry” and consider doing something like this:

{
  "Northwind" : {
    "Orders" : {
      "[OrderID]" : {
        "[ItemEntry_ID]" : {
          "ItemInformation" : {
            "column name" : "value"
          },
          "Product" : {
            "column name" : "value",
            "Category" : {
              "column name" : "value"
            },
            "Supplier" : {
              "column name" : "value"
            }
          }
        }
      }
    }
  }
}

…but that would be illegal because it requires two more levels than Cassandra allows (e.g., “Product > Category”).

Instead we need a structure that looks more like this:

{
  "Northwind" : {
    "Orders" : {
      "[OrderID]" : {
        "[ItemEntryID]_ItemInformation" : {
          "column name" : "value"
        },
        "[ItemEntryID]_Product" : {
          "column name" : "value"
        },
        "[ItemEntryID]_ProductSupplier" : {
          "column name" : "value"
        },
        "[ItemEntryID]_ProductCategory" : {
          "column name" : "value"
        }
      }
    }
  }
}

…or this:

{
  "Northwind" : {
    "Orders" : {
      "[OrderID]" : {
        "[ItemEntryID]" : {
          "ItemInformation_column1" : "value",
          "ItemInformation_column2" : "value",
          "Product_column1" : "value",
          "Product_column2" : "value",
          "ProductSupplier_column1" : "value",
          "ProductSupplier_column2" : "value",
          "ProductCategory_column1" : "value",
          "ProductCategory_column2" : "value"
        }
      }
    }
  }
}

It looks good on paper but there’s an extra problem with the “Order Details” table we’re using to provide the N:N relationship in the Northwind database.  The extra problem is that there are no unique IDs on the Order Details entries.  With that in mind, we’ll need to take special care during export/consolidation of legacy Order Details data (and with new entries in the future) to automatically populate a unique ID for each Order Details entry as we add it.  (This isn’t a big deal – we could either go with unique counts – e.g., 1, 2, 3… – within each Order entry or commit to the use of GUIDs for the same purpose.)  This exercise uses unique IDs added to the original Northwind JSON document through the use of the “ID4Tables” option in the ExportAccessDBToJSON utility.

With all that in place, here is our final, Cassandra-ready data structure with an “order”-centric focus and all the information from our original Northwind relational structure.

{
  "Northwind" : {
    "Orders" : {
      "[OrderID]" : {
        "OrderInformation" : {
          "column name" : "value"
        },
        "Employee" : {
          "column name" : "value"
        },
        "Customer" : {
          "column name" : "value"
        },
        "ItemEntry_[ItemEntryID]" : {
          "ItemInformation_column1" : "value",
          "ItemInformation_column2" : "value",
          "Product_column1" : "value",
          "Product_column2" : "value",
          "ProductSupplier_column1" : "value",
          "ProductSupplier_column2" : "value",
          "ProductCategory_column1" : "value",
          "ProductCategory_column2" : "value"
        }
      }
    }
  }
}

Next Steps

Next I will show exactly how to use DivConq utilities to implement our plan and convert our Northwind JSON to a Cassandra-ready format.

 

Posted 2010-Jul-07 in Cassandra, Intermediate, Northwind


Utility to Denormalize JSON Structures by Consolidating Two Object Trees

This article is about a DivConq utility that denormalizes JSON structures by consolidating two object trees at a time.

If you need to learn about JSON first, start here.  If you’re not sure why you’d want to denormalize JSON structures, read this article first.

Downloading and Installing

This utility was written in C# and compiled for both .NET and Mono.  The two compilations are identical except:

  • The .NET compilation supports console text colors – the Mono compilation does not
  • The Mono compilation does not (yet) support “utf32” encoding

To get the .NET executables, download this URL:
http://www.divconq.com/wp-content/uploads/2013/05/JSONUtils_dotnet_v1001.zip

To get the Mono executables, download this URL:
http://www.divconq.com/wp-content/uploads/2013/05/JSONUtils_mono_v1001.zip

After downloading this Zip file, unpack it into a local directory and run your choice of “dotnet” or “mono” executable (each is unpacked to a folder of that name).

Command Syntax

ConsolidateJSONObjects v1.0 - July 26, 2010
  Denormalizes/distributes one JSON object into another.
  Please visit http://www.divconq.com for more information.
  This utility uses the Procurios C#.NET JSON library under the MIT license.

  Usage:
   ConsolidateJSONObjects InputFile OutputFile MergeIntoObject MergeFromObject
     MergeAsName ForeignKey [DeleteFrom] [FormatHack] [DebugLevel]
     ...where:
       InputFile = name or path of input file containing original JSON
       OutputFile = name or path of output file to write final JSON into
       MergeIntoObject = name of JSON object into which object will be merged
       MergeFromObject = name of JSON object to read data from for merge
       MergeAsName = name template of new grandchild objects created in the
           MergeIntoObject - this parameter uses a [ID] macro, e.g., car_[ID]
       ForeignKey = name of column which contains the foreign key value used
           to connect MergeFromObject children with MergeToObject children
       DeleteFrom = Set to NO to preserve MergeFromObject in OutputFile
           Default is YES, which causes MergeFromObject to be deleted.
       FormatHack = Set to NO to prevent utility from injecting CRLF to
           make the output prettier.  Default is YES.
       DebugLevel = Set to 1 to see only warnings, set to 0 to hide those
           too.  Default is 2.  Set to 3 to see a little more information.

  Example: (the following two invocations do exactly the same thing)
   ConsolidateJSONObjects forestandtrees.txt justforests.txt forests
     famoustrees famoustree_[ID] forestID
   ConsolidateJSONObjects forestandtrees.txt justforests.txt forests
     famoustrees famoustree_[ID] forestID YES YES 2

Short Example of Use

Now is a good time to go back and re-read the “Migrate a Relational Database Structure into a NoSQL Cassandra Structure (Part I)” article.  This is exactly what we’re going to do with the ConsolidateJSONObjects utility in this example.

To get started, download the input data file containing the “forests” and “famoustrees” object trees from this URL.  To conserve keystrokes,  save this file in the same directory as either your mono or dotnet edition of ConsolidateJSONObjects.exe.

Next, open a console or terminal and CD into the directory with your  “ConsolidateJSONObjects.exe” executable and the “forestandtrees.txt” file you just downloaded.  Run “ConsolidateJSONObjects.exe” without any arguments – expect to see the console help for the utility displayed.

Now let’s try it again with all six of the required arguments filled out:

  • Input and output JSON files: pretty self-explanatory
  • The “MergeInto” object is where we want all our data to end up.  In this case it’s the “forests” object.
  • The “MergeFrom” object is the object we want to collapse.  In this case it’s the “famoustrees” object.
  • “MergeAsName” is completely up to you: it determines the name/key of the new object added to each child of the MergeInto object.  Unless you’re merging “1:1” object relationships, this value should contain the string “[ID]”, which will help ensure that object names/keys remain unique in their parent nodes.
  • “ForeignKey” is the name/key of existing grandchildren of the MergeFrom object and is used to link the children of the MergeInto object with the children from the MergeFrom object.

The expected output from this command is shown below.  Notice the one warning – this indicates that there was a “famoustree” that could not be chained up to a specific “forest”.

C:\jsontools\dotnet>consolidatejsonobjects forestandtrees.txt JSONout.txt forests famoustrees famoustree_[ID] forestID
00:10:38 Found expected organization in the "forests" object.
00:10:38 Found expected organization in the "famoustrees" object.
00:10:38 WARNING: tree23456 Foreign Key=forest002 not found in target JSON structure!
00:10:38 Completed OK. Moved 3 children and found 1 orphans.

Open the “JSONout.txt” file to view the output.  You’ll notice that it’s not as pretty as my hand-formatted input, but the utility revalidates output JSON before writing it so at least you know it’s real JSON.

Take a close look at the “Black Forest” forest entry and notice that it’s grown two “famoustree_*” children, each with a unique name derived from its unique key in the old famoustrees object.  Also notice that the whole famoustrees object is missing from the output – even the entry that generated the warning because it couldn’t be matched was deleted.

If you open the original “forestandtrees.txt” JSON file you should be able to find the source of the warning by searching for the value of “tree23456”.  You can practice taking out this node or adding a stub for the proper forests entry to see how the utility responds, or test it with other data of your choice.
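If you would rather see the shape of the data before downloading anything, here is a toy fragment in the same spirit as the sample file.  These are not the actual file contents – the names and values are made up – but they illustrate the layout the utility expects, including the orphaned tree23456 entry that triggers the warning above.

{
  "forests" : {
    "forest001" : { "name" : "Black Forest", "country" : "Germany" },
    "forest003" : { "name" : "Sherwood Forest", "country" : "England" }
  },
  "famoustrees" : {
    "tree12345" : { "name" : "Old Oak", "forestID" : "forest001" },
    "tree23456" : { "name" : "Lost Larch", "forestID" : "forest002" }
  }
}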

Once you are comfortable with the way the basic six parameters work and consolidation happens, move on and experiment with the three optional parameters.

First, add a value of “NO” to the end of your six values in the seventh slot (the “DeleteFrom” parameter) to prevent the MergeFrom object (e.g., “famoustrees”) from being deleted.  You might use this in a denormalization scenario where you need to consolidate multiple sets of objects (through multiple invocations of this utility) and you need to keep the MergeFrom object around until the final merge.

Next, add a second value of “NO” in the eighth slot (the “FormatHack” parameter) to prevent the utility from trying to add some whitespace and line feeds to your JSON output.  You would use this if the utility’s attempts to make the JSON output pretty were messing up the data in the JSON output or yielding invalid JSON.  (The format hack also incurs a small, usually negligible performance hit; you could use this flag to avoid that too.)

Finally, try different values from 0-3 in the ninth slot (the “DebugLevel” parameter).  Smaller values yield less output and the utility is currently not coded to care about any value larger than 3.

Next Steps

The next article in this series will show you how to use this utility and others like it to prepare JSON output from the Northwind relational database for import into Cassandra.

 

Posted 2010-Jul-07 in Intermediate, JSON
