
Archive for the ‘JSON’ Category

Migrate a Relational Database into Cassandra (Part IV – Northwind Import)

11 Sep

This article shows how to prepare and import a dataset expressed in Cassandra-friendly JSON into a Cassandra datastore using Cassandra’s “json2sstable” utility.

Before proceeding, you should understand my previous “Part 3” article on “Northwind Conversion” – this article imports the JSON dataset created in that article.

You should also have downloaded and installed either the .NET version or Mono version of the DivConq JSON command-line utilities, and should also have a complete JSON document from a conversion of the Northwind database export. (You can also start with the “JSONout7.txt” document from this archive.)

Cleaning the Data

So far everything we’ve done has simply moved data around.  This has led to a JSON structure that contains everything and then some from the original relational database.  We could import that, but from here on out we’ll treat this data more like a traditional data warehouse by only working with a subset of the original data.

To do the stripping, we can use a DivConq utility called “StripNodesFromJSON”.   The following batch snippet cuts out extra nodes (like “Shippers”) and tags (like “Phone”).

rem Let us turn this structure into something we can use in a data warehouse
rem Strip a lot of the extra tags out
rem x Get rid of the Shippers node
StripNodesFromJSON JSONout7.txt JSONout8.txt Shippers
rem x Get rid of extra nodes from the Employee node
StripNodesFromJSON JSONout8.txt JSONout9.txt PostalCode Photo Address ReportsTo HireDate HomePhone Notes BirthDate Extension
rem x Get rid of extra nodes from the Customer node
StripNodesFromJSON JSONout9.txt JSONout10.txt City Phone Region ContactTitle Address PostalCode Fax ContactName
rem x Get rid of extra nodes from the ItemEntry nodes
StripNodesFromJSON JSONout10.txt JSONout11.txt OrderID Product_UnitPrice Product_UnitsInStock Product_QuantityPerUnit Product_ReorderLevel Supplier_City Supplier_Region "Order Details_AutoID" Product_CategoryID Supplier_ContactTitle Supplier_ContactName Product_Discontinued Supplier_HomePage Supplier_PostalCode Supplier_Address Category_CategoryID  Category_Picture Category_Description Supplier_Fax Supplier_Phone
rem x Get rid of extra nodes from OrderInformation
StripNodesFromJSON JSONout11.txt JSONout12.txt OrderID ShipPostalCode ShipCountry CustomerID EmployeeID

If you do a directory listing on the intermediate files created in this batch file you should see that each one is smaller than the one before it.
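The behavior of StripNodesFromJSON can be approximated in a few lines of Python. This is a hypothetical sketch of the idea, not the utility’s actual implementation: walk the JSON tree and drop every key whose name appears in a strip list. (The sample document and values here are invented stand-ins.)

```python
import json

def strip_nodes(data, names):
    """Recursively drop any dictionary key whose name is in `names`."""
    if isinstance(data, dict):
        return {k: strip_nodes(v, names) for k, v in data.items() if k not in names}
    if isinstance(data, list):
        return [strip_nodes(v, names) for v in data]
    return data

doc = {"Shippers": {"1": {"CompanyName": "Speedy Express"}},
       "Customers": {"BERGS": {"CompanyName": "Berglunds snabbkop", "Phone": "0921-12 34 65"}}}
print(json.dumps(strip_nodes(doc, {"Shippers", "Phone"})))
# -> {"Customers": {"BERGS": {"CompanyName": "Berglunds snabbkop"}}}
```

Note that the strip works at any depth, which is why one invocation can remove both a whole node (“Shippers”) and individual tags (“Phone”).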

Cassandra’s JSON2SStable Format

If you’ve worked with Cassandra’s SStable2JSON utility, you’ve seen that the format Cassandra uses for its JSON datasets is not human-readable.

Cassandra’s SStable2JSON utility will export plain (no supercolumn) Column Families like this:

{
  "HotWheelsCar": [
    ["5072696365", "312e3439", 1278132336497000, false],
    ["53656374696f6e", "56656869636c6573", 1278132515996000, false]
  ],
  "GumDrop": [
    ["5072696365", "302e3235", 1278132306875000, false],
    ["53656374696f6e", "43616e6479", 1278132493790000, false]
  ]
}

…and will export supercolumn-filled Column Families like this:

{
  "ultralights": {
    "756c3233": {
      "deletedAt": -9223372036854775808,
      "subColumns": [
        ["7365617432", "392070656f706c65", 1283394499763000, false]
      ]
    }
  },
  "planes": {
    "706c616e65313436": {
      "deletedAt": -9223372036854775808,
      "subColumns": [
        ["726f773138", "372070656f706c65", 1283394371843000, false],
        ["726f773237", "322070656f706c65", 1283394387348000, false]
      ]
    },
    "706c616e65353436": {
      "deletedAt": -9223372036854775808,
      "subColumns": [
        ["726f773232", "332070656f706c65", 1283394349929000, false]
      ]
    }
  }
}

Several things are different from our JSON sets to date:

  • The column names, values, and supercolumn names are hex-encoded rather than plain strings.  For example, instead of “Price”, you see “5072696365” in the JSON above.  (Any ASCII-to-hex converter will confirm this.)
  • There are extra fields, such as “deletedAt” values and trailing “false” flags.  Fortunately, it appears that these can be faked up.
  • Columns are filed under a “subColumns” node within each supercolumn entry.
  • JSON array structures are used in place of hierarchies.

…but it’s not impossible, it’s just different.
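The hex encoding is nothing exotic: each name or value is just its UTF-8 bytes written out as hex digits. A couple of lines of Python confirm the examples from the listing above:

```python
# sstable2json's hex strings are simply the UTF-8 bytes of each name or value.
print("Price".encode("utf-8").hex())              # -> 5072696365
print("1.49".encode("utf-8").hex())               # -> 312e3439
print(bytes.fromhex("53656374696f6e").decode())   # -> Section
```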

Creating JSON2SStable Format Files

Before we continue, we need to wrap our JSON datasets in one more node to represent the datastore – so far our top level has been column families, and the only remaining column family is now “Orders”.

Fortunately we can do this without a special utility: just a few lines of a batch file are needed to add a top-level “Northwind” node.

rem Add an extra wrapper for the name of the datastore
echo { "Northwind" : > JSONout12a.txt
type JSONout12.txt >> JSONout12a.txt
echo } >> JSONout12a.txt
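The echo/type trick works because JSON ignores line breaks, but it is worth confirming the wrapped file still parses. A quick Python check (using a small invented stand-in for the contents of JSONout12.txt) looks like:

```python
import json

# Hypothetical stand-in for JSONout12.txt, trimmed to one row
inner = '{ "Orders" : { "10778" : { "OrderInformation" : { "Freight" : "6.7900" } } } }'

# The same wrapping the echo/type lines perform
wrapped = '{ "Northwind" :\n' + inner + '\n}'
doc = json.loads(wrapped)     # raises ValueError if the wrapping broke the JSON
print(list(doc))              # -> ['Northwind']
```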

Now we’re finally ready to use a DivConq utility to convert our human-readable JSON into the format needed by Cassandra’s JSON2SStable utility.  This part is easy.

rem Now convert wrapped dataset to json2sstable-ready
rem Cassandra array import format
PrepJSONForSSTableImport JSONout12a.txt JSONout13.txt

Now you should have a new, larger file filled with all the information Cassandra will need for its native import utility.
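As a rough illustration of what this conversion must do for a plain column family (this is a hypothetical sketch, not the actual PrepJSONForSSTableImport code), each name/value pair becomes a [hex-name, hex-value, timestamp, false] array, with the timestamp faked up when the source data has none:

```python
import time

def to_sstable_columns(columns, timestamp=None):
    """Rewrite {"Price": "1.49"} as json2sstable-style [hex-name, hex-value, ts, false] arrays."""
    ts = timestamp if timestamp is not None else int(time.time() * 1_000_000)  # microseconds
    return [[name.encode().hex(), value.encode().hex(), ts, False]
            for name, value in sorted(columns.items())]

row = to_sstable_columns({"Price": "1.49", "Section": "Vehicles"},
                         timestamp=1278132336497000)
print(row[0])  # -> ['5072696365', '312e3439', 1278132336497000, False]
```

Serializing that structure with a JSON library would emit the lowercase “false” literal seen in the earlier “HotWheelsCar” example.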

The Whole Export, Convert and Prep for Import Process

You may have noticed that the DivConq utilities ship with an “exportandimport.bat” file that performs all the steps covered so far.  Running this batch file should generate output like this.

C:\divconq\dotnet>exportandimport
22:38:39 Found expected organization in the "Orders" object.
22:38:39 Found expected organization in the "Order Details" object.
22:38:40 Completed OK.  Moved 2155 children and found 0 orphans.
22:38:40 WARNING: MergeAsName does not contain an [ID] or other macro.  This cou
ld lead to invalid JSON through duplicate keys in merged children!
22:38:40 Found expected organization in the "Orders" object.
22:38:40 Found expected organization in the "Employees" object.
22:38:41 Completed OK.  Moved 830 children and found 0 orphans.
22:38:41 WARNING: MergeAsName does not contain an [ID] or other macro.  This cou
ld lead to invalid JSON through duplicate keys in merged children!
22:38:42 Found expected organization in the "Orders" object.
22:38:42 Found expected organization in the "Customers" object.
22:38:43 Completed OK.  Moved 830 children and found 0 orphans.
22:38:43 WARNING: MergeAsName does not contain an [ID] or other macro.  This cou
ld lead to invalid JSON through duplicate keys in merged children!
22:38:43 Found expected organization in the "Orders" object.
22:38:43 Found expected organization in the "Products" object.
22:38:45 Completed OK.  Moved 2155 children and found 0 orphans.
22:38:45 WARNING: MergeAsName does not contain an [ID] or other macro.  This cou
ld lead to invalid JSON through duplicate keys in merged children!
22:38:46 Found expected organization in the "Orders" object.
22:38:46 Found expected organization in the "Suppliers" object.
22:38:48 Completed OK.  Moved 2155 children and found 0 orphans.
22:38:48 WARNING: MergeAsName does not contain an [ID] or other macro.  This cou
ld lead to invalid JSON through duplicate keys in merged children!
22:38:50 Found expected organization in the "Orders" object.
22:38:50 Found expected organization in the "Categories" object.
22:38:52 Completed OK.  Moved 2155 children and found 0 orphans.
22:38:53 Found expected organization in the "Orders" object.
22:38:55 Completed OK.  Moved 11620 children and found 0 orphans.
22:38:57 Completed OK.  Deleted 1 nodes.
22:38:59 Completed OK.  Deleted 9130 nodes.
22:39:01 Completed OK.  Deleted 6640 nodes.
22:39:02 Completed OK.  Deleted 43930 nodes.
22:39:03 Completed OK.  Deleted 4980 nodes.
22:39:08 Completed OK.  Did 39140 nodes.

This batch file and any of its commands can, of course, be modified to taste or to work with other datasets.

Importing Into Cassandra

If you simply run Cassandra’s JSON2SStable command you’ll see some short usage information.

C:\work\apache-cassandra-0.6.3>bin\json2sstable.bat
Missing required options: Kc
Usage: org.apache.cassandra.tools.SSTableImport -K keyspace -c column_family <json> <sstable>

…but please use the following procedure to properly import your JSON dataset.

First, shut down your Cassandra client and server (if started).  Then go into your Cassandra folder and open up your “conf\storage-conf.xml” file.  Add the following entry to this file and save.  (You can substitute your own name for “NorthwindOne” as long as you use it consistently below.)

<Keyspace Name="NorthwindOne">
  <ColumnFamily Name="Orders"
     CompareWith="UTF8Type"
     ColumnType="Super" CompareSubcolumnsWith="UTF8Type"
     />
   <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
   <ReplicationFactor>1</ReplicationFactor>
   <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
</Keyspace>

Once you’ve saved this file, start the Cassandra server again.  If your configuration changes were accepted, this will create a new, empty directory in your Cassandra server’s folder store.

You should also fire up the Cassandra client to check that your new keyspace is live.

C:\work\apache-cassandra-0.6.3>bin\cassandra-cli --host localhost
Starting Cassandra Client
Connected to: "Test Cluster" on localhost/9160
Welcome to cassandra CLI.

Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
cassandra> show keyspaces;
NorthwindOne
Keyspace1
system

Now, stop the Cassandra server again and shut down the Cassandra client again.  (The Cassandra client doesn’t respond well to the server going up and down.)

To properly invoke the JSON2SStable utility, use the following syntax, substituting the appropriate values and paths as necessary.

In the example below, “NorthwindOne” is the name of our keyspace and must match the value we saved into the “conf\storage-conf.xml” file above.  “Orders” is the name of the new column family we will be creating and inserting our native-formatted JSON into.  The path to the “JSONout13.txt” file is, of course, the file we’re importing.  Finally, the path to the “Orders-1-Data.db” file indicates which Cassandra data file we will create.  Note that this file does not yet exist, but the rest of the path (the folder structure) must already be in place.

C:\work\apache-cassandra-0.6.3>bin\json2sstable.bat -K NorthwindOne -c Orders C:\divconq\dotnet\JSONout13.txt C:\var\lib\cassandra\data\NorthwindOne\Orders-1-Data.db

If this works correctly, it will take a few seconds to silently import the data and will then silently return you to the command prompt.  If you see any other output from this command, you encountered an error.

Another way to quickly confirm that data was imported successfully is to eyeball the Cassandra data directory.  This should now contain three new files: Orders-1-Data.db, Orders-1-Filter.db and Orders-1-Index.db.

If you see these files, go ahead and fire up your Cassandra server and client again.

Working With Your Imported Data

Finally, it’s time to view the live data on the live Cassandra server. Try these commands first.

C:\work\apache-cassandra-0.6.3>bin\cassandra-cli --host localhost
Starting Cassandra Client
Connected to: "Test Cluster" on localhost/9160
Welcome to cassandra CLI.

Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
cassandra> show keyspaces;
NorthwindOne
Keyspace1
system
cassandra> get NorthwindOne.Orders['10778']['OrderInformation']
=> (column=ShippedDate, value=12/24/1997 12:00:00 AM, timestamp=1269842588093)
=> (column=ShipVia, value=1, timestamp=1269842588093)
=> (column=ShipRegion, value=, timestamp=1269842588093)
=> (column=ShipName, value=Berglunds snabbköp, timestamp=1269842588093)
=> (column=ShipCity, value=Luleå, timestamp=1269842588093)
=> (column=ShipAddress, value=Berguvsvägen  8, timestamp=1269842588093)
=> (column=RequiredDate, value=1/13/1998 12:00:00 AM, timestamp=1269842588093)
=> (column=OrderDate, value=12/16/1997 12:00:00 AM, timestamp=1269842588093)
=> (column=Freight, value=6.7900, timestamp=1269842588093)
Returned 9 results.
cassandra> get NorthwindOne.Orders['10778']
=> (super_column=OrderInformation,
     (column=Freight, value=6.7900, timestamp=1269842588093)
     (column=OrderDate, value=12/16/1997 12:00:00 AM, timestamp=1269842588093)
     (column=RequiredDate, value=1/13/1998 12:00:00 AM, timestamp=1269842588093)
     (column=ShipAddress, value=Berguvsvägen  8, timestamp=1269842588093)
     (column=ShipCity, value=Luleå, timestamp=1269842588093)
     (column=ShipName, value=Berglunds snabbköp, timestamp=1269842588093)
     (column=ShipRegion, value=, timestamp=1269842588093)
     (column=ShipVia, value=1, timestamp=1269842588093)
     (column=ShippedDate, value=12/24/1997 12:00:00 AM, timestamp=1269842588093)
)
=> (super_column=ItemEntry_1393,
     (column=Category_CategoryName, value=Seafood, timestamp=1269842588093)
     (column=Discount, value=0, timestamp=1269842588093)
     (column=ProductID, value=41, timestamp=1269842588093)
     (column=Product_ProductID, value=41, timestamp=1269842588093)
     (column=Product_ProductName, value=Jack's New England Clam Chowder, timesta
mp=1269842588093)
     (column=Product_SupplierID, value=19, timestamp=1269842588093)
     (column=Product_UnitsOnOrder, value=0, timestamp=1269842588093)
     (column=Quantity, value=10, timestamp=1269842588093)
     (column=Supplier_CompanyName, value=New England Seafood Cannery, timestamp=
1269842588093)
     (column=Supplier_Country, value=USA, timestamp=1269842588093)
     (column=Supplier_SupplierID, value=19, timestamp=1269842588093)
     (column=UnitPrice, value=9.6500, timestamp=1269842588093))
=> (super_column=Employee,
     (column=Country, value=USA, timestamp=1269842588093)
     (column=FirstName, value=Janet, timestamp=1269842588093)
     (column=LastName, value=Leverling, timestamp=1269842588093)
     (column=Title, value=Sales Representative, timestamp=1269842588093)
     (column=TitleOfCourtesy, value=Ms., timestamp=1269842588093))
=> (super_column=Customer,
     (column=CompanyName, value=Berglunds snabbköp, timestamp=1269842588093)
     (column=Country, value=Sweden, timestamp=1269842588093))
Returned 4 results.

You can pick other Order IDs and supercolumn values (e.g., “Customer”, “Employee”, various “ItemEntry_” values) to view those values too.

Next Steps

At this point you have the tools and documentation to not only import the Microsoft Northwind Access database into Cassandra, but similar databases as well. This concludes the “Migrate a Relational Database into Cassandra” series of articles.

The next set of articles will describe how to build a working application on top of Cassandra.

Troubleshooting

If you encounter errors during import, feel free to shut down the server and wipe all the data files from the keyspace’s data folder.

Also look for entries like this in the “C:\var\log\cassandra\system.log” file.  While this specific instance reflects a failed second import of extra data, such errors are really telling you that the “Trains-2-Data.*” files are useless but that the older “Trains-1-Data.*” files are still good.

 INFO [main] 2010-09-02 21:48:20,172 SSTableReader.java (line 120) Sampling index for C:\var\lib\cassandra\data\TransSchedTwo\Trains-1-Data.db
 INFO [main] 2010-09-02 21:48:20,177 SSTableReader.java (line 120) Sampling index for C:\var\lib\cassandra\data\TransSchedTwo\Trains-2-Data.db
ERROR [main] 2010-09-02 21:48:20,183 ColumnFamilyStore.java (line 182) Corrupt file C:\var\lib\cassandra\data\TransSchedTwo\Trains-2-Data.db; skipped
 

Posted in Cassandra, Intermediate, JSON, Northwind

 

Resolving JSON data types in Cassandra

02 Sep

In my previous entry I asked “Why does Apache Cassandra have data types?”  Now that the rationale has been covered, what does that mean for us?

At DivConq we like to move data around with JSON, as you can probably tell.  So the question arises: how do we want JSON data types to work with Cassandra?

Read the rest of this entry »

 

Posted in Cassandra, JSON, MUMPS

 

Migrate a Relational Database into Cassandra (Part III – Northwind Conversion)

16 Aug

This article describes how to convert a JSON document containing the complete contents of the Northwind Access database into Cassandra-friendly JSON.

Before proceeding, you should understand my previous “Part 2” article on “Northwind Planning” – this article implements that plan.

You should also have downloaded and installed either the .NET version or Mono version of the DivConq JSON command-line utilities, and should also have a complete JSON document from a conversion of the Northwind database export.  (You can also start with the “NorthwindJSON.txt” document from this archive.)

What To Do?

Copy the following batch file off into the folder that contains your DivConq command-line utilities and your Northwind JSON document.  Then open it using your favorite text editor and skip down to the “Understanding the Flow” section below.

@echo off
rem xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
rem X
rem X This batch file uses several divconq.com utilities. First it exports an Access DB
rem X called "Northwind" into a single large JSON structure.  Then it manipulates the
rem X JSON structure until it conforms to a denormalized form suitable for insertion
rem X into a Cassandra DB.  Finally, it imports the JSON structure into a Cassandra DB.
rem X
rem X August 11, 2010 - Jonathan Lampe - jonathan.lampe@divconq.com
rem X
rem xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
rem .
rem Here is the basic schema of the original DB
rem .
rem Categories
rem Customers
rem Employees
rem Order Details
rem   ProductID
rem   OrderID
rem Orders
rem   CustomerID
rem   EmployeeID
rem Products
rem   SupplierID
rem   CategoryID
rem Suppliers
rem .
rem xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
rem .
rem Here is the data schema represented with 1:N and N:1 relationships
rem
rem    Customer }                             { Supplier
rem    Employee } 1:N Orders N:N Products N:1 { Category
rem
rem ...where the N:N relationship is expressed in the "Order Details" table
rem
rem xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
rem .
rem Here is the denormalized, Cassandra-friendly schema that is constructed:
rem .
rem Keyspace=Northwind
rem   ColumnFamily=Orders
rem     RowKey=OrderID
rem       1 SuperColumn="OrderInformation"
rem         Columns=(contents from one Orders entry)
rem       1 SuperColumn="Customer"
rem         Columns=(contents from one Customers entry)
rem       1 SuperColumn="Employee"
rem         Columns=(contents from one Employees entry)
rem       0-N SuperColumns="ItemEntry_"+[NumericCount]
rem         Columns=(contents from one Order Details entry) +
rem                 (contents from one Product entry) +
rem                 (contents from one Supplier entry) +
rem                 (contents from one Category entry)
rem .
rem xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
rem .
rem Bring in 0-N item entries per order, each as a new supercolumn
consolidatejsonobjects NorthwindJSON.txt JSONout.txt Orders "Order Details" ItemEntry_[ID] "OrderID" yes yes 4
rem .
rem Bring in exactly one employee per order as a new supercolumn
distributejsonobjects JSONout.txt JSONout2.txt Orders "Employees" Employee "EmployeeID" yes yes 4
rem .
rem Bring in exactly one customer per order as a new supercolumn
distributejsonobjects JSONout2.txt JSONout3.txt Orders "Customers" Customer "CustomerID" yes yes 4
rem .
rem Bring in exactly one product entry per item entry as new columns in existing supercolumn
distributejsonobjects2 JSONout3.txt JSONout4.txt Orders ItemEntry_* Products Product_ ProductID yes yes 4
rem .
rem Bring in one supplier entry and one category entry per item entry as new columns in existing supercolumn
rem   (note these depend on previously imported Product entries)
distributejsonobjects2 JSONout4.txt JSONout5.txt Orders ItemEntry_* Suppliers Supplier_ Product_SupplierID yes yes 4
distributejsonobjects2 JSONout5.txt JSONout6.txt Orders ItemEntry_* Categories Category_ Product_CategoryID yes yes 4
rem .
rem Demote remaining rowkey-columns into a new supercolumn
rem .
demotejsonpairsintonewobject JSONout6.txt JSONout7.txt Orders OrderInformation yes yes 4

Understanding the Flow

The top part of the batch file simply reiterates what we planned in the previous article.  The end is where all the processing actually takes place.  Interspersed among the “rem” lines are seven JSON conversions, each performing a single piece of the consolidation.

Step 1: We open the list of “Order Details” and pull in zero, one or multiple entries as new “ItemEntry_[ID]” nodes.

Steps 2 and 3: We reach out and import exactly one Customer node and one Employee node into our master Order record.

(Note that Steps 1, 2 and 3 can really occur in any order because these three entities are completely independent.  However, “Order Details” must have been imported before Step 4 can take place.)

Step 4: We reach out and import the contents of exactly one Product node into each existing ItemEntry node by looking up ProductIDs from existing ItemEntry name/value pairs.

(Note that Step 4 – the Product import – must take place before Steps 5-6 can take place.)

Steps 5-6: We reach out and import the contents of exactly one Category node and exactly one Supplier node into each existing ItemEntry node by looking up CategoryIDs and SupplierIDs from recently imported ItemEntry > Product information.

Step 7: Consolidate the remaining entries under the main Order entry into a new OrderInformation node.

(Step 7 – the consolidation of hanging entries – should always go last in case previous steps depend on keys in their original positions.)
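Conceptually, steps 2 and 3 are simple foreign-key lookups. A hypothetical Python sketch of the “distribute” operation (an illustration of the idea, not the DivConq implementation) looks like this, using the EmployeeID/Janet Leverling pairing that later shows up in the imported data:

```python
def distribute(orders, foreign_table, merge_as, fk_name):
    """Copy exactly one foreign row into each order, keyed on a foreign key (N:1)."""
    for order in orders.values():
        foreign = foreign_table.get(order.get(fk_name))
        if foreign is not None:
            order[merge_as] = dict(foreign)   # e.g. order["Employee"] = {...}

orders = {"10778": {"EmployeeID": "3", "CustomerID": "BERGS"}}
employees = {"3": {"FirstName": "Janet", "LastName": "Leverling"}}
distribute(orders, employees, "Employee", "EmployeeID")
print(orders["10778"]["Employee"]["FirstName"])  # -> Janet
```

Steps 4-6 work the same way, except that the looked-up columns are merged into an existing ItemEntry node (with a prefix like “Product_”) instead of becoming a new supercolumn.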

To see complete sample output from each step of the conversion (as generated by the .NET utilities), unpack the archive here.

Running the Batch File

You may need to change the name of the original “NorthwindJSON.txt” file and/or may want to change the name of the final output file (from “JSONout7.txt”), but other than that the batch file should be ready to run.  To capture log and error output so it’s easier to read, use this command:

D:\divconq\dotnet>ConsolidateNorthwindJSON.bat > log.txt 2> err.txt

Next Steps

The next article in this series shows how to import the complete JSON document into Cassandra.

More About The Utilities

ConsolidateJSONObjects is used to dive into a foreign table used to express an N:N relationship from the point of view of one entity in the relationship.  It generates 0-N supercolumn-level JSON nodes for each foreign node discovered.  This utility is used once during the Northwind conversion to bring in 0-N item entries per order.

consolidatejsonobjects NorthwindJSON.txt JSONout.txt Orders "Order Details" ItemEntry_[ID] "OrderID" yes yes 4

DistributeJSONObjects and DistributeJSONObjects2 are used to select foreign keys in the main table and look them up against a foreign table: they work N:1 relationships and bring in exactly one node for each foreign key scanned (assuming DB consistency, of course).  The difference between the two is that DistributeJSONObjects imports discovered nodes as supercolumn-level nodes, while DistributeJSONObjects2 imports the contents of discovered nodes as column-level name/value pairs under an existing supercolumn-level node.  These utilities are used five times during the Northwind conversion: two invocations of DistributeJSONObjects to import 1-per-order Employee and Customer nodes as supercolumns, and three invocations of DistributeJSONObjects2 to import the contents of 1-per-itementry Product, Supplier and Category nodes.

rem Bring in exactly one employee per order as a new supercolumn
distributejsonobjects JSONout.txt JSONout2.txt Orders "Employees" Employee "EmployeeID" yes yes 4
rem .
rem Bring in exactly one customer per order as a new supercolumn
distributejsonobjects JSONout2.txt JSONout3.txt Orders "Customers" Customer "CustomerID" yes yes 4
rem .
rem Bring in exactly one product entry per item entry as new columns in existing supercolumn
distributejsonobjects2 JSONout3.txt JSONout4.txt Orders ItemEntry_* Products Product_ ProductID yes yes 4
rem .
rem Bring in one supplier entry and one category entry per item entry as new columns in existing supercolumn
rem   (note these depend on previously imported Product entries)
distributejsonobjects2 JSONout4.txt JSONout5.txt Orders ItemEntry_* Suppliers Supplier_ Product_SupplierID yes yes 4
distributejsonobjects2 JSONout5.txt JSONout6.txt Orders ItemEntry_* Categories Category_ Product_CategoryID yes yes 4

DemoteJSONPairsIntoNewObject is simply used to clean up lingering columns (name/value pairs) at the “row” level and put them into a new supercolumn-level object.   One invocation of this utility is used during the Northwind conversion.

demotejsonpairsintonewobject JSONout6.txt JSONout7.txt Orders OrderInformation yes yes 4

All utilities will display additional help, including complete documentation for each parameter, at the command line if invoked without any command-line parameters.  (For example, simply execute “consolidatejsonobjects.exe” to find out what the eighth parameter – “yes” in the examples above – means.)

Troubleshooting

If you are experimenting with these utilities and data sets of any size it is highly recommended that you put all your commands in a batch file.  Then, from the command prompt, run your batch file like this:

D:\divconq\dotnet>divconq_test23.bat > log.txt 2> err.txt

…to capture logging output and error output in two separate files.

 

Posted in Cassandra, Intermediate, JSON, Northwind

 

Utility to Denormalize JSON Structures by Consolidating Two Object Trees

28 Jul

This article is about a DivConq utility that denormalizes JSON structures by consolidating two object trees at a time.

If you need to learn about JSON first, start here.  If you’re not sure why you’d want to denormalize JSON structures, read this article first.

Downloading and Installing

This utility was written in C# and cross-compiled for .NET and Mono.  The two compilations are identical except:

  • The .NET compilation supports console text colors – the Mono compilation does not
  • The Mono compilation does not (yet) support “utf32” encoding

To get the .NET executables, download this URL:
http://www.divconq.com/wp-content/uploads/2013/05/JSONUtils_dotnet_v1001.zip

To get the Mono executables, download this URL:
http://www.divconq.com/wp-content/uploads/2013/05/JSONUtils_mono_v1001.zip

After downloading your chosen Zip file, unpack it into a local directory and run your choice of “dotnet” or “mono” executable (each is unpacked to a folder of that name).

Command Syntax

ConsolidateJSONObjects v1.0 - July 26, 2010
  Denormalizes/distributes one JSON object into another.
  Please visit http://www.divconq.com for more information.
  This utility uses the Procurios C#.NET JSON library under the MIT license.

  Usage:
   ConsolidateJSONObjects InputFile OutputFile MergeIntoObject MergeFromObject
     MergeAsName ForeignKey [DeleteFrom] [FormatHack] [DebugLevel]
     ...where:
       InputFile = name or path of input file containing original JSON
       OutputFile = name or path of output file to write final JSON into
       MergeIntoObject = name of JSON object into which object will be merged
       MergeFromObject = name of JSON object to read data from for merge
       MergeAsName = name template of new grandchild objects created in the
           MergeIntoObject - this parameter uses a [ID] macro, e.g., car_[ID]
       ForeignKey = name of column which contains the foreign key value used
           to connect MergeFromObject children with MergeToObject children
       DeleteFrom = Set to NO to preserve MergeFromObject in OutputFile
           Default is YES, which causes MergeFromObject to be deleted.
       FormatHack = Set to NO to prevent utility from injecting CRLF to
           make the output prettier.  Default is YES.
       DebugLevel = Set to 1 to see only warnings, set to 0 to hide those
           too.  Default is 2.  Set to 3 to see a little more information.

  Example: (the following two invocations do exactly the same thing)
   ConsolidateJSONObjects forestandtrees.txt justforests.txt forests
     famoustrees famoustree_[ID] forestID
   ConsolidateJSONObjects forestandtrees.txt justforests.txt forests
     famoustrees famoustree_[ID] forestID YES YES 2

Short Example of Use

Now is a good time to go back and re-read the “Migrate a Relational Database Structure into a NoSQL Cassandra Structure (Part I)” article.  This is exactly what we’re going to do with the ConsolidateJSONObjects utility in this example.

To get started, download the input data file containing the “forests” and “famoustrees” object trees from this URL.  To conserve keystrokes,  save this file in the same directory as either your mono or dotnet edition of ConsolidateJSONObjects.exe.

Next, open a console or terminal and CD into the directory with your  “ConsolidateJSONObjects.exe” executable and the “forestandtrees.txt” file you just downloaded.  Run “ConsolidateJSONObjects.exe” without any arguments – expect to see the console help for the utility displayed.

Now let’s try it again with all six of the required arguments filled out:

  • Input and output JSON files: pretty self-explanatory
  • The “MergeInto” object is where we want all our data to end up.  In this case it’s the “forests” object.
  • The “MergeFrom” object is the object we want to collapse.  In this case it’s the “famoustrees” object.
  • “MergeAsName” is completely up to you: it determines the name/key of the new object added to each child of the MergeInto object.  Unless you’re merging “1:1” object relationships, this value should contain the string “[ID]”, which will help ensure that object names/keys remain unique in their parent nodes.
  • “ForeignKey” is the name/key of existing grandchildren of the MergeFrom object and is used to link the children of the MergeInto object with the children from the MergeFrom object

The expected output from this command is shown below.  Notice the one warning – this indicates that there was a “famoustree” that could not be matched up to a specific “forest”.

C:\jsontools\dotnet>consolidatejsonobjects forestandtrees.txt JSONout.txt forests famoustrees famoustree_[ID] forestID
00:10:38 Found expected organization in the "forests" object.
00:10:38 Found expected organization in the "famoustrees" object.
00:10:38 WARNING: tree23456 Foreign Key=forest002 not found in target JSON struc
ture!
00:10:38 Completed OK. Moved 3 children and found 1 orphans.

Open the “JSONout.txt” file to view the output.  You’ll notice that it’s not as pretty as my hand-formatted input, but the utility revalidates output JSON before writing it so at least you know it’s real JSON.

Take a close look at the “Black Forest” forest entry and notice that it’s grown two “famoustree_*” children, each with a unique name derived from its unique key in the old famoustrees object.  Also notice that the whole famoustrees object is missing from the output – even the entry that generated the warning because it couldn’t be matched was deleted.

If you open the original “forestandtrees.txt” JSON file you should be able to find the source of the warning by searching for the value “tree23456”.  You can practice taking out this node or adding a stub for the proper forests entry to see how the utility responds, or test it with other data of your choice.
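The consolidation itself can be sketched in Python (again, an illustration of the idea rather than the utility’s code): children of the MergeFrom object are filed under the MergeInto child their foreign key names, and unmatched children are counted as orphans. The “tree23456”/“forest002” pair comes from the warning above; the other keys and values are invented for the sketch.

```python
def consolidate(merge_into, merge_from, merge_as_name, foreign_key):
    """File each merge_from child under the merge_into child its foreign key names."""
    orphans = 0
    for child_id, child in merge_from.items():
        parent = merge_into.get(child.get(foreign_key))
        if parent is None:
            orphans += 1      # the real utility logs a WARNING for these
        else:
            parent[merge_as_name.replace("[ID]", child_id)] = child
    return orphans

forests = {"forest001": {"name": "Black Forest"}}
famoustrees = {"tree12345": {"forestID": "forest001", "species": "oak"},
               "tree23456": {"forestID": "forest002", "species": "elm"}}
print(consolidate(forests, famoustrees, "famoustree_[ID]", "forestID"))  # -> 1
```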

Once you are comfortable with the way the basic six parameters work and consolidation happens, move on and experiment with the three optional parameters.

First, add a value of "NO" to the end of your six values in the seventh slot (the "DeleteFrom" parameter) to prevent the MergeFrom object (e.g., "famoustrees") from being deleted.  You might use this in a denormalization scenario where you need to consolidate multiple sets of objects (through multiple invocations of this utility) and you need to keep the MergeFrom object around until the final merge.

Next, add a second value of "NO" in the eighth slot (the "FormatHack" parameter) to prevent the utility from trying to add some whitespace and line feeds to your JSON output.  You would use this if the utility's attempts to make the JSON output pretty were messing up the data in the JSON output or yielding invalid JSON.  (The format hack also incurs a small, usually negligible performance hit; you could use this flag to avoid that too.)

Finally, try different values from 0-3 in the ninth slot (the “DebugLevel” parameter).  Smaller values yield less output and the utility is currently not coded to care about any value larger than 3.
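To make the six required parameters and the seventh "DeleteFrom" flag concrete, here is a minimal Python sketch of the consolidation the utility performs. The function name and its details (such as leaving the ForeignKey tag on moved children) are my own assumptions for illustration, not the utility's actual source.

```python
def consolidate(doc, merge_into, merge_from, merge_as_name, foreign_key,
                delete_from=True):
    """Move each child of merge_from under the merge_into child named by its
    foreign_key value, renaming it by substituting its old key for "[ID]"."""
    orphans = 0
    for child_key, child in doc[merge_from].items():
        parent = doc[merge_into].get(child.get(foreign_key))
        if parent is None:
            orphans += 1  # mirrors the utility's "Foreign Key not found" warning
            continue
        parent[merge_as_name.replace("[ID]", child_key)] = child
    if delete_from:
        del doc[merge_from]  # a seventh parameter of "NO" would skip this step
    return orphans

doc = {
    "forests": {"forest001": {"name": "Black Forest"}},
    "famoustrees": {
        "tree11111": {"forestID": "forest001", "species": "oak"},
        "tree23456": {"forestID": "forest002", "species": "elm"},  # the orphan
    },
}
orphans = consolidate(doc, "forests", "famoustrees", "famoustree_[ID]", "forestID")
```

With this input, the Black Forest entry grows a "famoustree_tree11111" child, the unmatched "tree23456" is counted as an orphan but dropped, and the "famoustrees" object disappears – the same behavior described above.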

Next Steps

The next article in this series will show you how to use this utility and others like it to prepare JSON output from the Northwind relational database for import into Cassandra.

 

Posted in Intermediate, JSON

 

Introduction to Cassandra Columns, Super Columns and Rows

21 Jul

This article provides new users the basics they need to understand Cassandra’s “column / super column / row” data model.

Though the focus is not on mechanics, this article assumes you are familiar with adding columns to and requesting data from existing keyspaces on Cassandra. If not, please see my earlier article on that topic.  Knowledge of JSON is also important to understand the data examples below – if you need help here, please see my earlier article on that topic.

Remember that a Cassandra column is basically a “name=value” pair* (e.g., “color=red”).  You can use multiple columns to represent data such as:

    "Price" : "29.99",
    "Section" : "Action Figures"

As you may have seen in my previous article, multiple columns can also be grouped into Cassandra "rows" to handle data such as:

{
  "Transformer" : {
    "Price" : "29.99",
    "Section" : "Action Figures"
  },
  "GumDrop" : {
    "Price" : "0.25",
    "Section" : "Candy"
  },
  "MatchboxCar" : {
    "Price" : "1.49",
    "Section" : "Vehicles"
  }
}

The keys used to group related columns into rows in this example were “Transformer”, “GumDrop” and “MatchboxCar”.

In my earlier article, we looked at this same data stored in Cassandra. In JSON, that keyspace->column family->row->column data structure would be represented like this:

{
  "ToyStore" : {
    "Toys" : {
      "GumDrop" : {
        "Price" : "0.25",
        "Section" : "Candy"
      },
      "Transformer" : {
        "Price" : "29.99",
        "Section" : "Action Figures"
      },
      "MatchboxCar" : {
        "Price" : "1.49",
        "Section" : "Vehicles"
      }
    }
  },
  "Keyspace1" : null,
  "system" : null
}

If you wanted to add other types of unrelated collections of information (e.g., "BugCollection" or "PaintColors"), you'd simply keep adding new keyspaces for each new collection.  However, if you needed to keep track of similar collections of data (e.g., your Ohio and New York toy stores instead of a single toy store), you'd need to turn to a different kind of Cassandra element: the "super column".

To see super columns in action, inspect this keyspace->column family->row->super column->column data structure as it appears in JSON:

{
  "ToyCorporation" : {
    "ToyStores" : {
      "Ohio Store" : {
        "Transformer" : {
          "Price" : "29.99",
          "Section" : "Action Figures"
        },
        "GumDrop" : {
          "Price" : "0.25",
          "Section" : "Candy"
        },
        "MatchboxCar" : {
          "Price" : "1.49",
          "Section" : "Vehicles"
        }
      },
      "New York Store" : {
        "JawBreaker" : {
          "Price" : "4.25",
          "Section" : "Candy"
        },
        "MatchboxCar" : {
          "Price" : "8.79",
          "Section" : "Vehicles"
        }
      }
    }
  }
}


Given where they appear in the structure, you might expect that "Ohio Store" and "New York Store" would be super columns that span multiple rows.  However, the opposite is true: "Ohio Store" and "New York Store" are now the row keys, and entries like "Transformer", "GumDrop" and "MatchboxCar" have become super column keys.

Like column keys, super column keys are indexed and sorted by a specific type (e.g., "UTF8Type", "AsciiType", "LongType", "BytesType", etc.).  However, like row keys, super column entries have no values of their own; they simply collect other columns.

Notice that the keys of the two groups of super columns do not match: {"Transformer", "GumDrop", "MatchboxCar"} is not the same set as {"JawBreaker", "MatchboxCar"}.  This is not an error: super column keys in different rows do not have to match and often will not.
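Because keyspaces, column families, rows, and super columns all nest like name/value structures, you can model the example above with plain nested dictionaries. This abridged Python sketch (my own illustration, not Cassandra client code) shows row keys holding super columns whose keys differ from row to row:

```python
# Keyspace -> column family -> row -> super column -> column, as nested dicts.
toy_corporation = {
    "ToyCorporation": {
        "ToyStores": {
            "Ohio Store": {                      # row key
                "Transformer": {"Price": "29.99", "Section": "Action Figures"},
                "MatchboxCar": {"Price": "1.49", "Section": "Vehicles"},
            },
            "New York Store": {                  # row key
                "JawBreaker": {"Price": "4.25", "Section": "Candy"},
                "MatchboxCar": {"Price": "8.79", "Section": "Vehicles"},
            },
        }
    }
}

rows = toy_corporation["ToyCorporation"]["ToyStores"]
ohio_keys = set(rows["Ohio Store"])      # super column keys in one row...
ny_keys = set(rows["New York Store"])    # ...differ from those in another
ny_car_price = rows["New York Store"]["MatchboxCar"]["Price"]
```

Note that the same super column key ("MatchboxCar") can appear in both rows while each row also carries keys the other lacks.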

In a future article I will describe how to pass JSON structures that describe data in either row/column or row/super column/column format into and out of Cassandra – stay tuned!

* = We’ll ignore the timestamp element of Cassandra columns for now.  These timestamps are used to reconcile updates from multiple nodes, but don’t worry about that until you understand the whole column/supercolumn thing first.

 

Posted in Beginner, Cassandra, JSON

 

Export a Microsoft Access Database to JSON (Northwind Example)

15 Jul

This article shows how to use a command-line utility to export an existing Microsoft Access database to a text file containing a single JSON structure.

If you do not know what JSON is, please see my earlier article on that subject.

The Microsoft Access database we are going to export today is the (in)famous "Northwind" example database, to which millions of Microsoft programmers have been exposed.

Download and Install the “Northwind” Access Database

First, download a copy of the Northwind Access database from here.

Save the ZIP file and then unpack the “Northwind.mdb” file into a local directory on your Windows file system.  (Optional: open the database with Microsoft Access – note that the utility described below doesn’t need Microsoft Access to work.)

Now open your Windows Control Panel, get into “Administrative Tools” and then open the “Data Sources (ODBC)” panel.   Go to the “System DSN” tab and add a new “Microsoft Access Driver (*.mdb)” entry.  Select your local copy of “Northwind.mdb” when prompted and name your DSN (“Data Source Name”) something like  “NorthwindTest”.

Download and Install the “Access to JSON” Export Utility

We wrote a command-line utility to help you export your Access database as a single piece of JSON.  Download a copy of this here.

Save the ZIP file and then unpack the enclosed files into a local directory on your Windows file system.    To make sure the utility unpacked OK, please open a command prompt, CD into the directory where the executables are and then run “exportaccessdbtojson” from the command line. The expected output is shown below.

C:\utilities>exportaccessdbtojson

ExportAccessDBToJSON v1.0.0.2 -  August 6, 2010
Please visit http://www.divconq.com for more information.

Usage:
ExportAccessDBToJSON NameOfDSN [-JustJSON] [ID4Tables] [> OutputFilename]
…where ID4Tables is a comma-delimited list of tables to which you
want the utility to add an incremented AutoID field.

Example:
To export an existing Access database defined to your system
with an ODBC DSN of “Northwinds” into a file that will contain
a mix of log messages and JSON output, use the following command:
ExportAccessDBToJSON Northwinds > JSONandLog.txt

To export only the JSON output in the output file, use this instead:
ExportAccessDBToJSON Northwinds -Quiet > JustJSON.txt

Export The Access Database (With Log Output)

Start by trying to execute the following command (with the name of your actual DSN in place of "NorthwindTest"):

C:\utilities>exportaccessdbtojson NorthwindTest

If all is well you will see many lines of text stream by ending with something like this:

"Phone" : "(514) 555-2955",
"Fax" : "(514) 555-2921",
"HomePage" : ""
}
}
}=-=-=-= END JSON =-=-=-=
22:00:26 Completed OK.

If not, you may see an error like this:

C:\utilities>exportaccessdbtojson BadNorth
22:01:02 ERROR: Could not connect to DSN=BadNorth! System.Data.Odbc.OdbcExce
ption: ERROR [IM002] [Microsoft][ODBC Driver Manager] Data source name not found
and no default driver specified

If things are going poorly, please see the “Troubleshooting” section below.  However, if things went well, please proceed to the next section.

Export The Access Database As A Single JSON File

Now, let's turn off the extraneous log messages and the superfluous "START JSON" and "END JSON" tags.   Let's also capture the output into a file so we can examine it with our favorite text editor.

Use these commands to perform the export and view the output, again substituting in the correct name of your DSN and any output filename you wish:

C:\utilities>exportaccessdbtojson NorthwindTest -justjson > NorthwindJSON.txt

C:\utilities>notepad NorthwindJSON.txt

If you would like to just see the expected JSON output from the Northwind database, please download the Zip file that contains that data as “NorthwindJSON.txt” here.

Next Steps

The next article talks about how to fit a simple relational database structure into the form needed by Cassandra today. The article after that expands on the concept to rearrange the Northwind JSON structures you just exported into a Cassandra-friendly format. Then, in a future article, I will show how to import the converted JSON into Cassandra.  Stay tuned!

How the Utility Works

This utility connects to the named Access database, lists all the tables, assumes the first column is the key in each table, and then exports the complete contents of each table as a series of string-based name/value pairs, collected row by row under the value of the key column.  The "ID4Tables" argument is used to add unique IDs to tables that lack them: e.g., the "Order Details" table in Northwind.  JSON structures created will include escaped special characters (e.g., "\\" to represent a backslash) but are not passed through a full JSON validation step before they are written, so they are not 100% guaranteed to work in your JSON parser.  (These structures do, of course, work with the Northwind Access database.)
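The row-by-row algorithm just described can be sketched in a few lines of Python. This sketch uses the standard sqlite3 module as a stand-in for the real utility's ODBC connection; the function name and the choice to drop the key column from each row's value set are illustrative assumptions, not the utility's actual behavior.

```python
import json
import sqlite3

def export_db_to_json(conn):
    """List every table, treat each table's first column as the row key, and
    emit all remaining fields as string name/value pairs keyed by row."""
    doc = {}
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    for table in tables:
        cur = conn.execute(f'SELECT * FROM "{table}"')
        cols = [d[0] for d in cur.description]  # cols[0] assumed to be the key
        doc[table] = {
            str(row[0]): {c: str(v) for c, v in zip(cols[1:], row[1:])}
            for row in cur
        }
    return json.dumps(doc, indent=2)

# Tiny stand-in database with one Northwind-style table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Shippers (ShipperID INTEGER, CompanyName TEXT)")
conn.execute("INSERT INTO Shippers VALUES (1, 'Speedy Express')")
output = export_db_to_json(conn)
```

The result is one JSON document with a top-level object per table, keyed row by row, which mirrors the shape of the Northwind export above.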

This utility can be used on databases other than the Northwind sample database but has not been extensively tested against many others – your mileage may vary.

Troubleshooting

Most problems with this process are caused by typos in your command-line invocation or in the DSN name in your ODBC settings.  Also make sure you set up a System DSN, not a User DSN.

Also, please avoid DSNs with spaces or other special characters.   The utility wasn't tested against these.

 
Comments Off

Posted in Beginner, JSON, Northwind

 

What is “JSON” and what does it have to do with distributed computing?

02 Jul

"JSON" stands for "JavaScript Object Notation" and is an efficient way to transfer complex information about specific entities between two separate programs.

As the “JavaScript” name implies, JSON is often used to transfer information between JavaScript-interpreting web browsers and JSON-aware web applications.  In fact, native understanding of JSON is now built into most web browsers’ JavaScript interpreters.

However, JSON is also implemented (as of June 2010) in about 50 other languages using more than 100 tested libraries.  Furthermore, the specification is short and elegant enough for any reasonably competent programmer to build his or her own JSON parser in relatively short order.

JSON can be used to represent primitive data types, such as one integer (e.g., “1″) or one string (e.g., “fred”), but JSON is more frequently used to represent complex data including specifically named attributes (e.g., “color=blue, ID=342…”), arrays of values (e.g., “7,6,8…”) or even arrays of attributes.

Let’s take a look now at a simple JSON sample.  This is a record that describes a complex data structure: a person.

{
  "firstName": "Elwood",
  "lastName": "Blues", 
  "SSN": "111-11-1111",
  "address": {
    "addressLine1": "1060 West Addison St.",
    "addressLine2": "Suite 1313",
    "city": "Chicago",
    "state": "IL",
    "postalCode": "60613"
  },
  "autoViolations": [
    { "type": "speeding", "amount": "145.25" },
    { "type": "parking", "amount": "15.00" }
  ]
}

There are three simple objects ("firstName", "lastName", "SSN"), each of which is made up of a name/value pair of strings.   There is also a complex object ("address"), itself made up of five simple objects, and a second complex object ("autoViolations"), which contains an array of similar objects.
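To see how directly this maps into a programming language, here is the same record (abridged) parsed with Python's standard json module; note that the SSN must be a quoted string for the document to be valid JSON:

```python
import json

record = json.loads("""
{
  "firstName": "Elwood",
  "lastName": "Blues",
  "SSN": "111-11-1111",
  "address": {
    "addressLine1": "1060 West Addison St.",
    "city": "Chicago",
    "state": "IL"
  },
  "autoViolations": [
    { "type": "speeding", "amount": "145.25" },
    { "type": "parking", "amount": "15.00" }
  ]
}
""")

city = record["address"]["city"]  # a simple member of a complex object
total_fines = sum(float(v["amount"]) for v in record["autoViolations"])
```

The complex objects become dictionaries and the array becomes a list, with no mapping layer in between.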

Veterans of XML will note that this same data structure could easily be represented in XML, but JSON advocates favor JSON over XML because:

  • JSON does more with fewer characters (e.g., XML “<repeats></repeats>” the names of keys)
  • JSON maps more cleanly into major programming languages than XML (almost all languages have strings, numbers and object/class and array data structures)
  • JSON does not stray into the “can also be used to display data” camp like XML does (with XSLT transformation) because JSON is not a document markup language

JSON has a couple of years to go before it catches up with XML in popularity among the general programming community, but JSON has clearly leapt out in front with cutting-edge programmers working on state-of-the-art presentation frameworks such as Qooxdoo or highly distributed databases such as Cassandra.

As these better presentation frameworks and more distributed computing concepts work their way into the mainstream, JSON’s use and relevance will only increase.

 

Posted in Cassandra, Introduction, JSON, XML