
Monthly Archives: July 2010

Cloud Escrow: The Ability to Choose and Change Your Deployment Model

In a recent post on The Register entitled “The cloud’s impact on security”, Tony Lock provides a definition for the groundbreaking concept of “Cloud Escrow”.

“…if you are using external cloud resources, look at how the data and any intellectual property invested in the processing engines employed to manipulate data can be moved to other third party cloud providers, or back into the enterprise, if you need to do that. You could call this ‘Cloud Escrow’.”

Readers of DivConq and other cloud technology blogs are probably already familiar with the term “Cloud Portability” – the ability to move cloud applications and data between different cloud providers or to receive the same services from multiple cloud providers at once.

However, what Lock does with his “Cloud Escrow” definition is remind people that the ability for companies to redeploy entire sets of cloud-deployed applications or data back into company-owned systems or private clouds is extremely important in case:

  • a merger or divestiture impacts IT service delivery
  • a regulatory change or legal ruling requires quick action
  • currently contracted cloud vendors are acquired by a questionable owner or are unable to meet their service level agreements (SLAs) with current ownership

DivConq applauds Lock’s contribution of “Cloud Escrow” to the ongoing discussions being held at every level about the appropriate way to deploy resources into the cloud.   The answer, as always, is to have a realistic fallback plan in case conditions change in a hurry.

 

Posted by on 2010-Jul-07 in Cloud, Regulation


Migrate a Relational Database into Cassandra (Part II – Northwind Planning)

This article describes how to fit a relational database structure containing an N:N relationship and several N:1 relationships into a Cassandra-friendly structure.

You should be familiar with the concepts in my “Migrate a Relational Database Structure into a NoSQL Cassandra Structure (Part I)” article before proceeding.

In this article we are going to examine the “Northwind” database shipped as an example with Microsoft Access.  Having a local copy of this database is not needed to understand this article, but if you are interested in obtaining a copy (and/or want to see how to extract it to JSON), visit this article and then return.

Northwind Database Schema

The relationships of the various tables that make up the Northwind Access database can be summed up like this:

Customer }                            { Supplier
Employee } 1:N Order N:N Products N:1 { Category

In other words:

  • Every order is tied to a single customer and a single employee
  • Every product is tied to a single supplier and a single category
  • Every order can have a variety of different products
  • Every product can be part of a variety of different orders

Notice that I’m ignoring the “N” part of the customer, employee, supplier and category relationships for now.  I’m also ignoring the “Order Details” table that provides the links necessary to implement the “N:N” relationship depicted above.

Instead, I’m focusing on the important “N:N” relationship between order and product.  In a relational database setting I can often simply normalize away and worry about which table I need to focus my indexing efforts on later when I’m working in the application.  However, in NoSQL, non-relational database design, we often need to decide up front which entity most queries will be interested in and build everything else around that entity.

So…will it be “order” or “product”?  Today I’ll decide that the key entity in this database is “order” – customers will be hitting this on a daily, per transaction basis whereas I can probably run my product reports offline.

Sketching Out a Viable Cassandra Schema for the Northwind Database

OK, so “order” is our key element.  Let’s see how we can squeeze that into a Cassandra-friendly data structure.  First, let’s review Cassandra’s keyspace->column family->row->super column->column data structure.  (If you need more of a refresher, see this.)

{
  "keystore" : {
    "column family" : {
      "row key" : {
        "supercolumn key" : {
          "column name" : "value"
        }
      }
    }
  }
}

Remember that the first two levels of this structure (keyspace->column family) will essentially be burned up by the name of the database and the name of the table.  That leaves us one level (row) for each “order” entry and one level (supercolumn) for each subcomponent of each “order” entry (such as a block of employee information).  The remaining level (column) has to be used for the name/value pairs that contain our information.

That’s a tight fit, but not an impossible one.  Let’s get started on the side of the relational diagram that only contains N:1 relationships.

Customer }
Employee } 1:N Order

Again, we’re going to ignore the fact that a single customer can be tied to multiple orders and that a single employee can also be tied to multiple orders.  Instead, we’re going to focus on the fact that a single order has one customer and one employee – period.

Here’s a data structure that accommodates that line of thinking.

{
  "Northwind" : {
    "Orders" : {
      "[OrderID]" : {
        "OrderInformation" : {
          "column name" : "value"
        },
        "Employee" : {
          "column name" : "value"
        },
        "Customer" : {
          "column name" : "value"
        }
      }
    }
  }
}

So far, so good.  But where did that extra “OrderInformation” entry come from?  The answer is that we demoted the original information from the original order record so that column name/value pairs are always at the same level in our Cassandra database.  (More on that.)
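If it helps to see that demotion in code, here is a minimal Python sketch of the idea.  The row dictionaries and field names below are hypothetical stand-ins for whatever your relational export actually produces – only the shape of the result matters.

# Hypothetical relational rows for one order and its related customer and employee.
order_row    = {"OrderID": "10248", "OrderDate": "1996-07-04", "Freight": "32.38"}
customer_row = {"CompanyName": "Vins et alcools", "City": "Reims"}
employee_row = {"LastName": "Buchanan", "Title": "Sales Manager"}

def order_to_supercolumns(order, customer, employee):
    """Demote the order's own fields under an 'OrderInformation' super column
    so that every name/value pair ends up at the same (column) level."""
    order_id = order["OrderID"]
    return {
        order_id: {
            "OrderInformation": {k: v for k, v in order.items() if k != "OrderID"},
            "Employee": dict(employee),
            "Customer": dict(customer),
        }
    }

northwind = {"Northwind": {"Orders": order_to_supercolumns(order_row, customer_row, employee_row)}}
print(northwind)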

Tackling the N:N Relationship

Now what should we do about the “N:N” relationship from orders to products?

                       { Supplier
Order N:N Products N:1 { Category

We could call each N:N Order:Product relationship an “ItemEntry” and consider doing something like this:

{
  "Northwind" : {
    "Orders" : {
      "[OrderID]" : {
        "[ItemEntry_ID]" : {
          "ItemInformation" : {
            "column name" : "value"
          },
          "Product" : {
            "column name" : "value",
            "Category" : {
              "column name" : "value"
            },
            "Supplier" : {
              "column name" : "value"
            }
          }
        }
      }
    }
  }
}

…but that would be illegal because it nests up to two levels deeper than Cassandra allows (e.g., along the “Product > Category” branch).

Instead we need a structure that looks more like this:

{
  "Northwind" : {
    "Orders" : {
      "[OrderID]" : {
        "[ItemEntryID]_ItemInformation" : {
          "column name" : "value"
        },
        "[ItemEntryID]_Product" : {
          "column name" : "value"
        },
        "[ItemEntryID]_ProductSupplier" : {
          "column name" : "value"
        },
        "[ItemEntryID]_ProductCategory" : {
          "column name" : "value"
        }
      }
    }
  }
}

…or this:

{
  "Northwind" : {
    "Orders" : {
      "[OrderID]" : {
        "[ItemEntryID]" : {
          "ItemInformation_column1" : "value",
          "ItemInformation_column2" : "value",
          "Product_column1" : "value",
          "Product_column2" : "value",
          "ProductSupplier_column1" : "value",
          "ProductSupplier_column2" : "value",
          "ProductCategory_column1" : "value",
          "ProductCategory_column2" : "value"
        }
      }
    }
  }
}

It looks good on paper, but there’s an extra problem with the “Order Details” table we’re using to provide the N:N relationship in the Northwind database: there are no unique IDs on the Order Detail entries.  With that in mind, we’ll need to take special care during export/consolidation of legacy Order Detail data, and with new entries in the future, to automatically assign a unique ID to each Order Detail as we add it.  (This isn’t a big deal – we could either go with unique counts – e.g., 1, 2, 3… – within each Order entry, or commit to the use of GUIDs for the same purpose.)  This exercise uses unique IDs added to the original Northwind JSON document through the use of the “ID4Tables” option in the ExportAccessDBToJSON utility.
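As a quick sketch of the two ID strategies just mentioned (per-order counters versus GUIDs), something like the following Python would do.  The detail dictionaries are hypothetical and this is not the actual ID4Tables implementation – it only illustrates the choice.

import itertools
import uuid

# Hypothetical Order Detail rows for one order; the real rows carry no unique ID.
details = [
    {"ProductID": "11", "Quantity": "12"},
    {"ProductID": "42", "Quantity": "10"},
]

# Strategy 1: a simple per-order counter (1, 2, 3, ...).
counter = itertools.count(1)
by_counter = {"ItemEntry_%d" % next(counter): d for d in details}

# Strategy 2: a GUID per detail row, unique across all orders.
by_guid = {"ItemEntry_%s" % uuid.uuid4(): d for d in details}

print(by_counter)   # e.g. {'ItemEntry_1': {...}, 'ItemEntry_2': {...}}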

With all that in place, here is our final, Cassandra-ready data structure with an “order”-centric focus and all the information from our original Northwind relational structure.

{
  "Northwind" : {
    "Orders" : {
      "[OrderID]" : {
        "OrderInformation" : {
          "column name" : "value"
        },
        "Employee" : {
          "column name" : "value"
        },
        "Customer" : {
          "column name" : "value"
        },
        "ItemEntry_[ItemEntryID]" : {
          "ItemInformation_column1" : "value",
          "ItemInformation_column2" : "value",
          "Product_column1" : "value",
          "Product_column2" : "value",
          "ProductSupplier_column1" : "value",
          "ProductSupplier_column2" : "value",
          "ProductCategory_column1" : "value",
          "ProductCategory_column2" : "value"
        }
      }
    }
  }
}
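To see why this order-centric layout pays off, here is a hedged sketch of the read path: one row fetch returns the order header and every item entry together, and the item entries can be picked out by their key prefix.  The nested dictionary stands in for a real Cassandra row – no client library is assumed.

# A tiny, hand-built stand-in for one Cassandra row keyed by OrderID.
row = {
    "OrderInformation": {"OrderDate": "1996-07-04"},
    "Customer": {"CompanyName": "Vins et alcools"},
    "Employee": {"LastName": "Buchanan"},
    "ItemEntry_1": {"Product_column1": "value", "ItemInformation_column1": "value"},
    "ItemEntry_2": {"Product_column1": "value", "ItemInformation_column1": "value"},
}

def split_order(row):
    """Separate the order-level super columns from the per-item ones."""
    header = {k: v for k, v in row.items() if not k.startswith("ItemEntry_")}
    items = {k: v for k, v in row.items() if k.startswith("ItemEntry_")}
    return header, items

header, items = split_order(row)
print(len(items), "item entries on this order")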

Next Steps

Next I will show exactly how to use DivConq utilities to implement our plan and convert our Northwind JSON to a Cassandra-ready format.

 

Posted by on 2010-Jul-07 in Cassandra, Intermediate, Northwind


Utility to Denormalize JSON Structures by Consolidating Two Object Trees

This article is about a DivConq utility that denormalizes JSON structures by consolidating two object trees at a time.

If you need to learn about JSON first, start here.  If you’re not sure why you’d want to denormalize JSON structures, read this article first.

Downloading and Installing

This utility was written in C# and compiled for both .NET and Mono.  The two compilations are identical except:

  • The .NET compilation supports console text colors – the Mono compilation does not
  • The Mono compilation does not (yet) support “utf32” encoding

To get the .NET executables, download this URL:
http://www.divconq.com/wp-content/uploads/2013/05/JSONUtils_dotnet_v1001.zip

To get the Mono executables, download this URL:
http://www.divconq.com/wp-content/uploads/2013/05/JSONUtils_mono_v1001.zip

After downloading the Zip file of your choice, unpack it into a local directory and run the “dotnet” or “mono” executable (each is unpacked to a folder of that name).

Command Syntax

ConsolidateJSONObjects v1.0 - July 26, 2010
  Denormalizes/distributes one JSON object into another.
  Please visit http://www.divconq.com for more information.
  This utility uses the Procurios C#.NET JSON library under the MIT license.

  Usage:
   ConsolidateJSONObjects InputFile OutputFile MergeIntoObject MergeFromObject
     MergeAsName ForeignKey [DeleteFrom] [FormatHack] [DebugLevel]
     ...where:
       InputFile = name or path of input file containing original JSON
       OutputFile = name or path of output file to write final JSON into
       MergeIntoObject = name of JSON object into which object will be merged
       MergeFromObject = name of JSON object to read data from for merge
       MergeAsName = name template of new grandchild objects created in the
           MergeIntoObject - this parameter uses a [ID] macro, e.g., car_[ID]
       ForeignKey = name of column which contains the foreign key value used
           to connect MergeFromObject children with MergeToObject children
       DeleteFrom = Set to NO to preserve MergeFromObject in OutputFile
           Default is YES, which causes MergeFromObject to be deleted.
       FormatHack = Set to NO to prevent utility from injecting CRLF to
           make the output prettier.  Default is YES.
       DebugLevel = Set to 1 to see only warnings, set to 0 to hide those
           too.  Default is 2.  Set to 3 to see a little more information.

  Example: (the following two invocations do exactly the same thing)
   ConsolidateJSONObjects forestandtrees.txt justforests.txt forests
     famoustrees famoustree_[ID] forestID
   ConsolidateJSONObjects forestandtrees.txt justforests.txt forests
     famoustrees famoustree_[ID] forestID YES YES 2

Short Example of Use

Now is a good time to go back and re-read the “Migrate a Relational Database Structure into a NoSQL Cassandra Structure (Part I)” article.  This is exactly what we’re going to do with the ConsolidateJSONObjects utility in this example.

To get started, download the input data file containing the “forests” and “famoustrees” object trees from this URL.  To conserve keystrokes,  save this file in the same directory as either your mono or dotnet edition of ConsolidateJSONObjects.exe.

Next, open a console or terminal and CD into the directory with your  “ConsolidateJSONObjects.exe” executable and the “forestandtrees.txt” file you just downloaded.  Run “ConsolidateJSONObjects.exe” without any arguments – expect to see the console help for the utility displayed.

Now let’s try it again with all six of the required arguments filled out:

  • Input and output JSON files: pretty self-explanatory
  • The “MergeInto” object is where we want all our data to end up.  In this case it’s the “forests” object.
  • The “MergeFrom” object is the object we want to collapse.  In this case it’s the “famoustrees” object.
  • “MergeAsName” is completely up to you: it determines the name/key of the new object added to each child of the MergeInto object.  Unless you’re merging “1:1” object relationships, this value should contain the string “[ID]”, which will help ensure that object names/keys remain unique in their parent nodes.
  • “ForeignKey” is the name/key of existing grandchildren of the MergeFrom object and is used to link the children of the MergeInto object with the children from the MergeFrom object

The expected output from this command is shown below.  Notice the one warning – this indicates that there was a “famoustree” that could not be linked to a specific “forest”.

C:\jsontools\dotnet>consolidatejsonobjects forestandtrees.txt JSONout.txt forests famoustrees famoustree_[ID] forestID
00:10:38 Found expected organization in the "forests" object.
00:10:38 Found expected organization in the "famoustrees" object.
00:10:38 WARNING: tree23456 Foreign Key=forest002 not found in target JSON structure!
00:10:38 Completed OK. Moved 3 children and found 1 orphans.

Open the “JSONout.txt” file to view the output.  You’ll notice that it’s not as pretty as my hand-formatted input, but the utility revalidates output JSON before writing it so at least you know it’s real JSON.

Take a close look at the “Black Forest” forest entry and notice that it has grown two “famoustree_*” children, each with a unique name derived from its unique key in the old famoustrees object.  Also notice that the whole famoustrees object is missing from the output – even the entry that generated the warning because it couldn’t be matched was deleted.

If you open the original “forestandtrees.txt” JSON file you should be able to find the source of the warning by searching for “tree23456”.  You can practice taking out this node or adding a stub for the proper forests entry to see how the utility responds, or test it with other data of your choice.
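If you want to reason about where that warning comes from without re-running the tool, the matching rule is easy to reproduce by hand.  The sketch below is not the utility’s actual C# code – just a rough Python illustration of moving children by foreign key and flagging the ones whose parent is missing.

# Trimmed-down stand-ins for the two object trees in forestandtrees.txt.
forests = {"forest003": {"name": "Black Forest"}, "forest045": {"name": "100 Acre Woods"}}
famoustrees = {
    "tree12345": {"forestID": "forest003", "name": "Der Tree"},
    "tree23456": {"forestID": "forest002", "name": "Hypothetical Orphan"},
}

moved, orphans = 0, 0
for tree_id, tree in famoustrees.items():
    parent = forests.get(tree.pop("forestID"))
    if parent is None:
        orphans += 1                      # the utility logs a WARNING for these
        continue
    parent["famoustree_" + tree_id] = tree
    moved += 1

print("Moved %d children and found %d orphans." % (moved, orphans))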

Once you are comfortable with how the six required parameters work and how the consolidation behaves, move on and experiment with the three optional parameters.

First, add a value of “NO” to the end of your six values in the seventh slot (the “DeleteFrom” parameter) to prevent the MergeFrom object (e.g., “famoustrees”) from being deleted.  You might use this in a denormalization scenario where you need to consolidate multiple sets of objects (through multiple invocations of this utility) and you need to keep the MergeFrom object around until the final merge.

Next, add a second value of “NO” in the eighth slot (the “FormatHack” parameter) to prevent the utility from trying to add some whitespace and line feeds to your JSON output.  You would use this if the utility’s attempts to make the JSON output pretty were messing up the data in the JSON output or yielding invalid JSON.  (The format hack also incurs a small, usually negligible performance hit; you could use this flag to avoid that too.)

Finally, try different values from 0-3 in the ninth slot (the “DebugLevel” parameter).  Smaller values yield less output and the utility is currently not coded to care about any value larger than 3.

Next Steps

The next article in this series will show you how to use this utility and others like it to prepare JSON output from the Northwind relational database for import into Cassandra.

 

Posted by on 2010-Jul-07 in Intermediate, JSON


Google Joins Microsoft In GeoPolitical Private Cloud Deployment

While attending the RSA conference in San Francisco this year I wrote a brief article for another blog about Microsoft establishing a geopolitical private cloud for U.S. government use.  At that time I wrote:

As noted by Gavin Clark in The Register:
http://www.theregister.co.uk/2010/02/27/microsoft_government_cloud/
“Among the features (in Microsoft’s latest U.S. government cloud offerings) are secured and separate hosting facilities, access to which is restricted to a small number of US citizens who have cleared rigorous background checks under the International Traffic in Arms Regulations (ITAR).”

In other words, Microsoft has defined a large private cloud segment that will never span political boundaries.   However, not every Federal process must comply with ITAR or even the higher levels of FISMA.  It will be interesting to see whether other cloud vendors follow suit with their own private offerings or if private government clouds restricted to and maintained in a single country are just a niche.

As predicted, Google has followed Microsoft’s lead.  The Register’s Cade Metz notes this in today’s article entitled “Google Apps rubber-stamped for use by US gov”.

The new service segregates Gmail and Google Calendar data in a section of Google’s back-end infrastructure that’s separate from services used by non-government users, and all the data centers housing these segregated applications are located in the continental United States. Google says that in the future, it will segregate other applications in this way.

So…that’s two major cloud vendors getting behind permanent private clouds delineated by geopolitical boundaries.  Who’s next?

 

Posted by on 2010-Jul-07 in Cloud, Other Organizations, Regulation


Cloud Portability: Demand It!

By now you’ve been deluged by advertising extolling the benefits of cloud computing: use only what you need, scale up or scale down, and take advantage of widely (and someday geographically) distributed computers and networks.

However, two simple facts often get swept under the rug:

1) Most cloud providers are trying to lock you in to their stack of application and data services.

Amazon provides an application stack (EC2 and more) and data stack (S3, SimpleDB, etc.) that isn’t compatible with Microsoft’s application stack (.NET w/ Azure) and data stack (SQL Azure, etc.), and neither are compatible with other clouds such as those from IBM, Google and others.

When you develop an application for a specific stack of cloud application and cloud data storage infrastructure, you tie yourself to the viability and good behavior (pricing, availability, etc.) of that specific, nonportable technology.

2) When you scale up/down you do so at the prices your cloud provider sets.

Unless you are a high-usage customer of a particular vendor’s cloud services you will have little ability to negotiate a price with your cloud provider.   Even if you are a high-usage customer, your leverage to negotiate will be significantly reduced by the fact that you will be heavily dependent on (locked in by) the application and data stack of your cloud provider.

Developing Cloud Portable Applications

Make no mistake: cloud computing IS an important movement and companies and developers need to start taking advantage of it NOW.  However, there are technologies and architectures you can use to protect yourself from cloud lock-in.  By developing on cross-platform, cloud-ready technologies like Cassandra, Nginx and lighttpd rather than the proprietary technologies offered directly by cloud vendors, you gain the ability to:

1) pick up and move your critical applications if your current cloud provider fails to meet its service level agreement (SLA)

2) shop around for price and put pressure on your cloud provider to give you discounts on your hosted services

3) host simultaneously on multiple clouds or your own equipment for exactly the right combination of worldwide availability and price for your enterprise

DivConq Provides A Bridge to Cloud Portability

In our founding post, we dedicated ourselves to technology discussions that involved “highly distributed, heterogeneous systems”.  If you solve that challenge, and two or more of your systems are clouds, you’ve effectively solved cloud portability.  Please stay tuned as we continue to explore and explain the generation of technology that will bridge the cloud portability gap.

 

Posted by on 2010-Jul-07 in Cloud, DivConq, Elastic Architecture


Introduction to Cassandra Columns, Super Columns and Rows

This article provides new users the basics they need to understand Cassandra’s “column / super column / row” data model.

Though the focus is not on mechanics, this article assumes you are familiar with adding columns to and requesting data from existing keyspaces on Cassandra. If not, please see my earlier article on that topic.  Knowledge of JSON is also important to understand the data examples below – if you need help here, please see my earlier article on that topic.

Remember that a Cassandra column is basically a “name=value” pair* (e.g., “color=red”).  You can use multiple columns to represent data such as:

    "Price" : "29.99",
    "Section" : "Action Figures"

As you may have seen in my previous article, multiple columns can also be grouped in Cassandra “rows” to handle data  such as:

{
  "Transformer" : {
    "Price" : "29.99",
    "Section" : "Action Figures"
  },
  "GumDrop" : {
    "Price" : "0.25",
    "Section" : "Candy"
  },
  "MatchboxCar" : {
    "Price" : "1.49",
    "Section" : "Vehicles"
  }
}

The keys used to group related columns into rows in this example were “Transformer”, “GumDrop” and “MatchboxCar”.

In my earlier article, we looked at a keyspace->column family->row->column data structure.  In JSON, that structure would be represented like this:

{
  "ToyStore" : {
    "Toys" : {
      "GumDrop" : {
        "Price" : "0.25",
        "Section" : "Candy"
      },
      "Transformer" : {
        "Price" : "29.99",
        "Section" : "Action Figures"
      },
      "MatchboxCar" : {
        "Price" : "1.49",
        "Section" : "Vehicles"
      }
    }
  },
  "Keyspace1" : null,
  "system" : null
}

If you wanted to add other types of unrelated collections of information (e.g., “BugCollection” or “PaintColors”), you’d simply keep adding new keyspaces for each new collection.  However, if you needed to keep track of similar collections of data (e.g., your Ohio and New York toy stores instead of a single toy store) you’d need to turn to a different kind of Cassandra element: the “super column”.

To see super columns in action, inspect this keyspace->column family->row->super column->column data structure as it appears in JSON:

{
  "ToyCorporation" : {
    "ToyStores" : {
      "Ohio Store" : {
        "Transformer" : {
          "Price" : "29.99",
          "Section" : "Action Figures"
        },
        "GumDrop" : {
          "Price" : "0.25",
          "Section" : "Candy"
        },
        "MatchboxCar" : {
          "Price" : "1.49",
          "Section" : "Vehicles"
        }
      },
      "New York Store" : {
        "JawBreaker" : {
          "Price" : "4.25",
          "Section" : "Candy"
        },
        "MatchboxCar" : {
          "Price" : "8.79",
          "Section" : "Vehicles"
        }
      }
    }
  }
}


Given its late appearance in this structure, you might expect that “Ohio Store” and “New York Store” would represent super columns that span multiple rows.   However, the opposite is true:  “Ohio Store” and “New York Store” are now the row keys, and entries like “Transformer”, “GumDrop” and “MatchboxCar” have become super column keys.

Like column keys, super column keys are indexed and sorted by a specific type (e.g., “UTF8Type”, “AsciiType”, “LongType”, “BytesType”, etc.).    However, like row keys, super column entries have no values of their own; they are simply used to collect other columns.

Notice that the keys of the two groups of super columns do not match.  ({“Transformer”, “GumDrop”, “MatchboxCar”} does not match {“JawBreaker”, “MatchboxCar”}. )  This is not an error: super column keys in different rows do not have to match and often will not.
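If you think of the structure above as nested maps, the access pattern becomes obvious.  Here is a small Python sketch that uses plain dictionaries as a stand-in for a Cassandra client – the row and super column keys come from the example above.

# The "ToyStores" super column family from above, modeled as nested dictionaries.
toy_stores = {
    "Ohio Store": {                                                       # row key
        "Transformer": {"Price": "29.99", "Section": "Action Figures"},   # super column
        "GumDrop":     {"Price": "0.25",  "Section": "Candy"},
        "MatchboxCar": {"Price": "1.49",  "Section": "Vehicles"},
    },
    "New York Store": {
        "JawBreaker":  {"Price": "4.25",  "Section": "Candy"},
        "MatchboxCar": {"Price": "8.79",  "Section": "Vehicles"},
    },
}

# row key -> super column key -> column name
print(toy_stores["Ohio Store"]["Transformer"]["Price"])       # "29.99"

# Super column keys can differ from row to row, so enumerate rather than assume.
for row_key, super_columns in toy_stores.items():
    print(row_key, sorted(super_columns))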

In a future article I will describe how to pass JSON structures that describe data in either row/column or row/super column/column format into and out of Cassandra – stay tuned!

* = We’ll ignore the timestamp element of Cassandra columns for now.  These timestamps are used to reconcile updates from multiple nodes, but don’t worry about that until you understand the whole column/supercolumn thing first.

 

Posted by on 2010-Jul-07 in Beginner, Cassandra, JSON


Migrate a Relational Database Structure into a NoSQL Cassandra Structure (Part I)

This article begins to explore how to migrate a relational database structure (tables linked by foreign keys) into a NoSQL database structure that can be used in Cassandra.

If you do not know what Cassandra is, why NoSQL and Cassandra are important technologies or what JSON is and why you should know it, please click the links in this sentence to learn more about each topic before proceeding.

The Original Relational Database Structure

We are going to start with a very simple 1:N relational database structure. Our first two tables are “forests” and “famoustrees”.  Here is our data in tabular format:

forests:

  forestID  | name           | trees         | bushes
  forest003 | Black Forest   | two million   | three million
  forest045 | 100 Acre Woods | four thousand | five thousand
  forest127 | Lonely Grove   | none          | one hundred

famoustrees:

  (key)     | forestID  | name             | species
  tree12345 | forest003 | Der Tree         | Red Oak
  tree12399 | forest045 | Happy Hunny Tree | Willow
  tree32345 | forest003 | Das Ubertree     | Blue Spruce

“famoustrees” is linked to “forests” using the “forestID” foreign key.  Notice that there are no famous trees in the “Lonely Grove” forest, one famous tree in the “100 Acre Woods” and two famous trees in the “Black Forest”.

If we were to represent the data in our database – call it our “biologicalfeatures” database – in JSON, it would look like this:

{
  "biologicalfeatures":
    {
    "forests" :
      {
      "forest003" :
        {
          "name" : "Black Forest",
          "trees" : "two million",
          "bushes" : "three million"
        },
      "forest045" :
        {
          "name" : "100 Acre Woods",
          "trees" : "four thousand",
          "bushes" : "five thousand"
        },
      "forest127" :
        {
          "name" : "Lonely Grove",
          "trees" : "none",
          "bushes" : "one hundred"
        }
      },
    "famoustrees" :
      {
      "tree12345" :
        {
          "forestID" : "forest003",
          "name" : "Der Tree",
          "species" : "Red Oak"
        },
      "tree12399" :
        {
          "forestID" : "forest045",
          "name" : "Happy Hunny Tree",
          "species" : "Willow"
        },
      "tree32345" :
        {
          "forestID" : "forest003",
          "name" : "Das Ubertree",
          "species" : "Blue Spruce"
        }
      }
    }
}

Denormalizing the Tables

To collapse the famoustrees table into our forests table, we need to move each famoustree entry underneath its forest entry.  We can also remove the “forestID” foreign key from each famoustree entry – we don’t need that anymore.

However, we should retain the type of each famoustree entry we moved into the forest entry.  We can do this by adding an extra “type” value to each entry.

Finally, we could break out the original non-ID information in each forest entry into a typed section too.  We’ll tag each of these sections with a new ID of “generalinfo”.  (This is a Cassandra-friendly convention – we’ll get into this more below.)

Represented in JSON, our data now looks like this:

{
  "biologicalfeatures":
    {
    "forests" :
      {
      "forest003" :
        {
        "generalinfo" :
          {
          "name" : "Black Forest",
          "trees" : "two million",
          "bushes" : "three million"
          },
        "tree12345" :
          {
            "type" : "famoustree",
            "name" : "Der Tree",
            "species" : "Red Oak"
          },
        "tree32345" :
          {
            "type" : "famoustree",
            "name" : "Das Ubertree",
            "species" : "Blue Spruce"
          }
        },
      "forest045" :
        {
        "generalinfo" :
          {
          "name" : "100 Acre Woods",
          "trees" : "four thousand",
          "bushes" : "five thousand"
          },
        "tree12399" :
          {
            "type" : "famoustree",
            "name" : "Happy Hunny Tree",
            "species" : "Willow"
          }
        },
      "forest127" :
        {
        "generalinfo" :
          {
          "name" : "Lonely Grove",
          "trees" : "none",
          "bushes" : "one hundred"
          }
        }
      }
    }
}
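For readers who prefer code to prose, here is a rough Python sketch of the three steps just described: move each famous tree under its forest, drop the “forestID” foreign key, tag each moved entry with a type, and tuck each forest’s original columns under “generalinfo”.  It operates on the JSON shown at the top of this article and is only an illustration, not the DivConq utility introduced later.

import json

def denormalize(db):
    forests = db["biologicalfeatures"]["forests"]
    famoustrees = db["biologicalfeatures"].pop("famoustrees")

    # Step 1: give each forest a "generalinfo" section holding its original columns.
    for forest_id, columns in list(forests.items()):
        forests[forest_id] = {"generalinfo": dict(columns)}

    # Steps 2 and 3: move each famous tree under its forest, dropping the
    # "forestID" foreign key and tagging the moved entry with a type.
    for tree_id, tree in famoustrees.items():
        forest_id = tree.pop("forestID")
        forests[forest_id][tree_id] = dict({"type": "famoustree"}, **tree)

    return db

# db = json.loads(original_json_text)   # the first JSON structure in this article
# print(json.dumps(denormalize(db), indent=2))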

Ready for Cassandra?

There are really only two types of JSON data structures that can be imported directly into Cassandra.  One is the
keyspace->columnfamily->rowkey->column
data structure shown below:

{
  "keystore":
    {
    "columnfamily" :
      {
      "rowkey" :
        {
          "column name" : "column value"
        }
      }
    }
}

Add another layer and you get the other supported data structure
keyspace->columnfamily (a.k.a. “supercolumnfamily”)->rowkey->supercolumn (a.k.a. “subcolumn”)->column
shown below:

{
  "keystore":
    {
    "columnfamily" :
      {
      "rowkey" :
        {
        "supercolumn" :
          {
          "column name" : "column value"
          }
        }
      }
    }
}

That’s it: if you can get your data to fit into one of those two JSON structures, your data is ready to be input into Cassandra.

You probably suspect that I wouldn’t have taken you this far if our forests data wasn’t ready for Cassandra, but please take a moment to scroll up and see if you can figure out whether our denormalized forests data uses supercolumns or not.

Let’s break it down:
biologicalfeatures -> forests
…matches the keyspace->columnfamily structure used by both supported JSON structures.

As for the rest:
forest003 -> generalinfo -> (name=”Black Forest”)
…matches the rowkey->supercolumn->column structure used by the “supercolumn” supported JSON structure.

So, yes, we had to use supercolumns to denormalize the forests and famoustrees tables properly.
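If you would rather verify that conclusion mechanically than by counting braces, a quick nesting-depth check does the job: name/value pairs sitting under four levels of keys mean the plain column structure, five levels mean super columns.  This is only a convenience sketch, not part of any DivConq utility.

def value_depths(node, depth=1):
    """Yield the key depth at which every leaf (non-dict) value sits."""
    for value in node.values():
        if isinstance(value, dict):
            for d in value_depths(value, depth + 1):
                yield d
        else:
            yield depth

def cassandra_shape(doc):
    depths = set(value_depths(doc))
    if depths == {4}:
        return "keyspace -> columnfamily -> rowkey -> column"
    if depths == {5}:
        return "keyspace -> columnfamily -> rowkey -> supercolumn -> column"
    return "not directly importable (leaf values at depths %s)" % sorted(depths)

# Feeding the denormalized forests structure above into cassandra_shape()
# returns the supercolumn shape, matching the walkthrough.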

Next Steps

Next we’ll perform this type of analysis on the Northwind JSON structure exported in a previous article.

Doing this type of denormalization by hand would be a large PITA, so DivConq created a utility to do it automatically. The article after that shows how to use that DivConq utility and a few more like it to complete the conversion of the Northwind JSON into Cassandra-ready format.

Soon after that I will also cover how to import the final JSON data directly into Cassandra – stay tuned!

 

Posted by on 2010-Jul-07 in Beginner, Cassandra, nosql


Rackspace Endorses Cloud Portability

Like the authors of DivConq, the people at Rackspace (a large U.S.-based hosting provider) have repeatedly tossed their hats into the ring of “cloud portability”.  Today, Rackspace cemented its status as a cloud-portable vendor by announcing “OpenStack”: an open source platform built, in part, through a collaboration with NASA (the U.S. government space agency).

http://www.rackspace.com/information/mediacenter/release.php?id=8489

Rackspace wasn’t acting alone here either – executives from several partners were quoted:

“We believe in offering customers choice in cloud computing that helps them improve efficiency,” says Forrest Norrod, Vice President and General Manager of Server Platforms at Dell.

“(This) provides a solid foundation for promoting the emergence of cloud standards and interoperability,” said Peter Levine, SVP and GM, Datacenter and Cloud Division at Citrix Systems.

…and other companies listed in the announcement included: AMD, Autonomic Resources, Cloud.com, Cloudkick, Cloudscaling, CloudSwitch, enStratus, FathomDB, Intel, iomart Group, Limelight, Nicira, NTT DATA, Opscode, PEER 1, Puppet Labs, RightScale, Riptano, Scalr, SoftLayer, Sonian, Spiceworks, Zenoss and Zuora.

In its advertising, Rackspace demonstrates that it truly understands cloud portability, saying the initiative:

  • Prevents vendor lock-in
  • Increases flexibility in deployment for a highly elastic commodity cloud
  • Offers a bigger, more robust ecosystem for more tools, better capabilities and a stronger platform
  • Gives you the freedom to decide how you want your cloud
  • Drives greater industry standards
  • Increases the speed of innovation in cloud technologies

Let’s hope other cloud vendors take note of this initiative and encourage the same open approaches that brought us Linux to bring us the next wave of innovation in the cloud!

 

Posted by on 2010-Jul-07 in Cloud, Other Organizations


Web Server Threading Models

Being very performance-minded, I’d like to take a tangent into server performance. Server performance is tightly connected to the choice of threading model. In particular I’m referring to the threading model of dynamic web pages: PHP, ASP, JSP, Rails, etc. About the worst thing you can do for performance is to have a thread just waiting. Just throwing more threads at the problem does not resolve it – though it can help. Preventing any waits at all is by far the superior solution.

At a high level there are three thread modeling choices in server design:

1) A single thread reads the request, then calls the script that generates the page and waits for the script to complete, after which any final writes are completed. The thread stays with the request from the first byte read to the last byte written. There are ways to make this model even less efficient, but let’s leave it at that.

This approach can result in a lot of wait time on the thread while it waits for request buffers to arrive and for ACKs when writing response buffers.

There is also an inherent issue in that all the data about the request must be collected up front before the script is called. This means waiting for all the request buffers to be delivered and, in the case of large requests (file uploads), caching the request on disk.

There are two problems with that:

a) memory (RAM or disk) is being filled up by request data and often by response data too

b) data that gets on disk may be vulnerable since it will not typically be encrypted

2) Another approach – called async I/O – is to keep a pool of threads that is used to read requests one buffer at a time. The thread is active only long enough to process that one buffer, and then it goes back into the pool. Once the request is completely assembled, a script is called and the server waits for the script to complete.

This approach gets past the issue of waiting on reads and writes (when response buffers are used) but still has the memory issues, the security issues and the wait on the script. Another way to state that is that the script *must* return a complete response and that the server dedicates at least one thread to the script for the entire duration of the response generation.

An exception is when response buffering is not used; however, in that case the async I/O benefits are lost for the output.

Note that the server’s thread timeline now has holes in it – a good thing, because those gaps free resources.

3) Another approach – called full async – uses a pool of async I/O threads for reading and writing to the client. However, it also allows for the script generating the response to be fully async.

Note that another hole appears during the run of the script, because the script has made an async operation of its own. This capability is very important now that we use web services so often – it makes no sense to block a thread while calling a web service.
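As a tiny illustration of that last point – written in Python with asyncio purely for brevity, and unrelated to the servers discussed here – the handler below yields its thread while it awaits a slow web-service call, so a single event-loop thread can service other requests instead of blocking.

import asyncio

async def call_web_service(order_id):
    # Stand-in for a non-blocking HTTP call to a back-end web service.
    await asyncio.sleep(0.5)          # the event-loop thread is NOT blocked here
    return {"order": order_id, "status": "shipped"}

async def handle_request(order_id):
    # A "full async" script: it gives up the thread at every await point.
    data = await call_web_service(order_id)
    return "Order %s is %s" % (data["order"], data["status"])

async def main():
    # Serve three requests concurrently on a single thread.
    for response in await asyncio.gather(*(handle_request(i) for i in range(3))):
        print(response)

asyncio.run(main())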

The most evolved examples of full async are perhaps better named streaming async. Streaming async gets around the problem of caching files for upload or download – while still retaining the async I/O model.

I myself have developed a couple of streaming (full) async servers at my work; unfortunately, they are not open source. At some point I’ll discuss streaming async in more detail.

The full async server concept is gaining ground. One example is the newcomer nodejs, which does not send a response until you call ‘end’ on the response object, meaning the script’s thread can exit without the response having been sent. This is not at all like PHP, ASP or other standard web servers.

Another example is the upcoming Ruby on Rails version 3 – check out the async_sinatra and async_rails projects. They are going so far as to make even calls to the database async – a great idea! Keep in mind my articles about Nginx when checking out Thin.

Yet another example is the Kayak HTTP Server’s responder interface. Kayak is a .NET-based web server.

In the world of Java, the soon-to-be-released Jetty 7 has extended the Continuations API (previously used mainly for Comet) to all web scripts, so now Servlet 3.0 servlets can be fully async as well.

From the examples we can see that only in the last year has the full async model become trendy, and for good reason: it is very important for the scalability of a web server.

 

Posted by on 2010-Jul-07 in Beginner, Elastic Architecture


Export a Microsoft Access Database to JSON (Northwind Example)

This article shows how to use a command-line utility to export an existing Microsoft Access database to a text file containing a single JSON structure.

If you do not know what JSON is, please see my earlier article on that subject.

The Microsoft Access database we are going to export today is the (in)famous “Northwind” example database, which millions of Microsoft programmers have been exposed to.

Download and Install the “Northwind” Access Database

First, download a copy of the Northwind Access database from here.

Save the ZIP file and then unpack the “Northwind.mdb” file into a local directory on your Windows file system.  (Optional: open the database with Microsoft Access – note that the utility described below doesn’t need Microsoft Access to work.)

Now open your Windows Control Panel, get into “Administrative Tools” and then open the “Data Sources (ODBC)” panel.   Go to the “System DSN” tab and add a new “Microsoft Access Driver (*.mdb)” entry.  Select your local copy of “Northwind.mdb” when prompted and name your DSN (“Data Source Name”) something like  “NorthwindTest”.

Download and Install the “Access to JSON” Export Utility

We wrote a command-line utility to help you export your Access database as a single piece of JSON.  Download a copy of this here.

Save the ZIP file and then unpack the enclosed files into a local directory on your Windows file system.    To make sure the utility unpacked OK, please open a command prompt, CD into the directory where the executables are and then run “exportaccessdbtojson” from the command line. The expected output is shown below.

C:\utilities>exportaccessdbtojson

ExportAccessDBToJSON v1.0.0.2 –  August 6, 2010
Please visit http://www.divconq.com for more information.

Usage:
ExportAccessDBToJSON NameOfDSN [-JustJSON] [ID4Tables] [> OutputFilename]
…where ID4Tables is a comma-delimited list of tables to which you
want the utility to add an incremented AutoID field.

Example:
To export an existing Access database defined to your system
with an ODBC DSN of “Northwinds” into a file that will contain
a mix of log messages and JSON output, use the following command:
ExportAccessDBToJSON Northwinds > JSONandLog.txt

To export only the JSON output in the output file, use this instead:
ExportAccessDBToJSON Northwinds -Quiet > JustJSON.txt

Export The Access Database (With Log Output)

Start by trying to execute the following command (with the name of your actual DSN in place of “NorthwindTest”):

C:\utilities>exportaccessdbtojson NorthwindTest

If all is well you will see many lines of text stream by ending with something like this:

"Phone" : "(514) 555-2955",
"Fax" : "(514) 555-2921",
"HomePage" : ""
}
}
}=-=-=-= END JSON =-=-=-=
22:00:26 Completed OK.

If not, you may see an error like this:

C:\utilities>exportaccessdbtojson BadNorth
22:01:02 ERROR: Could not connect to DSN=BadNorth! System.Data.Odbc.OdbcExce
ption: ERROR [IM002] [Microsoft][ODBC Driver Manager] Data source name not found
and no default driver specified

If things are going poorly, please see the “Troubleshooting” section below.  However, if things went well, please proceed to the next section.

Export The Access Database As A Single JSON File

Now, let’s turn off the extraneous log messages and the superfluous “START JSON” and “END JSON” tags.   Let’s also capture our output into a file so we can look at it using our favorite text editor.

Use these commands to perform the export and view the output, again substituting in the correct name of your DSN and any output filename you wish:

C:\utilities>exportaccessdbtojson NorthwindTest -justjson > NorthwindJSON.txt

C:\utilities>notepad NorthwindJSON.txt

If you would like to just see the expected JSON output from the Northwind database, please download the Zip file that contains that data as “NorthwindJSON.txt” here.

Next Steps

The next article talks about how to fit a simple relational database structure into the form needed by Cassandra today. The article after that expands on the concept to rearrange the Northwind JSON structures you just exported into a Cassandra-friendly format. Then, in a future article, I will show how to import the converted JSON into Cassandra.  Stay tuned!

How the Utility Works

This utility connects to the named Access database, lists all the tables, assumes the first column is the key in each table, and then exports the complete contents of each table as a series of string-based name/value pairs, collected on a row-by-row basis by the value of the key column.  The “ID4Tables” argument is used to add unique IDs to tables that lack them: e.g., the “Order Details” table in Northwind.  JSON structures created will include escaped special characters (e.g., “\\” to represent a backslash) but are not passed through a full JSON validation step before they are written, so they are not 100% guaranteed to work in your JSON parser.  (These structures do, of course, work with the Northwind Access database.)

This utility can be used on databases other than the Northwind sample database but has not been extensively tested against many others – your mileage may vary.
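For the curious, the same approach can be sketched in a few lines of Python with the pyodbc package.  This is an illustration of the steps described above, not the utility’s actual C# source; the DSN name is the one created earlier, and the key-column assumption matches the utility’s.

import json
import pyodbc   # pip install pyodbc

def export_dsn_to_json(dsn="NorthwindTest"):
    cnxn = pyodbc.connect("DSN=%s" % dsn)
    cursor = cnxn.cursor()
    table_names = [t.table_name for t in cursor.tables(tableType="TABLE")]
    db = {}
    for table in table_names:
        cursor.execute("SELECT * FROM [%s]" % table)
        columns = [c[0] for c in cursor.description]
        db[table] = {}
        for row in cursor.fetchall():
            values = [str(v) for v in row]
            # Assume the first column is the row key, as the utility does.
            db[table][values[0]] = dict(zip(columns[1:], values[1:]))
    return json.dumps({dsn: db}, indent=2)

# print(export_dsn_to_json("NorthwindTest"))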

Troubleshooting

Most problems with this process are caused by typos in your command-line invocation or in your DSN name in the ODBC settings.  Also make sure you set up a System DSN, not a User DSN entry.

Also, please avoid DSNs with spaces or other special characters.   The utility wasn’t tested against these.

 

Posted by on 2010-Jul-07 in Beginner, JSON, Northwind
