
Archive for the ‘Cassandra’ Category

Cassandra = Long Recovery Time?

20 Dec

A new interview with Facebook “infrastructure guru” Karthik Ranganathan on The Register (Facebook: Why our ‘next-gen’ comms ditched MySQL) adds fuel to a simmering reliability argument around Cassandra.

The Register writes: 

“For many, it’s surprising that Facebook didn’t put (its next-gen) messaging system on Cassandra, which was originally developed at the company and has since become hugely popular across the interwebs. But Ranganathan and others felt that it was too difficult to reconcile Cassandra’s ‘eventual consistency model’ with the messaging system setup.”

If you’re reading this site, you probably already know that Cassandra’s “eventual consistency model” says that, in a quiet world, all nodes will “eventually” receive all updates from all other nodes, and the entire dataset will become “consistent” across all nodes.  Of course, no system is really quiet, so there’s some lag in the system, but reliability is generally good.

So what did Facebook use instead for this?  HBase, the open source distributed database modeled after Google’s proprietary BigTable platform.  Ranganathan continues:

“For a product like messaging, we needed a strong consistency model [as HBase offers]. If a user sends an email, he says, you have to be able to tell the user – immediately the email was sent. If it doesn’t show that it’s been sent, I think ‘Oh, I didn’t send it,’…”

Of course, this isn’t a problem if the same application that sent the message (or its cache) keeps track of the fact that the message was just sent, but it appears that Facebook wants to be free of that assumption.

But other than that, Cassandra’s still the cat’s pajamas, right Karthik?

“(Ranganathan) also felt that the system needed HBase physical replication as opposed to Cassandra’s logical replication. With Cassandra, he says, if something gets corrupted and you lose a disk and you have to restore the data, you have to restore the entire replica. Because Facebook uses very dense machines, this would mean a very long recovery time.”

Now that’s a shot at the core of Cassandra’s eventually consistent model.  Who answered the challenge on behalf of Cassandra?

“Some Cassandra backers don’t see it that way.”

In other words…no one.  This is where folks like Riptano, a venture-capital-backed commercial company that depends on Cassandra’s commercial success, or the leaders of the Apache Cassandra project need to stand up and make their voices heard.

Let’s hope Cassandra’s forthcoming 0.7 release gets a better reception in the press.

 

Posted in Cassandra, Uncategorized

 

Microsoft Jumps On Columnar Cloud Bandwagon, Provides Cloud Escrow

10 Nov

When we’re in a technical conversation about Business Intelligence (BI), the question of “which database do you use for BI” invariably comes up.  Whatever the database name is, chances are that the type of database will be described as “columnar”.  If you’re a frequent reader of this site, you may know that columnar and “NoSQL” databases are kissing cousins, and that we’re big fans of the Cassandra NoSQL database in these parts (though we advocate some tweaks).

We’re confident in our positions, but every once in a while it’s good to hear that we’re not just bleeding-edge iconoclasts.  Today, Microsoft provided that reassurance when it announced its “Apollo” initiative to the masses.

In a Gavin Clarke interview published in The Register, Quentin Clark, general manager of the Microsoft SQL Server Database Systems Group, talks about, “new columnar technology called Apollo,” which Clark claimed could boost certain queries by between 10 and 50 times.

Other people were also struck by the new Apollo technology during Microsoft’s Nov 9 keynote at the PASS Summit.  Here’s one blogger reacting:

“This is a great demo. We’re seeing a trillion rows per minute, filtered & reported on. It’s very slick. This is good. Same technology is also in the database engine. We’re seeing fantastic performance. I might be out of a job. It’s based on the columnar data store technology. It’s a very good thing.”

If you want to see the demo yourself, pull up this page in IE (you need Windows Media Player) and fast forward to about this point.

Though additional details on Apollo are sketchy so far, chances are that the fog will be lifted when the latest preview of Denali (the code name for the next version of SQL Server) is sent to subscribers on MSDN and TechNet, as Microsoft is promising near-parity of its on-premises and cloud-based SQL Server offerings.

Not lost on DivConq is the fact that, by providing this level of parity between on-premises and cloud-based offerings, Microsoft is giving its customers the ability to choose and later change their deployment models.  In other words, Microsoft is making cloud escrow a reality.  Who said they were evil?

 

Posted in Azure, Cassandra, Cloud, Other Organizations

 

Stored Procedures in Cassandra

18 Sep

It is a much-discussed topic whether or not business logic should be present within a database.  Academically, the preference seems to be to separate the layers.  However, one of the best-performing applications I ever worked on placed the lion’s share of the business logic within the database – that was with another NoSQL database, MUMPS.

Out in the wild the reality is that there are many variations – sometimes business logic is distributed across many servers, sometimes it is all on one server. And sometimes stored procedures are used but contain a minimum of business logic and instead enhance the efficiency of data storage or retrieval.

It is my opinion that Cassandra should support an easy-to-use model of stored procedures and let the user decide the level of appropriateness for their software needs.  One of the concerns often associated with stored procedures is the lack of portability among database vendors and the problem of database vendor lock-in.  With Cassandra you have that concern with the data model anyway, so an investment in Cassandra suggests you may wish to get the most out of what the software has to offer.

Another challenge of stored procedures is the often complex and unfamiliar language syntax/API.  I find T-SQL fairly reasonable to work with, for example, but it can be a jolt if you spend most of your time coding in Java or C#.  The stored procedure solution we are developing for Cassandra will leverage the familiarity of Javascript.

With the emergence of Web 2.0 and HTML 5, Javascript is playing an ever more important role on the client.  Along with this has come a revived interest in Javascript on the server.  With our enhancements to Cassandra we’ll offer a complete end-to-end model of developing applications using exclusively Javascript.  Indeed, by blurring the line between web services and stored procedures, our enhancements will provide an efficient and concise programming model that supports very high-performance code.

And let’s not forget about distributed computing; it is our primary focus, after all.  An obvious, but still beautiful, outcome of using Cassandra to *store* the stored procedures and web services (and web sites) is that you do not have to replicate your code across all the nodes – Cassandra already does that.  Updating your web service or stored procedure is as simple as a single update or insert call to Cassandra.

What’s more, we will provide the ability to version your stored procedures and web services, along with the ability to select which version is active.  This feature enables a complete roll-out (replication) of a new version while the old version is still running; then flip a switch and all the nodes will start using the new code.  Flip the switch again if you need to revert to the old version.
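
As a rough sketch of the idea (the names below are hypothetical, not a committed design), the layout inside Cassandra might look something like this – one row per procedure, one supercolumn per version of the Javascript code, plus a column naming the active version:

{
  "SystemKeyspace" : {
    "StoredProcs" : {
      "CalculateOrderTotal" : {
        "ProcInformation" : {
          "ActiveVersion" : "2"
        },
        "Version_1" : {
          "Code" : "function (order) { /* old logic */ }"
        },
        "Version_2" : {
          "Code" : "function (order) { /* new logic */ }"
        }
      }
    }
  }
}

Rolling out a new version is then just another write to this row, and flipping the switch is a single update to “ActiveVersion” – which Cassandra replicates to every node like any other column.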

The addition of stored procedures and web services will open a lot of new possibilities with Cassandra and we are excited to be innovating in this area. A strong database is the key to a strong distributed system.

 

Posted in Cassandra, Elastic Architecture, MUMPS, nosql

 

Migrate a Relational Database into Cassandra (Part IV – Northwind Import)

11 Sep

This article shows how to prepare and import a dataset expressed in Cassandra-friendly JSON into a Cassandra datastore using Cassandra’s “json2sstable” utility.

Before proceeding, you should understand my previous “Part III” article on “Northwind Conversion” – this article imports the JSON dataset created in that article.

You should have downloaded and installed either the .NET or Mono version of the DivConq JSON command-line utilities, and you should have a complete JSON document from a conversion of the Northwind database export. (You can also start with the “JSONout7.txt” document from this archive.)

Cleaning the Data

So far everything we’ve done has simply moved data around.  This has led to a JSON structure that contains everything and then some from the original relational database.  We could import that, but from here on out we’ll treat this data more like a traditional data warehouse by only working with a subset of the original data.

To do the stripping, we can use a DivConq utility called “StripNodesFromJSON”.   The following batch snippet cuts out extra nodes (like “Shippers”) and tags (like “Phone”).

rem Let us turn this structure into something we can use in a data warehouse
rem Strip a lot of the extra tags out
rem x Get rid of the Shippers node
StripNodesFromJSON JSONout7.txt JSONout8.txt Shippers
rem x Get rid of extra nodes from the Employee node
StripNodesFromJSON JSONout8.txt JSONout9.txt PostalCode Photo Address ReportsTo HireDate HomePhone Notes BirthDate Extension
rem x Get rid of extra nodes from the Customer node
StripNodesFromJSON JSONout9.txt JSONout10.txt City Phone Region ContactTitle Address PostalCode Fax ContactName
rem x Get rid of extra nodes from the ItemEntry nodes
StripNodesFromJSON JSONout10.txt JSONout11.txt OrderID Product_UnitPrice Product_UnitsInStock Product_QuantityPerUnit Product_ReorderLevel Supplier_City Supplier_Region "Order Details_AutoID" Product_CategoryID Supplier_ContactTitle Supplier_ContactName Product_Discontinued Supplier_HomePage Supplier_PostalCode Supplier_Address Category_CategoryID  Category_Picture Category_Description Supplier_Fax Supplier_Phone
rem x Get rid of extra nodes from OrderInformation
StripNodesFromJSON JSONout11.txt JSONout12.txt OrderID ShipPostalCode ShipCountry CustomerID EmployeeID

If you do a directory listing on the intermediate files created in this batch file you should see that each one is smaller than the one before it.

Cassandra’s JSON2SStable Format

If you’ve worked with Cassandra’s SStable2JSON utility, you’ve seen that the format Cassandra uses for its JSON datasets is not human-readable.

Cassandra’s SStable2JSON utility will export plain (no supercolumn) Column Families like this:

{
  "HotWheelsCar": [
    ["5072696365", "312e3439", 1278132336497000, false],
    ["53656374696f6e", "56656869636c6573", 1278132515996000, false]
  ],
  "GumDrop": [
    ["5072696365", "302e3235", 1278132306875000, false],
    ["53656374696f6e", "43616e6479", 1278132493790000, false]
  ]
}

…and will export supercolumn-filled Column Families like this:

{
  "ultralights": {
    "756c3233": {
      "deletedAt": -9223372036854775808,
      "subColumns": [
        ["7365617432", "392070656f706c65", 1283394499763000, false]
      ]
    }
  },
  "planes": {
    "706c616e65313436": {
      "deletedAt": -9223372036854775808,
      "subColumns": [
        ["726f773138", "372070656f706c65", 1283394371843000, false],
        ["726f773237", "322070656f706c65", 1283394387348000, false]
      ]
    },
    "706c616e65353436": {
      "deletedAt": -9223372036854775808,
      "subColumns": [
        ["726f773232", "332070656f706c65", 1283394349929000, false]
      ]
    }
  }
}

Several things are different from the JSON sets we’ve produced to date:

  • The column and supercolumn names and values are in hex rather than plain strings.  For example, instead of “Price”, you see “5072696365” in the JSON above. (Use this to try it yourself.)
  • There are extra fields, such as “deletedAt” and the trailing “false” (a deletion flag).  Fortunately, it appears that these can be faked up with default values.
  • Columns are filed under a “subColumns” node within each supercolumn entry.
  • JSON array structures are used in place of hierarchies.

…but it’s not impossible, it’s just different.
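
If you want to sanity-check one of these exports by hand, the hex is easy to decode.  Here’s a quick sketch (Python, used purely for illustration) that unpacks the first “HotWheelsCar” entry from the example above:

# Each exported column is a [name, value, timestamp, deleted-flag] array,
# with the name and value hex-encoded UTF-8 strings.
def from_hex(s):
    return bytes.fromhex(s).decode("utf-8")

entry = ["5072696365", "312e3439", 1278132336497000, False]
print(from_hex(entry[0]), "=", from_hex(entry[1]))   # prints: Price = 1.49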

Creating JSON2SStable Format Files

Before we continue, we need to wrap our JSON dataset in one more node to represent the keyspace – so far our top level has been column families, and the only remaining column family is now “Orders”.

Fortunately we can do this without a special utility: just a few lines of a batch file are needed to add a top-level “Northwind” node.

rem Add an extra wrapper for the name of the datastore
echo { "Northwind" : > JSONout12a.txt
type JSONout12.txt >> JSONout12a.txt
echo } >> JSONout12a.txt

Now we’re finally ready to use a DivConq utility to convert our human-readable JSON into the format needed by Cassandra’s JSON2SStable utility.  This part is easy.

rem Now convert wrapped dataset to json2sstable-ready
rem Cassandra array import format
PrepJSONForSSTableImport JSONout12a.txt JSONout13.txt

Now you should have a new, larger file filled with all the information Cassandra will need for its native import utility.
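
To make that concrete, here is a small hand-built illustration (not actual utility output) of what one supercolumn of one Northwind order should look like in the json2sstable format.  Note that the row key (“10778”) stays a plain string, while “4f72646572496e666f726d6174696f6e” is simply “OrderInformation” in hex and the single column shown is “ShipVia” = “1”:

{
  "10778": {
    "4f72646572496e666f726d6174696f6e": {
      "deletedAt": -9223372036854775808,
      "subColumns": [
        ["53686970566961", "31", 1278132336497000, false]
      ]
    }
  }
}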

The Whole Export, Convert and Prep for Import Process

You may have noticed that the DivConq utilities ship with an “exportandimport.bat” file that performs all the steps covered so far.  Running this batch file should generate output like this.

C:\divconq\dotnet>exportandimport
22:38:39 Found expected organization in the "Orders" object.
22:38:39 Found expected organization in the "Order Details" object.
22:38:40 Completed OK.  Moved 2155 children and found 0 orphans.
22:38:40 WARNING: MergeAsName does not contain an [ID] or other macro.  This could lead to invalid JSON through duplicate keys in merged children!
22:38:40 Found expected organization in the "Orders" object.
22:38:40 Found expected organization in the "Employees" object.
22:38:41 Completed OK.  Moved 830 children and found 0 orphans.
22:38:41 WARNING: MergeAsName does not contain an [ID] or other macro.  This could lead to invalid JSON through duplicate keys in merged children!
22:38:42 Found expected organization in the "Orders" object.
22:38:42 Found expected organization in the "Customers" object.
22:38:43 Completed OK.  Moved 830 children and found 0 orphans.
22:38:43 WARNING: MergeAsName does not contain an [ID] or other macro.  This could lead to invalid JSON through duplicate keys in merged children!
22:38:43 Found expected organization in the "Orders" object.
22:38:43 Found expected organization in the "Products" object.
22:38:45 Completed OK.  Moved 2155 children and found 0 orphans.
22:38:45 WARNING: MergeAsName does not contain an [ID] or other macro.  This could lead to invalid JSON through duplicate keys in merged children!
22:38:46 Found expected organization in the "Orders" object.
22:38:46 Found expected organization in the "Suppliers" object.
22:38:48 Completed OK.  Moved 2155 children and found 0 orphans.
22:38:48 WARNING: MergeAsName does not contain an [ID] or other macro.  This could lead to invalid JSON through duplicate keys in merged children!
22:38:50 Found expected organization in the "Orders" object.
22:38:50 Found expected organization in the "Categories" object.
22:38:52 Completed OK.  Moved 2155 children and found 0 orphans.
22:38:53 Found expected organization in the "Orders" object.
22:38:55 Completed OK.  Moved 11620 children and found 0 orphans.
22:38:57 Completed OK.  Deleted 1 nodes.
22:38:59 Completed OK.  Deleted 9130 nodes.
22:39:01 Completed OK.  Deleted 6640 nodes.
22:39:02 Completed OK.  Deleted 43930 nodes.
22:39:03 Completed OK.  Deleted 4980 nodes.
22:39:08 Completed OK.  Did 39140 nodes.

This batch file and any of its commands can, of course, be modified to taste or to work with other datasets.

Importing Into Cassandra

If you simply run Cassandra’s JSON2SStable command you’ll see some short usage information.

C:\work\apache-cassandra-0.6.3>bin\json2sstable.bat
Missing required options: Kc
Usage: org.apache.cassandra.tools.SSTableImport -K keyspace -c column_family <json> <sstable>

…but please use the following procedure to properly import your JSON dataset.

First, shut down your Cassandra client and server (if started).  Then go into your Cassandra folder and open your “conf\storage-conf.xml” file.  Add the following entry to this file and save it.  (You can substitute your own name for “NorthwindOne” as long as you use it consistently below.)

<Keyspace Name="NorthwindOne">
  <ColumnFamily Name="Orders"
     CompareWith="UTF8Type"
     ColumnType="Super" CompareSubcolumnsWith="UTF8Type"
     />
  <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
  <ReplicationFactor>1</ReplicationFactor>
  <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
</Keyspace>

Once you’ve saved this file, start the Cassandra server again.  If your configuration changes were accepted, this will create a new, empty directory in your Cassandra server’s folder store.

You should also fire up the Cassandra client to check that your new keyspace is live.

C:\work\apache-cassandra-0.6.3>bin\cassandra-cli --host localhost
Starting Cassandra Client
Connected to: "Test Cluster" on localhost/9160
Welcome to cassandra CLI.

Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
cassandra> show keyspaces;
NorthwindOne
Keyspace1
system

Now, stop the Cassandra server again and shut down the Cassandra client again.  (The Cassandra client doesn’t respond well to the server going up and down.)

To properly invoke the JSON2SStable utility, use the following syntax, substituting the appropriate values and paths as necessary.

In the example below, “NorthwindOne” is the name of our keyspace and must match the value we saved into the “conf\storage-conf.xml” file above.  “Orders” is the name of the new column family we will be creating and inserting our native-formatted JSON into.  The path to the “JSONout13.txt” file is, of course, the file we’re importing.  Finally, the path to the “Orders-1-Data.db” file indicates which Cassandra data file we will create.  Note that this file does not yet exist, but the rest of the path (the folder structure) must already be in place.

C:\work\apache-cassandra-0.6.3>bin\json2sstable.bat -K NorthwindOne -c Orders C:\divconq\dotnet\JSONout13.txt C:\var\lib\cassandra\data\NorthwindOne\Orders-1-Data.db

If this works correctly, it will take a few seconds to silently import the data and will then silently return you to the command prompt.  If you see any other output from this command, you encountered an error.

Another way to quickly confirm that the data was imported successfully is to eyeball the Cassandra data directory.  This should now contain three new files: Orders-1-Data.db, Orders-1-Filter.db and Orders-1-Index.db.

If you see these files, go ahead and fire up your Cassandra server and client again.

Working With Your Imported Data

Finally, it’s time to view the live data on the live Cassandra server. Try these commands first.

C:\work\apache-cassandra-0.6.3>bin\cassandra-cli --host localhost
Starting Cassandra Client
Connected to: "Test Cluster" on localhost/9160
Welcome to cassandra CLI.

Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
cassandra> show keyspaces;
NorthwindOne
Keyspace1
system
cassandra> get NorthwindOne.Orders['10778']['OrderInformation']
=> (column=ShippedDate, value=12/24/1997 12:00:00 AM, timestamp=1269842588093)
=> (column=ShipVia, value=1, timestamp=1269842588093)
=> (column=ShipRegion, value=, timestamp=1269842588093)
=> (column=ShipName, value=Berglunds snabbköp, timestamp=1269842588093)
=> (column=ShipCity, value=Luleå, timestamp=1269842588093)
=> (column=ShipAddress, value=Berguvsvägen  8, timestamp=1269842588093)
=> (column=RequiredDate, value=1/13/1998 12:00:00 AM, timestamp=1269842588093)
=> (column=OrderDate, value=12/16/1997 12:00:00 AM, timestamp=1269842588093)
=> (column=Freight, value=6.7900, timestamp=1269842588093)
Returned 9 results.
cassandra> get NorthwindOne.Orders['10778']
=> (super_column=OrderInformation,
     (column=Freight, value=6.7900, timestamp=1269842588093)
     (column=OrderDate, value=12/16/1997 12:00:00 AM, timestamp=1269842588093)
     (column=RequiredDate, value=1/13/1998 12:00:00 AM, timestamp=1269842588093)
     (column=ShipAddress, value=Berguvsvägen  8, timestamp=1269842588093)
     (column=ShipCity, value=Luleå, timestamp=1269842588093)
     (column=ShipName, value=Berglunds snabbköp, timestamp=1269842588093)
     (column=ShipRegion, value=, timestamp=1269842588093)
     (column=ShipVia, value=1, timestamp=1269842588093)
     (column=ShippedDate, value=12/24/1997 12:00:00 AM, timestamp=1269842588093)
)
=> (super_column=ItemEntry_1393,
     (column=Category_CategoryName, value=Seafood, timestamp=1269842588093)
     (column=Discount, value=0, timestamp=1269842588093)
     (column=ProductID, value=41, timestamp=1269842588093)
     (column=Product_ProductID, value=41, timestamp=1269842588093)
     (column=Product_ProductName, value=Jack's New England Clam Chowder, timestamp=1269842588093)
     (column=Product_SupplierID, value=19, timestamp=1269842588093)
     (column=Product_UnitsOnOrder, value=0, timestamp=1269842588093)
     (column=Quantity, value=10, timestamp=1269842588093)
     (column=Supplier_CompanyName, value=New England Seafood Cannery, timestamp=1269842588093)
     (column=Supplier_Country, value=USA, timestamp=1269842588093)
     (column=Supplier_SupplierID, value=19, timestamp=1269842588093)
     (column=UnitPrice, value=9.6500, timestamp=1269842588093))
=> (super_column=Employee,
     (column=Country, value=USA, timestamp=1269842588093)
     (column=FirstName, value=Janet, timestamp=1269842588093)
     (column=LastName, value=Leverling, timestamp=1269842588093)
     (column=Title, value=Sales Representative, timestamp=1269842588093)
     (column=TitleOfCourtesy, value=Ms., timestamp=1269842588093))
=> (super_column=Customer,
     (column=CompanyName, value=Berglunds snabbköp, timestamp=1269842588093)
     (column=Country, value=Sweden, timestamp=1269842588093))
Returned 4 results.

You can pick other Order IDs and supercolumn names (e.g., “Customer”, “Employee”, the various “ItemEntry_” entries) to view those records as well.
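
For example, to pull back just the customer information for the same order, you should see something like this:

cassandra> get NorthwindOne.Orders['10778']['Customer']
=> (column=CompanyName, value=Berglunds snabbköp, timestamp=1269842588093)
=> (column=Country, value=Sweden, timestamp=1269842588093)
Returned 2 results.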

Next Steps

At this point you have the tools and documentation to not only import the Microsoft Northwind Access database into Cassandra, but similar databases as well. This concludes the “Migrate a Relational Database into Cassandra” series of articles.

The next set of articles will describe how to build a working application on top of Cassandra.

Troubleshooting

If you encounter errors during import, feel free to shut down the server and wipe all the data files from the folder.

Also look for entries like this in the “C:\var\log\cassandra\system.log” file; while this specific instance shows an import problem from a different dataset, errors like these are really telling you that the newer “Trains-2-Data.*” files are useless but that the older “Trains-1-Data.*” files are still good.

 INFO [main] 2010-09-02 21:48:20,172 SSTableReader.java (line 120) Sampling index for C:\var\lib\cassandra\data\TransSchedTwo\Trains-1-Data.db
 INFO [main] 2010-09-02 21:48:20,177 SSTableReader.java (line 120) Sampling index for C:\var\lib\cassandra\data\TransSchedTwo\Trains-2-Data.db
ERROR [main] 2010-09-02 21:48:20,183 ColumnFamilyStore.java (line 182) Corrupt file C:\var\lib\cassandra\data\TransSchedTwo\Trains-2-Data.db; skipped
 

Posted in Cassandra, Intermediate, JSON, Northwind

 

Resolving JSON data types in Cassandra

02 Sep

In my previous entry I asked “Why does Apache Cassandra have data types?”  So now that the rationale has been covered, what does that mean to us?

At DivConq we like to move data around with JSON, as you can probably tell.  So the question arises: how do we want JSON data types to work with Cassandra?

Read the rest of this entry »

 

Posted in Cassandra, JSON, MUMPS

 

What’s missing from Cassandra and Thrift?

24 Aug

If you’ve spent any time at all with Cassandra you’ve gotten to know the Thrift interface that comes with that innovative NoSQL database.

As you may know, Thrift is a framework that allows you to serialize, transport and unpack data objects from a variety of different development environments, including C++, Java, Python, PHP, Perl, C# and Cocoa.  Like Cassandra, it was developed by Facebook and then donated to Apache.  Though not explicitly a part of Thrift, JSON parallels many of these same concepts, including strongly typed object serialization across the same development environments (and then some).

Facebook has made it quite far in this world with these technologies, but there are also some additional technologies DivConq would like to see embedded in Cassandra to increase its acceptance in the overall development community.

First up are native web services.  Today, you need to have a Thrift client on the remote end of a Cassandra connection to parse and send binary data streams.  You’re also encouraged to manipulate individual full-blown objects, when maybe what you really want to do is hit a range of keys in the back-end database.

Second are sort orders.  The key sorts (on row keys and supercolumns) in Cassandra are currently limited to a single type.  That works great if your keys are really all of the same type, but less well if you have a few numbers or other types you want to mix in there.  (Several other NoSQL databases support sorts with mixed types; I know DivConq co-creator Andy White misses these too.)

Finally, there is the missing concept of “stored procedures,” which are quite common in relational databases.  The lack of these capabilities forces common and reusable data selection and manipulation operations back up into the application layer.  This in turn forces the applications to get involved in “operations replication” in a distributed environment…and that’s really Cassandra’s job.

So…what to do?  Stay tuned to DivConq – we’ll be addressing all three of these challenges soon.

 

Posted in Cassandra, Thrift

 

Why relational DBs are not the best choice for logs

22 Aug

In my role as developer, architect and then product manager for a couple of different server-based software packages, there was a common mistake I found myself making with log records over and over again in the late 1990s and into the 2000s.

Most of these server-based packages logged to a relational database table.  The most advanced packages used a single-gate tamper-evident chain of hashes on log entries as they were written.  Write performance was OK, and read performance, especially across the 2-6 indices that might span a database, was usually good too…on a quiet system.

However, there were three problems with the use of a relational database table for my logging.

First, there was a blocking insert problem.  Whenever something noteworthy happened on my systems (e.g., a sign-on, a file upload, an administrative configuration change, etc.) I logged it.  As long as I didn’t have a busy system things were generally fine, but if a couple of different people hit me with extended periods of rapid file uploads or sign-ins/offs from unthrottled API clients, then my software would shudder and sometimes thrash.

Second, there was an oversubscription problem, where I added even more load onto the log database by using it heavily for common, interactive queries, such as looking back across the log for recent sign-ons.  While that sounds like a good idea because there would only be one authoritative set of records to check, it also magnified the effect of my blocking insert problem.  (e.g., if I got hit with a lot of sign-ons, the act of recording the sign-on in the log would block and slow other sign-ons too.)

Finally, the most serious problem occurred when it was time to upgrade.  My upgrades often involved a schema change in the log database, and that meant I needed to lock the database and update all the log records – often tens of millions of records. This was too frequently an operation that could take hours and often took 100x more time to complete than all other upgrade operations combined.

So…what should I have done?  One answer would have been to look at non-relational NoSQL database technology (such as that available in Apache Cassandra) for my log tables instead.  That would have addressed:

  • the blocking insert problem: NoSQL databases, especially distributed NoSQL databases like Cassandra, do not block on inserts.
  • the oversubscription problem: without delays due to blocking inserts, the problem of many reads waiting behind blocked inserts goes away.
  • schema changes: NoSQL datasets support data of various formats, allowing old and new schema data to live next to each other and preventing outages caused by touching all existing data.  (The multiple schemas put a little more burden on the application to keep these straight, but this allows the application to handle multiple versions and/or upgrade old ones in the background without downtime – see the sketch below.)
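
To make the schema-change point concrete, here is a sketch (the column names are hypothetical) of version-1 and version-2 log rows living side by side in the same column family – the new fields simply appear on new rows, with no mass migration of the old ones:

{
  "AuditLog" : {
    "2010-08-22T10:15:00Z_0001" : {
      "SchemaVersion" : "1", "User" : "jsmith", "Action" : "signon"
    },
    "2010-08-22T10:16:04Z_0002" : {
      "SchemaVersion" : "2", "User" : "jsmith", "Action" : "upload",
      "FileName" : "report.pdf", "SourceIP" : "10.0.0.5"
    }
  }
}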


Would I be alone in switching away from a relational database for this application?  No, I wouldn’t.  Today, many, if not most, data warehouse products (such as Sybase IQ and Infobright) use non-relational “columnar” database structures because those types of databases avoid the problems I just stated.

SaaS-based services such as Reddit have also embraced non-relational database technology.  In a recent article Todd Hoff repeats some of the problems I encountered and lauds the NoSQL solution Reddit deployed instead.  (Click here to view article, scroll to “Lesson 3: Open Schema”)

  • “Schema updates are very slow when you get bigger. Adding a column to 10 million rows takes locks and doesn’t work.”
  • “When (Reddit adds) new features they (don’t) have to worry about the database anymore. They (don’t) have to add new tables for new things or worry about upgrades. Easier for development, deployment, maintenance.”
 

Posted in Cassandra

 

Migrate a Relational Database into Cassandra (Part III – Northwind Conversion)

16 Aug

This article describes how to convert a JSON document containing the complete contents of the Northwind Access database into Cassandra-friendly JSON.

Before proceeding, you should understand my previous “Part II” article on “Northwind Planning” – this article implements that plan.

You should have downloaded and installed either the .NET or Mono version of the DivConq JSON command-line utilities, and you should have a complete JSON document from a conversion of the Northwind database export.  (You can also start with the “NorthwindJSON.txt” document from this archive.)

What To Do?

Copy the following batch file into the folder that contains your DivConq command-line utilities and your Northwind JSON document.  Then open it in your favorite text editor and skip down to the “Understanding the Flow” section below.

@echo off
rem xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
rem X
rem X This batch file uses several divconq.com utilities. First it exports an Access DB
rem X called "Northwind" into a single large JSON structure.  Then it manipulates the
rem X JSON structure until it conforms to a denormalized form suitable for insertion
rem X into a Cassandra DB.  Finally, it imports the JSON structure into a Cassandra DB.
rem X
rem X August 11, 2010 - Jonathan Lampe - jonathan.lampe@divconq.com
rem X
rem xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
rem .
rem Here is the basic schema of the original DB
rem .
rem Categories
rem Customers
rem Employees
rem Order Details
rem   ProductID
rem   OrderID
rem Orders
rem   CustomerID
rem   EmployeeID
rem Products
rem   SupplierID
rem   CategoryID
rem Suppliers
rem .
rem xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
rem .
rem Here is the data schema represented with 1:N and N:1 relationships
rem
rem    Customer }                             { Supplier
rem    Employee } 1:N Orders N:N Products N:1 { Category
rem
rem ...where the N:N relationship is expressed in the "Order Details" table
rem
rem xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
rem .
rem Here is the denormalized, Cassandra-friendly schema that is constructed:
rem .
rem Keyspace=Northwind
rem   ColumnFamily=Orders
rem     RowKey=OrderID
rem       1 SuperColumn="OrderInformation"
rem         Columns=(contents from one Orders entry)
rem       1 SuperColumn="Customer"
rem         Columns=(contents from one Customers entry)
rem       1 SuperColumn="Employee"
rem         Columns=(contents from one Employees entry)
rem       0-N SuperColumns="ItemEntry_"+[NumericCount]
rem         Columns=(contents from one Order Details entry) +
rem                 (contents from one Product entry)
rem                 (contents from one Supplier entry)
rem                 (contents from one Category entry)
rem .
rem xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
rem .
rem Bring in 0-N item entries per order, each as a new supercolumn
consolidatejsonobjects NorthwindJSON.txt JSONout.txt Orders "Order Details" ItemEntry_[ID] "OrderID" yes yes 4
rem .
rem Bring in exactly one employee per order as a new supercolumn
distributejsonobjects JSONout.txt JSONout2.txt Orders "Employees" Employee "EmployeeID" yes yes 4
rem .
rem Bring in exactly one customer per order as a new supercolumn
distributejsonobjects JSONout2.txt JSONout3.txt Orders "Customers" Customer "CustomerID" yes yes 4
rem .
rem Bring in exactly one product entry per item entry as new columns in existing supercolumn
distributejsonobjects2 JSONout3.txt JSONout4.txt Orders ItemEntry_* Products Product_ ProductID yes yes 4
rem .
rem Bring in one supplier entry and one category entry per item entry as new columns in existing supercolumn
rem   (note these depend on previously imported Product entries)
distributejsonobjects2 JSONout4.txt JSONout5.txt Orders ItemEntry_* Suppliers Supplier_ Product_SupplierID yes yes 4
distributejsonobjects2 JSONout5.txt JSONout6.txt Orders ItemEntry_* Categories Category_ Product_CategoryID yes yes 4
rem .
rem Demote remaining rowkey-columns into a new supercolumn
rem .
demotejsonpairsintonewobject JSONout6.txt JSONout7.txt Orders OrderInformation yes yes 4

Understanding the Flow

The top part of the batch file simply reiterates what we planned in the previous article.  The end is where all the processing actually takes place: between the “rem”ed lines there are 7 JSON conversions, each performing a single piece of the consolidation.

Step 1: We open the list of “Order Details” and pull in zero, one or multiple entries as new “ItemEntry_[ID]” nodes.

Steps 2 and 3: We reach out and import exactly one Customer node and one Employee node into our master Order record.

(Note that Steps 1, 2 and 3 can really occur in any order because these three entities are completely independent.  However, “Order Details” must have been imported before Step 4 can take place.)

Step 4: We reach out and import the contents of exactly one Product node into each existing ItemEntry node by looking up ProductIDs from existing ItemEntry name/value pairs.

(Note that Step 4 – the Product import – must take place before Steps 5-6 can take place.)

Steps 5-6: We reach out and import the contents of exactly one Category node and exactly one Supplier node into each existing ItemEntry node by looking up CategoryIDs and SupplierIDs from recently imported ItemEntry > Product information.

Step 7: Consolidate the remaining entries under the main Order entry into a new OrderInformation node.

(Step 7 – the consolidation of hanging entries – should always go last in case previous steps depend on keys in their original positions.)

To see complete sample output from each step of the conversion (as generated by the .NET utilities), unpack the archive here.

Running the Batch File

You may need to change the name of the original “NorthwindJSON.txt”  and/or may want to change the name of the final output file (from “JSONout7.txt”), but other than that the batch file should be ready to run.  To capture log and error output so it’s easier to read, use this command:

D:\divconq\dotnet>ConsolidateNorthwindJSON.bat > log.txt 2> err.txt

Next Steps

The next article in this series shows how to import the complete JSON document into Cassandra.

More About The Utilities

ConsolidateJSONObjects dives into a foreign table that expresses an N:N relationship, working from the point of view of one entity in the relationship.  It generates 0-N supercolumn-level JSON nodes for each foreign node discovered.  This utility is used once during the Northwind conversion, to bring in 0-N item entries per order.

consolidatejsonobjects NorthwindJSON.txt JSONout.txt Orders "Order Details" ItemEntry_[ID] "OrderID" yes yes 4

DistributeJSONObjects and DistributeJSONObjects2 are used to select foreign keys in the main table and look them up against a foreign table: they handle N:1 relationships and bring in exactly one node for each foreign key scanned (assuming DB consistency, of course).  The difference between them is that DistributeJSONObjects imports discovered nodes as supercolumn-level nodes, while DistributeJSONObjects2 imports the contents of discovered nodes as column-level name/value pairs under an existing supercolumn-level node.  These utilities are used five times during the Northwind conversion: two invocations of DistributeJSONObjects to import 1-per-order Employee and Customer nodes as supercolumns, and three invocations of DistributeJSONObjects2 to import the contents of 1-per-itementry Product, Supplier and Category nodes.

rem Bring in exactly one employee per order as a new supercolumn
distributejsonobjects JSONout.txt JSONout2.txt Orders "Employees" Employee "EmployeeID" yes yes 4
rem .
rem Bring in exactly one customer per order as a new supercolumn
distributejsonobjects JSONout2.txt JSONout3.txt Orders "Customers" Customer "CustomerID" yes yes 4
rem .
rem Bring in exactly one product entry per item entry as new columns in existing supercolumn
distributejsonobjects2 JSONout3.txt JSONout4.txt Orders ItemEntry_* Products Product_ ProductID yes yes 4
rem .
rem Bring in one supplier entry and one category entry per item entry as new columns in existing supercolumn
rem   (note these depend on previously imported Product entries)
distributejsonobjects2 JSONout4.txt JSONout5.txt Orders ItemEntry_* Suppliers Supplier_ Product_SupplierID yes yes 4
distributejsonobjects2 JSONout5.txt JSONout6.txt Orders ItemEntry_* Categories Category_ Product_CategoryID yes yes 4

DemoteJSONPairsIntoNewObject is simply used to clean up lingering columns (name/value pairs) at the “row” level and put them into a new supercolumn-level object.   One invocation of this utility is used during the Northwind conversion.

demotejsonpairsintonewobject JSONout6.txt JSONout7.txt Orders OrderInformation yes yes 4

All utilities will display additional help, including complete documentation for each parameter, if invoked without any command-line parameters.  (For example, simply execute “consolidatejsonobjects.exe” to find out what the eighth parameter – “yes” in the examples above – means.)

Troubleshooting

If you are experimenting with these utilities and data sets of any size, it is highly recommended that you put all your commands in a batch file.  Then, from the command prompt, run your batch file like this:

D:\divconq\dotnet>divconq_test23.bat > log.txt 2> err.txt

…to capture logging output and error output in two separate files.

 

Posted in Cassandra, Intermediate, JSON, Northwind

 

Why does Cassandra have data types?

04 Aug

You have probably noticed that Apache Cassandra, through Thrift, exposes its column names, column values and supercolumn names as binary. See the Thrift declarations:

struct Column {
   1: required binary name,
   2: required binary value,
   3: required Clock clock,
   4: optional i32 ttl,
}

struct SuperColumn {
   1: required binary name,
   2: required list<Column> columns,
}

On the other hand, the column family name, the row key and the keyspace name are all of type string. See the ColumnPath Thrift declaration:

struct ColumnPath {
    3: required string column_family,
    4: optional binary super_column,
    5: optional binary column,
}

And the ‘get’ method Thrift declaration:

ColumnOrSuperColumn get(1:required string keyspace,
                          2:required string key,
                          3:required ColumnPath column_path,
                          4:required ConsistencyLevel consistency_level=ONE)

Yet, when declaring a Column Family in the config file you do give a data type:

<ColumnFamily Name="Regular1" CompareWith="LongType" />

This means that the column names are of type long.
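
Under the covers, a LongType column name is just an 8-byte integer, and the nodes sort such columns numerically rather than lexically.  As a minimal sketch (Python, purely for illustration, and assuming the big-endian encoding LongType expects):

import struct

# Pack an integer into the 8-byte big-endian form a LongType
# comparator expects for a column name.
def long_column_name(n):
    return struct.pack(">q", n)

print(long_column_name(42).hex())   # 000000000000002a
# Sorted as longs, 2 comes before 10; sorted as strings, "10" precedes "2".
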
Read the rest of this entry »

 

Posted in Cassandra, Thrift

 

Migrate a Relational Database into Cassandra (Part II – Northwind Planning)

29 Jul

This article describes how to fit a relational database structure containing an N:N relationship and several N:1 relationships into a Cassandra-friendly structure.

You should be familiar with the concepts in my “Migrate a Relational Database Structure into a NoSQL Cassandra Structure (Part I)” article before proceeding.

In this article we are going to examine the “Northwind” database shipped as an example with Microsoft Access.  A local copy of this database is not needed to understand this article, but if you are interested in obtaining a copy (and/or want to see how to extract it to JSON), visit this article and then return.

Northwind Database Schema

The relationships of the various tables that make up the Northwind Access database can be summed up like this:

Customer }                            { Supplier
Employee } 1:N Order N:N Products N:1 { Category

In other words:

  • Every order is tied to a single customer and a single employee
  • Every product is tied to a single supplier and a single category
  • Every order can have a variety of different products
  • Every product can be part of a variety of different orders

Notice that I’m ignoring the “N” part of the customer, employee, supplier and category relationships for now.  I’m also ignoring the “Order Details” table that provides the links necessary to implement the “N:N” relationship depicted above.

Instead, I’m focusing on the important “N:N” relationship between order and product.  In a relational database setting I can often simply normalize away and worry about which table I need to focus my indexing efforts on later when I’m working in the application.  However, in NoSQL, non-relational database design, we often need to decide up front which entity most queries will be interested in and build everything else around that entity.

So…will it be “order” or “product”?  Today I’ll decide that the key entity in this database is “order” – customers will be hitting it on a daily, per-transaction basis, whereas I can probably run my product reports offline.

Sketching Out a Viable Cassandra Schema for the Northwind Database

OK, so “order” is our key element.  Let’s see how we can squeeze that into a Cassandra-friendly data structure.  First, let’s review Cassandra’s keyspace->column family->row->supercolumn->column data structure.  (If you need more of a refresher, see this.)

{
  "keystore" : {
    "column family" : {
      "row key" : {
        "supercolumn key" : {
          "column name" : "value"
        }
      }
    }
  }
}

Remember that the first two levels of this structure (keyspace->column family) will essentially be burned up by the name of the database and the name of the table.  That leaves us one level (row) for each “order” entry and one level (supercolumn) for each subcomponent of each “order” entry (such as a block of employee information).  The remaining level (column) has to be used for the name/value pairs that contain our information.

That’s a tight fit, but not an impossible one.  Let’s get started on the side of the relational diagram that only contains N:1 relationships.

Customer }
Employee } 1:N Order

Again, we’re going to ignore the fact that a single customer can be tied to multiple orders and that a single employee can also be tied to multiple orders.  Instead, we’re going to focus on the fact that a single order has one customer and one employee – period.

Here’s a data structure that accommodates that line of thinking.

{
  "Northwind" : {
    "Orders" : {
      "[OrderID]" : {
        "OrderInformation" : {
          "column name" : "value"
        },
        "Employee" : {
          "column name" : "value"
        },
        "Customer" : {
          "column name" : "value"
        }
      }
    }
  }
}

So far, so good.  But where did that extra “OrderInformation” entry come from?  The answer is that we demoted the columns from the original order record so that column name/value pairs are always at the same level in our Cassandra database.  (More on that.)

Tackling the N:N Relationship

Now what should we do about the “N:N” relationship from orders to products?

                       { Supplier
Order N:N Products N:1 { Category

We could call each N:N Order:Product relationship an “ItemEntry” and consider doing something like this:

{
  "Northwind" : {
    "Orders" : {
      "[OrderID]" : {
        "[ItemEntry_ID]" : {
          "ItemInformation" : {
            "column name" : "value"
          },
          "Product" : {
            "column name" : "value"
            "Category" : {
               "column name" : "value"
             },
            "Supplier" : {
               "column name" : "value"
             }
          }
        }
      }
    }
  }
}

…but that would be illegal because it requires two more levels than Cassandra allows (e.g., “Product > Category”).

Instead we need a structure that looks more like this:

{
  "Northwind" : {
    "Orders" : {
      "[OrderID]" : {
        "[ItemEntryID]_ItemInformation" : {
          "column name" : "value"
        },
        "[ItemEntryID]_Product" : {
          "column name" : "value"
        },
        "[ItemEntryID]_ProductSupplier" : {
          "column name" : "value"
        },
        "[ItemEntryID]_ProductCategory" : {
          "column name" : "value"
        }
      }
    }
  }
}

…or this:

{
  "Northwind" : {
    "Orders" : {
      "[OrderID]" : {
        "[ItemEntryID]" : {
          "ItemInformation_column1" : "value",
          "ItemInformation_column2" : "value",
          "Product_column1" : "value",
          "Product_column2" : "value",
          "ProductSupplier_column1" : "value",
          "ProductSupplier_column2" : "value",
          "ProductCategory_column1" : "value",
          "ProductCategory_column2" : "value"
        }
      }
    }
  }
}

It looks good on paper, but there’s an extra problem with the “Order Details” table we’re using to provide the N:N relationship in the Northwind database: there are no unique IDs on the Order Detail entries.  With that in mind, we’ll need to take special care during export/consolidation of legacy Order Detail data, and with new entries in the future, to automatically populate a unique ID for each Order Detail as we add it.  (This isn’t a big deal – we could either go with unique counts – e.g., 1, 2, 3… – within each Order entry, or commit to the use of GUIDs for the same purpose.)  This exercise uses unique IDs added to the original Northwind JSON document through the use of the “ID4Tables” option in the ExportAccessDBToJSON utility.

With all that in place, here is our final, Cassandra-ready data structure with an “order”-centric focus and all the information from our original Northwind relational structure.

{
  "Northwind" : {
    "Orders" : {
      "[OrderID]" : {
        "OrderInformation" : {
          "column name" : "value"
        },
        "Employee" : {
          "column name" : "value"
        },
        "Customer" : {
          "column name" : "value"
        },
        "ItemEntry_[ItemEntryID]" : {
            "ItemInformation_column1" : "value",
            "ItemInformation_column2" : "value",
            "Product_column1" : "value",
            "Product_column2" : "value",
            "ProductSupplier_column1" : "value",
            "ProductSupplier_column2" : "value",
            "ProductCategory_column1" : "value",
            "ProductCategory_column2" : "value"
        }
      }
    }
  }
}

Next Steps

Next I will show exactly how to use DivConq utilities to implement our plan and convert our Northwind JSON to a Cassandra-ready format.

 

Posted in Cassandra, Intermediate, Northwind