
Monthly Archives: August 2010

Gartner warns about increasing power of cloud vendors

In a recent publication (“Q&A SaaS FAQs” – August 23, 2010), Gartner analyst Ben Pring acknowledged the increasing power of cloud vendors and the diminishing power of the hosted application providers and IT departments that buy these services.

“Pure-play SaaS providers are now having to fight on the turf of the (Tier 1 ISV) incumbent, rather than the other way around.  Given that incumbents typically enjoy much-greater scale in marketing and sales and technical R&D capability, the next chapter of the SaaS story will be about the large Tier 1 incumbents determining how they want the evolution of the ‘as a service’ model to develop.”

“While demand-side considerations – that is, the desire by end users to deploy and manage applications in a lower-cost and more agile way – are of course important…the market ‘control’ that the large incumbent providers have cannot be overlooked.”

(My emphasis added.)

So what should developers, architects and IT leaders interested in cloud deployments do to avoid getting squeezed?  The answer is to understand the concept of “cloud escrow” and build on technology that can be migrated between cloud providers or even back into your own data center.

DivConq will continue to cover a number of these technologies (such as Cassandra) and provide advice and technology of its own to bridge gaps in a cloud-portable technology platform.  Stay tuned!

 

Posted by on 2010-Aug-08 in Uncategorized

Comments Off

What’s missing from Cassandra and Thrift?

If you’ve spent any time at all with Cassandra, you’ve gotten to know the Thrift interface that comes with that innovative NoSQL database.

As you may know, Thrift is a framework that allows you to serialize, transport and unpack data objects across a variety of development environments, including C++, Java, Python, PHP, Perl, C# and Cocoa.  Like Cassandra, it was developed by Facebook and then donated to Apache.  Though not explicitly a part of Thrift, JSON parallels many of these same concepts, offering structured object serialization across the same development environments (and then some).

Facebook has made it quite far in this world with these technologies, but there are some additional capabilities DivConq would like to see embedded in Cassandra to increase its acceptance in the broader development community.

First up are native web services.  Today you need a Thrift client on the remote end of a Cassandra connection to parse and send binary data streams.  You’re also encouraged to manipulate individual full-blown objects, when what you may really want is to query a range of keys in the back-end database.
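To make the idea concrete, here is a rough sketch (in Python, purely illustrative) of the kind of thin JSON-over-HTTP façade we would like to see offered natively.  The fetch_columns() function is a placeholder standing in for a real Thrift call; nothing below is part of Cassandra or Thrift itself.

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def fetch_columns(row_key):
    # Placeholder: a real facade would call the Thrift client here and
    # translate the returned binary columns into plain strings.
    return {"row": row_key, "example_column": "example_value"}

class RowHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        row_key = self.path.strip("/")                  # e.g., GET /Orders/10248
        body = json.dumps(fetch_columns(row_key)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("localhost", 8080), RowHandler).serve_forever()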

Second are sort orders.  The key sorts (on row keys and supercolumn names) in Cassandra are currently limited to a single type.  That works great if your keys really are all of the same type, but less well if you have a few numbers or other types you want to mix in.  (Several other NoSQL databases support sorts with mixed types; I know DivConq co-creator Andy White misses these too.)
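One illustrative workaround (again Python, only a sketch, not a Cassandra feature) is to encode mixed keys yourself into byte strings that sort the way you want when compared as raw bytes:

import struct

def encode_key(value):
    # Non-negative integers become 8-byte big-endian values, which keep
    # their numeric order when compared byte-by-byte; everything else is
    # treated as text.  The type prefix keeps the two groups apart.
    if isinstance(value, int):
        return b"num:" + struct.pack(">Q", value)
    return b"str:" + str(value).encode("utf-8")

keys = [10, 2, "apple", 7, "banana"]
print(sorted(keys, key=encode_key))   # [2, 7, 10, 'apple', 'banana']

It works, but it pushes type bookkeeping into every client, which is exactly the kind of burden we would rather see the database carry.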

Finally, there is the missing concept of “stored procedures,” which are quite common in relational databases.  The lack of this capability forces common, reusable data selection and manipulation operations back up into the application layer.  This in turn forces applications to get involved in “operations replication” in a distributed environment…and that’s really Cassandra’s job.

So…what to do?  Stay tuned to DivConq – we’ll be addressing all three of these challenges soon.

 

Posted by on 2010-Aug-08 in Cassandra, Thrift

Comments Off

Why relational DBs are not the best choice for logs

In my roles as developer, architect and then product manager for a couple of different server-based software packages, I found myself making the same mistake with log records over and over again in the late 1990s and into the 2000s.

Most of these server-based packages logged to a relational database table.  The most advanced packages used a single-gate, tamper-evident chain of hashes on log entries as they were written.  Write performance was OK, and read performance, especially across the 2-6 indices that might be defined on the log table, was usually good too…on a quiet system.

However, there were three problems with using a relational database table for my logging.

First, there was a blocking-insert problem.  Whenever something noteworthy happened on my systems (e.g., a sign-on, a file upload, an administrative configuration change), I logged it.  As long as the system wasn’t busy things were generally fine, but if a couple of different people hit me with extended periods of rapid file uploads or sign-ons/offs from unthrottled API clients, my software would shudder and sometimes thrash.

Second, there was an oversubscription problem: I added even more load onto the log database by using it heavily for common, interactive queries, such as looking back across the log for recent sign-ons.  While that sounds like a good idea because there would be only one authoritative set of records to check, it also magnified the effect of my blocking-insert problem.  (For example, if I got hit with a lot of sign-ons, the act of recording each sign-on in the log would block and slow the other sign-ons too.)

Finally, the most serious problem occurred when it was time to upgrade.  My upgrades often involved a schema change in the log database, which meant locking the database and updating all the log records – often tens of millions of them.  Too frequently this was an operation that took hours and consumed 100x more time than all the other upgrade operations combined.

So…what should I have done?  One answer would have been to look at non-relational NoSQL database technology (such as that available in Apache Cassandra) for my log tables instead.  That would have addressed:

  • the blocking insert problem: NoSQL databases, especially distributed NoSQL databases like Cassandra, are designed so that inserts do not block other operations.
  • the oversubscription problem: with inserts no longer blocking, the pile-up of reads waiting behind those inserts goes away too.
  • schema changes: NoSQL data stores accept data in varying formats, so old-schema and new-schema records can live side by side and no outage is needed to touch every existing record.  (The multiple schemas put a little more burden on the application to keep them straight, but they let the application handle multiple versions and/or upgrade old records in the background without downtime; see the sketch below.)

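Here is the coexistence sketch promised above (Python, illustrative only).  The write_log_row() function is a hypothetical stand-in for whatever client call actually performs the insert; the point is simply that version 1 and version 2 records can be written and read side by side:

import time
import uuid

def write_log_row(row_key, columns):
    # Placeholder: a real implementation would insert into the log store.
    print(row_key, columns)

# An "old schema" entry written before the upgrade...
write_log_row(str(uuid.uuid4()), {
    "version": "1",
    "timestamp": str(int(time.time())),
    "event": "signon",
    "user": "alice",
})

# ...and a "new schema" entry written after it, with extra columns.
# The application checks the "version" column when reading and can
# upgrade old rows lazily, in the background, without downtime.
write_log_row(str(uuid.uuid4()), {
    "version": "2",
    "timestamp": str(int(time.time())),
    "event": "signon",
    "user": "alice",
    "source_ip": "203.0.113.7",
    "auth_method": "password",
})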

Would I be alone in switching away from a relational database for this application?  No, I wouldn’t.  Today many, if not most, data warehouse products (such as Sybase IQ and Infobright) use “columnar” storage structures instead of traditional row-oriented tables, because those structures avoid the problems I just described.

SaaS-style services such as Reddit have also embraced non-relational database technology.  In a recent article, Todd Hoff recounts some of the same problems I encountered and lauds the NoSQL solution Reddit deployed instead.  (Click here to view the article; scroll to “Lesson 3: Open Schema.”)

  • “Schema updates are very slow when you get bigger. Adding a column to 10 million rows takes locks and doesn’t work.”
  • “When (Reddit adds) new features they (don’t) have to worry about the database anymore. They (don’t) have to add new tables for new things or worry about upgrades. Easier for development, deployment, maintenance.”
 

Posted by on 2010-Aug-08 in Cassandra

Comments Off

Google leader says cloud deployment is not the complete answer

At first I was tempted not to post the analysis performed by Google’s Vijay Gill, which concludes that an on-premises deployment may be cheaper than an Amazon cloud deployment if usage is high and constant.

It’s not that I disagree; it’s that the conclusion seems quite obvious to anyone with any operational background.

In general, if demand is constant and predictable, it makes sense to apply fixed resources, such as in-house servers or full-time employees, against the problem.  If demand is variable or unpredictable, it makes sense to invest more in variable resources such as use-as-you-go cloud capacity and seasonal employees.

But ultimately I decided to post Gill’s analysis because it allows me to remind people that most demand models have both a fixed and variable component.
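To put some (entirely invented) numbers behind that reminder, here is a toy cost model in Python.  The prices and server counts are made up; only the shape of the comparison matters:

# Demand with a constant base plus occasional peaks (all numbers invented).
base_servers = 10            # needed around the clock
peak_servers = 30            # needed only ~10% of the time
peak_fraction = 0.10

onprem_monthly = 300.0       # hypothetical fully loaded cost per in-house server
cloud_monthly = 650.0        # hypothetical cost per cloud instance if run 24x7

all_onprem = (base_servers + peak_servers) * onprem_monthly
all_cloud = (base_servers + peak_servers * peak_fraction) * cloud_monthly
hybrid = (base_servers * onprem_monthly
          + peak_servers * peak_fraction * cloud_monthly)

print("all on-premises:", all_onprem)   # pays for idle peak capacity
print("all cloud:      ", all_cloud)    # pays a premium on the constant base
print("hybrid:         ", hybrid)       # fixed base in-house, bursts in the cloud

With these made-up numbers the constant base is cheaper in-house, the spiky peak is cheaper in the cloud, and the hybrid option beats both pure models.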

Furthermore, if fixed on-premises and variable cloud deployment models are the resources of the future, doesn’t it make sense to design hybrid applications today that span across both types of resources: your in-house datacenters and the cloud?

Elastic architecture provides you the scalability you need to span; cloud escrow provides you the ability to choose your deployment model.  Stay tuned to DivConq as we continue to explore these concepts and the technologies behind them.

 

Posted by on 2010-Aug-08 in Amazon EC2, Cloud

Comments Off

DivConq to present at Pecha Kucha Happy Hour today

Jonathan Lampe and Andy White will be presenting “Next Generation Web Servers” at the popular IT-oriented Pecha Kucha Happy Hour at the University of Wisconsin today.  The Pecha Kucha format involves 20 slides, each displayed for only 20 seconds, and the presentation will quickly cover the roles of Nginx, Jetty, Kayak and lighttpd as logical replacements for IIS and Apache in an elastic architecture.

If you cannot see this presentation in person, you may want to view an earlier recording of this topic here, or read the original “Web Server Threading Models” article.

 

Posted by on 2010-Aug-08 in DivConq, Elastic Architecture, Events, Jetty, Kayak, lighttpd, Nginx

Comments Off

Are Amazon’s various “elastic” services elements of an elastic architecture?

If you’ve looked at Amazon’s published web services lately, you’ve noticed that there are a number of “elastic” services available from their cloud.  These include Elastic Block Store (EBS), Elastic Compute Cloud (EC2), Elastic MapReduce and Elastic Load Balancing.

Amazon’s definition of “elastic” centers on massive horizontal scaling (up to thousands of instances) of “as many or as few compute instances running (your application) as you want.”  Startup of additional elastic instances is a manual process (though it can be largely automated through APIs), and startup time can range from minutes (e.g., for EC2 and MapReduce instances) to longer periods (e.g., for EBS and Load Balancing).
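For example, here is a minimal sketch of that API automation using the classic boto library for Python (details vary by boto version, and the AMI ID, key pair and security group below are placeholders, not real values):

import boto

conn = boto.connect_ec2()               # reads AWS credentials from the environment
reservation = conn.run_instances(
    "ami-00000000",                     # placeholder AMI ID
    min_count=1,
    max_count=1,
    instance_type="m1.small",
    key_name="my-keypair",              # placeholder key pair name
    security_groups=["default"],
)

for instance in reservation.instances:
    print(instance.id, instance.state)  # usually "pending" for a few minutes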

Unfortunately, Amazon has not been entirely consistent with its “elastic” tag.  Although all four of those services fit this definition, there are several other services that belong in a discussion of elastic architecture even though they do not carry the name, such as SimpleDB and Simple Queue Service (SQS).  That said, there are elastic choices at most architectural tiers of Amazon’s portfolio: raw block access (EBS), logical storage (S3), database storage (SimpleDB and RDS), messaging (SQS and SNS) and the start of the application layer (EC2).

So…if Amazon’s got a comprehensive elastic stack, what’s not to like?  Two things*:

1) The threat of vendor lock-in.  If you want to move your SimpleDB-based application to Rackspace someday, how are you going to do that?  If Amazon decides to raise usage rates 50% year-over-year down the road, what are you going to do about it?

2) The threat of cloud reversal.  If you get a new CIO, face new regulations, acquire or divest major IT resources, or just need to fill up some idle racks, how can you pull some or all of your application down from the cloud and back into on-premises datacenters?

What’s the answer to these threats?  First, understand the concept of “cloud escrow”.  Second, learn how to build around a cloud-portable elastic architecture and how that helps you negotiate with cloud vendors.  Third, keep reading DivConq.com as we continue to explore these issues!

* = OK, in the “not to like” department there’s also the manual start-up of elastic instances and a weak application/presentation layer, but there are ways around those; we’ll cover them on DivConq in a future article.

 

Posted by on 2010-Aug-08 in Amazon EC2, Elastic Architecture, Other Organizations

Comments Off

Migrate a Relational Database into Cassandra (Part III – Northwind Conversion)

This article describes how to convert a JSON document containing the complete contents of the Northwind Access database into Cassandra-friendly JSON.

Before proceeding, you should understand my previous “Part 2” article on “Northwind Planning” – this article implements that plan.

You should also have downloaded and installed either the .NET or Mono version of the DivConq JSON command-line utilities, and you should have a complete JSON document from a conversion of the Northwind database export.  (You can also start with the “NorthwindJSON.txt” document from this archive.)

What To Do?

Copy the following batch file into the folder that contains your DivConq command-line utilities and your Northwind JSON document.  Then open it in your favorite text editor and skip down to the “Understanding the Flow” section below.

@echo off
rem xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
rem X
rem X This batch file uses several divconq.com utilities.  It takes a JSON export of
rem X an Access DB called "Northwind" and manipulates the JSON structure until it
rem X conforms to a denormalized form suitable for insertion into a Cassandra DB.
rem X (Exporting the Access DB and importing the result into Cassandra are covered
rem X in other articles in this series.)
rem X
rem X August 11, 2010 - Jonathan Lampe - jonathan.lampe@divconq.com
rem X
rem xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
rem .
rem Here is the basic schema of the original DB
rem .
rem Categories
rem Customers
rem Employees
rem Order Details
rem   ProductID
rem   OrderID
rem Orders
rem   CustomerID
rem   EmployeeID
rem Products
rem   SupplierID
rem   CategoryID
rem Suppliers
rem .
rem xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
rem .
rem Here is the data schema represented with 1:N and N:1 relationships
rem
rem    Customer }                             { Supplier
rem    Employee } 1:N Orders N:N Products N:1 { Category
rem
rem ...where the N:N relationship is expressed in the "Order Details" table
rem
rem xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
rem .
rem Here is the denormalized, Cassandra-friendly schema that is constructed:
rem .
rem Keyspace=Northwind
rem   ColumnFamily=Orders
rem     RowKey=OrderID
rem       1 SuperColumn="OrderInformation"
rem         Columns=(contents from one Orders entry)
rem       1 SuperColumn="Customer"
rem         Columns=(contents from one Customers entry)
rem       1 SuperColumn="Employee"
rem         Columns=(contents from one Employees entry)
rem       0-N SuperColumns="ItemEntry_"+[NumericCount]
rem         Columns=(contents from one Order Details entry) +
rem                 (contents from one Product entry) +
rem                 (contents from one Supplier entry) +
rem                 (contents from one Category entry)
rem .
rem xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
rem .
rem Bring in 0-N item entries per order, each as a new supercolumn
consolidatejsonobjects NorthwindJSON.txt JSONout.txt Orders "Order Details" ItemEntry_[ID] "OrderID" yes yes 4
rem .
rem Bring in exactly one employee per order as a new supercolumn
distributejsonobjects JSONout.txt JSONout2.txt Orders "Employees" Employee "EmployeeID" yes yes 4
rem .
rem Bring in exactly one customer per order as a new supercolumn
distributejsonobjects JSONout2.txt JSONout3.txt Orders "Customers" Customer "CustomerID" yes yes 4
rem .
rem Bring in exactly one product entry per item entry as new columns in existing supercolumn
distributejsonobjects2 JSONout3.txt JSONout4.txt Orders ItemEntry_* Products Product_ ProductID yes yes 4
rem .
rem Bring in one supplier entry and one category entry per item entry as new columns in existing supercolumn
rem   (note these depend on previously imported Product entries)
distributejsonobjects2 JSONout4.txt JSONout5.txt Orders ItemEntry_* Suppliers Supplier_ Product_SupplierID yes yes 4
distributejsonobjects2 JSONout5.txt JSONout6.txt Orders ItemEntry_* Categories Category_ Product_CategoryID yes yes 4
rem .
rem Demote remaining rowkey-columns into a new supercolumn
rem .
demotejsonpairsintonewobject JSONout6.txt JSONout7.txt Orders OrderInformation yes yes 4

Understanding the Flow

The top part of the batch file simply reiterates what we planned in the previous article.  The end is where all the processing actually takes place.  Interspersed among the “rem” comment lines are 7 JSON conversions, each performing a single piece of the consolidation.

Step 1: We open the list of “Order Details” and pull in zero, one or multiple entries as new “ItemEntry_[ID]” nodes.

Steps 2 and 3: We reach out and import exactly one Customer node and one Employee node into our master Order record.

(Note that Steps 1, 2 and 3 can really occur in any order because these three entities are completely independent.  However, “Order Details” must have been imported before Step 4 can take place.)

Step 4: We reach out and import the contents of exactly one Product node into each existing ItemEntry node by looking up ProductIDs from existing ItemEntry name/value pairs.

(Note that Step 4 – the Product import – must take place before Steps 5-6 can take place.)

Steps 5-6: We reach out and import the contents of exactly one Category node and exactly one Supplier node into each existing ItemEntry node by looking up CategoryIDs and SupplierIDs from recently imported ItemEntry > Product information.

Step 7: Consolidate the remaining entries under the main Order entry into a new OrderInformation node.

(Step 7 – the consolidation of hanging entries – should always go last in case previous steps depend on keys in their original positions.)
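To make the target shape concrete before you run anything, here is a sketch of roughly what one denormalized Order row looks like after Step 7, written as a Python dictionary.  The field names follow the schema comments in the batch file; the values are illustrative, not actual output from the utilities:

# Illustrative only: the rough shape of one denormalized Order row after Step 7.
order_row = {                       # RowKey = OrderID
    "10248": {
        "OrderInformation": {       # SuperColumn created in Step 7
            "OrderID": "10248",
            "OrderDate": "1996-07-04",
            "ShipCity": "Reims",
        },
        "Customer": {               # SuperColumn imported in Step 3
            "CustomerID": "VINET",
            "CompanyName": "Vins et alcools Chevalier",
        },
        "Employee": {               # SuperColumn imported in Step 2
            "EmployeeID": "5",
            "LastName": "Buchanan",
        },
        "ItemEntry_1": {            # 0-N SuperColumns built in Steps 1 and 4-6
            "ProductID": "11",
            "Quantity": "12",
            "Product_ProductName": "Queso Cabrales",
            "Product_SupplierID": "5",
            "Supplier_CompanyName": "Cooperativa de Quesos 'Las Cabras'",
            "Category_CategoryName": "Dairy Products",
        },
    },
}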

To see complete sample output from each step of the conversion (as generated by the .NET utilities), unpack the archive here.

Running the Batch File

You may need to change the name of the original “NorthwindJSON.txt” input file and/or may want to change the name of the final output file (from “JSONout7.txt”), but other than that the batch file should be ready to run.  To capture log and error output so they are easier to read, use this command:

D:\divconq\dotnet>ConsolidateNorthwindJSON.bat > log.txt 2> err.txt

Next Steps

The next article in this series shows how to import the complete JSON document into Cassandra.

More About The Utilities

ConsolidateJSONObjects dives into a foreign table that expresses an N:N relationship from the point of view of one entity in that relationship.  It generates one supercolumn-level JSON node for each foreign node discovered (0-N per main-table row).  This utility is used once during the Northwind conversion to bring in 0-N item entries per order.

consolidatejsonobjects NorthwindJSON.txt JSONout.txt Orders "Order Details" ItemEntry_[ID] "OrderID" yes yes 4

DistributeJSONObjects and DistributeJSONObjects2 are used to select foreign keys in the main table and look them up against a foreign table: they handle N:1 relationships and bring in exactly one node for each foreign key scanned (assuming DB consistency, of course).  The difference between the two is that DistributeJSONObjects imports discovered nodes as supercolumn-level nodes, while DistributeJSONObjects2 imports the contents of discovered nodes as column-level name/value pairs under an existing supercolumn-level node.  These utilities are used five times during the Northwind conversion: two invocations of DistributeJSONObjects to import the 1-per-order Employee and Customer nodes as supercolumns, and three invocations of DistributeJSONObjects2 to import the contents of the 1-per-item-entry Product, Supplier and Category nodes.

rem Bring in exactly one employee per order as a new supercolumn
distributejsonobjects JSONout.txt JSONout2.txt Orders "Employees" Employee "EmployeeID" yes yes 4
rem .
rem Bring in exactly one customer per order as a new supercolumn
distributejsonobjects JSONout2.txt JSONout3.txt Orders "Customers" Customer "CustomerID" yes yes 4
rem .
rem Bring in exactly one product entry per item entry as new columns in existing supercolumn
distributejsonobjects2 JSONout3.txt JSONout4.txt Orders ItemEntry_* Products Product_ ProductID yes yes 4
rem .
rem Bring in one supplier entry and one category entry per item entry as new columns in existing supercolumn
rem   (note these depend on previously imported Product entries)
distributejsonobjects2 JSONout4.txt JSONout5.txt Orders ItemEntry_* Suppliers Supplier_ Product_SupplierID yes yes 4
distributejsonobjects2 JSONout5.txt JSONout6.txt Orders ItemEntry_* Categories Category_ Product_CategoryID yes yes 4

DemoteJSONPairsIntoNewObject is simply used to clean up lingering columns (name/value pairs) at the “row” level and put them into a new supercolumn-level object.   One invocation of this utility is used during the Northwind conversion.

demotejsonpairsintonewobject JSONout6.txt JSONout7.txt Orders OrderInformation yes yes 4

All utilities will display additional help, including complete documentation for each parameter, if invoked at the command line without any parameters.  (For example, simply execute “consolidatejsonobjects.exe” to find out what the eighth parameter – “yes” in the example above – means.)

Troubleshooting

If you are experimenting with these utilities and data sets of any size, it is highly recommended that you put all your commands in a batch file.  Then, from the command prompt, run your batch file like this:

D:\divconq\dotnet>divconq_test23.bat > log.txt 2> err.txt

…to capture logging output and error output in two separate files.

 

Posted by on 2010-Aug-08 in Cassandra, Intermediate, JSON, Northwind

Comments Off

Elastic Architecture

“Elastic architecture” is a concept you will read about more frequently as time goes on. It refers to computer architecture designed so that components with different roles in different tiers of an application can each intelligently (and elastically) scale up or down to meet processing requirements.

We are not the first people to name this concept. Yahoo.com’s Eran Hammer-Lahav talked about elastic architecture in an August 2007 blog post. In this post he discussed two intersecting themes: applications that could scale themselves, and tiered deployments that rely on a mix of caching, acceleration and replication to keep up with the layers that are horizontally scaling to meet the current load.

Software architect and trainer Simon Brown also came close to naming this concept in a May 2008 blog post. In this post he talked about a “cloud (that) could migrate your data/apps automagically, depending on where they were being accessed from”. This certainly seems like an application that would require multiple layers to intelligently scale horizontally in multiple geographic locations; that’s an example of elastic architecture.

As you probably know by now, DivConq’s main goal is to promote the adoption of highly scalable, cloud-portable technology in multiple tiers of an application. (For example, using Cassandra as the data store at the same time you’re using an application layer built on an array of high-throughput web servers.) Now that goal has a name and we’re proud to promote the adoption of elastic architecture throughout the IT industry.

 

Posted by on 2010-Aug-08 in Cloud, Elastic Architecture

Comments Off

Next Generation Web Servers: The Video!

The information Andy White presented in his “Web Server Threading Models” article is now also available in a recorded “PechaKucha style” presentation entitled “Beyond IIS and Apache: Next Generation Web Servers (nginx, kayak, jetty, etc.)” on YouTube.

http://www.youtube.com/watch?v=jsxynVpOnI0 (6 minutes)

 

Posted by on 2010-Aug-08 in Introduction, Jetty, Kayak, lighttpd, Nginx

Comments Off

Why does Cassandra have data types?

You have probably noticed that Apache Cassandra, through Thrift, exposes its column names, column values and supercolumn names as binary. See the Thrift declarations:

struct Column {
   1: required binary name,
   2: required binary value,
   3: required Clock clock,
   4: optional i32 ttl,
}

struct SuperColumn {
   1: required binary name,
   2: required list<Column> columns,
}

On the other hand, the column family name, the row key and the keyspace name are all of type string. See the ColumnPath Thrift declaration:

struct ColumnPath {
    3: required string column_family,
    4: optional binary super_column,
    5: optional binary column,
}

And the ‘get’ method Thrift declaration:

ColumnOrSuperColumn get(1:required string keyspace,
                          2:required string key,
                          3:required ColumnPath column_path,
                          4:required ConsistencyLevel consistency_level=ONE)

Yet, when declaring a Column Family in the config file, you do give a data type:

<ColumnFamily Name="Regular1" CompareWith="LongType" />

That means the column names in that column family are of type long.
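In practical terms, the comparator tells Cassandra how to validate and sort the binary column names you send over Thrift.  With LongType, each column name is expected to be an 8-byte big-endian long, which a client can pack and unpack like this (a Python sketch; the Thrift call that would actually send the name is omitted):

import struct

def pack_long_name(n):
    return struct.pack(">q", n)          # 8 bytes, big-endian, signed

def unpack_long_name(raw):
    return struct.unpack(">q", raw)[0]

raw = pack_long_name(1281481200)         # e.g., a timestamp used as a column name
print(len(raw), unpack_long_name(raw))   # -> 8 1281481200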

 

Posted by on 2010-Aug-08 in Cassandra, Thrift

Comments Off