
Archive for the ‘Beginner’ Category

Introduction to Cassandra Columns, Super Columns and Rows

21 Jul

This article provides new users the basics they need to understand Cassandra’s “column / super column / row” data model.

Though the focus is not on mechanics, this article assumes you are familiar with adding columns to and requesting data from existing keyspaces on Cassandra; if not, please see my earlier article on that topic. Knowledge of JSON is also important for understanding the data examples below – again, see my earlier article if you need help there.

Remember that a Cassandra column is basically a “name=value” pair* (e.g., “color=red”).  You can use multiple columns to represent data such as:

    "Price" : "29.99",
    "Section" : "Action Figures"

As you may have seen in my previous article, multiple columns can also be grouped into Cassandra “rows” to handle data such as:

{
  "Transformer" : {
    "Price" : "29.99",
    "Section" : "Action Figures"
  },
  "GumDrop" : {
    "Price" : "0.25",
    "Section" : "Candy"
  },
  "MatchboxCar" : {
    "Price" : "1.49",
    "Section" : "Vehicles"
  }
}

The keys used to group related columns into rows in this example were “Transformer”, “GumDrop” and “MatchboxCar”.

In my earlier article, we looked at this data structure:

In JSON, this keyspace->column family->row->column data structure would be represented like this:

{
  "ToyStore" : {
    "Toys" : {
      "GumDrop" : {
        "Price" : "0.25",
        "Section" : "Candy"
      },
      "Transformer" : {
        "Price" : "29.99",
        "Section" : "Action Figures"
      },
      "MatchboxCar" : {
        "Price" : "1.49",
        "Section" : "Vehicles"
      }
    }
  },
  "Keyspace1" : null,
  "system" : null
}

If you wanted to add other types of unrelated collections of information (e.g., “BugCollection” or “PaintColors”), you would simply keep adding new keyspaces, one for each new collection.  However, if you needed to keep track of similar collections of data (e.g., your Ohio and New York toy stores instead of a single toy store), you would need to turn to a different kind of Cassandra element: the “super column”.

To see super columns in action, inspect this keyspace->column family->row->super column->column data structure as it appears in JSON:

{
  "ToyCorporation" : {
    "ToyStores" : {
      "Ohio Store" : {
        "Transformer" : {
          "Price" : "29.99",
          "Section" : "Action Figures"
        },
        "GumDrop" : {
          "Price" : "0.25",
          "Section" : "Candy"
        },
        "MatchboxCar" : {
          "Price" : "1.49",
          "Section" : "Vehicles"
        }
      },
      "New York Store" : {
        "JawBreaker" : {
          "Price" : "4.25",
          "Section" : "Candy"
        },
        "MatchboxCar" : {
          "Price" : "8.79",
          "Section" : "Vehicles"
        }
      }
    }
  }
}

This data could also be visualized like this:

Given its late appearance, you might expect that “Ohio Store” and “New York Store” would represent super columns that span multiple rows.  However, the opposite is true: “Ohio Store” and “New York Store” are now the row keys, and entries like “Transformer”, “GumDrop” and “MatchboxCar” have become super column keys.

Like column keys, super column keys are indexed and sorted by a specific type (e.g., “UTF8Type”, “AsciiType”, “LongType”, “BytesType”, etc.).  However, like row keys, super column entries have no values of their own; they are simply used to collect other columns.

Notice that the keys of the two groups of super columns do not match ({“Transformer”, “GumDrop”, “MatchboxCar”} versus {“JawBreaker”, “MatchboxCar”}).  This is not an error: super column keys in different rows do not have to match, and often will not.
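To make this concrete, here is a small Python sketch – nested dictionaries standing in for the JSON above, not actual Cassandra client code – showing that each row carries its own, possibly different, set of super column keys:

```python
# Nested dicts standing in for the ToyCorporation JSON above:
# row key -> super column key -> column name -> value
toy_stores = {
    "Ohio Store": {
        "Transformer": {"Price": "29.99", "Section": "Action Figures"},
        "GumDrop": {"Price": "0.25", "Section": "Candy"},
        "MatchboxCar": {"Price": "1.49", "Section": "Vehicles"},
    },
    "New York Store": {
        "JawBreaker": {"Price": "4.25", "Section": "Candy"},
        "MatchboxCar": {"Price": "8.79", "Section": "Vehicles"},
    },
}

# The super column keys under each row key need not match:
for row_key, super_columns in toy_stores.items():
    print(row_key, "->", sorted(super_columns))
# The two rows share only "MatchboxCar".
```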

In a future article I will describe how to pass JSON structures that describe data in either row/column or row/super column/column format into and out of Cassandra – stay tuned!

* = We’ll ignore the timestamp element of Cassandra columns for now.  These timestamps are used to reconcile updates from multiple nodes, but don’t worry about that until you understand the whole column/supercolumn thing first.

 

Posted in Beginner, Cassandra, JSON

 

Migrate a Relational Database Structure into a NoSQL Cassandra Structure (Part I)

20 Jul

This article begins to explore how to migrate a relational database structure (tables linked by foreign keys) into a NoSQL database structure that can be used in Cassandra.

If you do not know what Cassandra is, why NoSQL and Cassandra are important technologies or what JSON is and why you should know it, please click the links in this sentence to learn more about each topic before proceeding.

The Original Relational Database Structure

We are going to start with a very simple 1:N relational database structure. Our first two tables are “forests” and “famoustrees”.  Here is our data in tabular format:

forests:

forestID  | name           | trees         | bushes
----------|----------------|---------------|--------------
forest003 | Black Forest   | two million   | three million
forest045 | 100 Acre Woods | four thousand | five thousand
forest127 | Lonely Grove   | none          | one hundred

famoustrees:

treeID    | forestID  | name             | species
----------|-----------|------------------|------------
tree12345 | forest003 | Der Tree         | Red Oak
tree12399 | forest045 | Happy Hunny Tree | Willow
tree32345 | forest003 | Das Ubertree     | Blue Spruce

“famoustrees” is linked to “forests” using the “forestID” foreign key.  Notice that there are no famous trees in the “Lonely Grove” forest, one famous tree in the “100 Acre Woods” and two famous trees in the “Black Forest”.

If we were to represent the data in our database – call it our “biologicalfeatures” database – in JSON, it would look like this:

{
  "biologicalfeatures":
    {
    "forests" :
      {
      "forest003" :
        {
          "name" : "Black Forest",
          "trees" : "two million",
          "bushes" : "three million"
        },
      "forest045" :
        {
          "name" : "100 Acre Woods",
          "trees" : "four thousand",
          "bushes" : "five thousand"
        },
      "forest127" :
        {
          "name" : "Lonely Grove",
          "trees" : "none",
          "bushes" : "one hundred"
        }
      },
    "famoustrees" :
      {
      "tree12345" :
        {
          "forestID" : "forest003",
          "name" : "Der Tree",
          "species" : "Red Oak"
        },
      "tree12399" :
        {
          "forestID" : "forest045",
          "name" : "Happy Hunny Tree",
          "species" : "Willow"
        },
      "tree32345" :
        {
          "forestID" : "forest003",
          "name" : "Das Ubertree",
          "species" : "Blue Spruce"
        }
      }
    }
}

Denormalizing the Tables

To collapse the famoustrees table into our forests table, we need to move each famoustrees entry underneath its forest entry.  We can also remove the “forestID” foreign key from each famoustrees entry – we don’t need it anymore.

However, we should retain the type of each famoustree entry we moved into the forest entry.  We can do this by adding an extra “type” value to each entry.

Finally, we could break out the original non-ID information in each forest entry into a typed section too.  We’ll tag each of these sections with a new ID of “generalinfo”.  (This is a Cassandra-friendly convention – we’ll get into this more below.)
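The three steps above can be sketched in Python. This is a hypothetical helper of my own (the function and variable names are mine, not part of any Cassandra tool), shown here with one forest and one tree from our sample data:

```python
def denormalize(forests, famous_trees):
    """Collapse a 1:N pair of 'tables' (dicts keyed by ID) into one
    nested structure, following the three steps described above."""
    result = {}
    # Break each forest's own fields out into a "generalinfo" section.
    for forest_id, info in forests.items():
        result[forest_id] = {"generalinfo": dict(info)}
    # Move each tree under its forest, dropping the foreign key and
    # adding a "type" tag so we still know what kind of entry it was.
    for tree_id, tree in famous_trees.items():
        entry = dict(tree)
        forest_id = entry.pop("forestID")  # foreign key removed
        result[forest_id][tree_id] = {"type": "famoustree", **entry}
    return result

forests = {
    "forest003": {"name": "Black Forest", "trees": "two million",
                  "bushes": "three million"},
}
famous_trees = {
    "tree12345": {"forestID": "forest003", "name": "Der Tree",
                  "species": "Red Oak"},
}
denormalized = denormalize(forests, famous_trees)
print(denormalized["forest003"]["tree12345"])
# -> {'type': 'famoustree', 'name': 'Der Tree', 'species': 'Red Oak'}
```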

Represented in JSON, our data now looks like this:

{
  "biologicalfeatures":
    {
    "forests" :
      {
      "forest003" :
        {
        "generalinfo" :
          {
          "name" : "Black Forest",
          "trees" : "two million",
          "bushes" : "three million"
          },
        "tree12345" :
          {
            "type" : "famoustree",
            "name" : "Der Tree",
            "species" : "Red Oak"
          },
        "tree32345" :
          {
            "type" : "famoustree",
            "name" : "Das Ubertree",
            "species" : "Blue Spruce"
          }
        },
      "forest045" :
        {
        "generalinfo" :
          {
          "name" : "100 Acre Woods",
          "trees" : "four thousand",
          "bushes" : "five thousand"
          },
        "tree12399" :
          {
            "type" : "famoustree",
            "name" : "Happy Hunny Tree",
            "species" : "Willow"
          }
        },
      "forest127" :
        {
        "generalinfo" :
          {
          "name" : "Lonely Grove",
          "trees" : "none",
          "bushes" : "one hundred"
          }
        }
      }
    }
}

Ready for Cassandra?

There are really only two types of JSON data structures that can be imported directly into Cassandra.  One is the
keyspace->columnfamily->rowkey->column
data structure shown below:

{
  "keyspace":
    {
    "columnfamily" :
      {
      "rowkey" :
        {
          "column name" : "column value"
        }
      }
    }
}

Add another layer and you get the other supported data structure,
keyspace->columnfamily (a.k.a. “supercolumnfamily”)->rowkey->supercolumn (a.k.a. “subcolumn”)->column,
shown below:

{
  "keyspace":
    {
    "columnfamily" :
      {
      "rowkey" :
        {
        "supercolumn" :
          {
          "column name" : "column value"
          }
        }
      }
    }
}

That’s it: if you can get your data to fit into one of those two JSON structures, your data is ready to be input into Cassandra.

You probably suspect that I wouldn’t have taken you this far if our forests data wasn’t ready for Cassandra, but please take a moment to scroll up and see if you can figure out whether our denormalized forests data uses supercolumns or not.

Let’s break it down:
biologicalfeatures -> forests
…matches the keyspace->columnfamily structure used by both supported JSON structures.

As for the rest:
forest003 -> generalinfo -> (name=”Black Forest”)
…matches the rowkey->supercolumn->column structure used by the “supercolumn” supported JSON structure.

So, yes, we had to use supercolumns to denormalize the forests and famoustrees tables properly.
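One mechanical way to run this check is to count nesting levels: the keyspace->columnfamily->rowkey->column structure is four dictionaries deep, while the supercolumn variant is five. Here is a rough Python sketch of that idea – a helper of my own that assumes uniform nesting, not a Cassandra tool:

```python
def nesting_depth(node):
    """Depth of nested dicts until we hit plain string values."""
    if not isinstance(node, dict):
        return 0
    return 1 + max((nesting_depth(v) for v in node.values()), default=0)

def classify(structure):
    """Classify a JSON-like structure by its nesting depth."""
    depth = nesting_depth(structure)
    if depth == 4:
        return "rowkey->column"
    if depth == 5:
        return "rowkey->supercolumn->column"
    return "not directly importable"

# Our original ToyStore data is four levels deep:
print(classify({"ToyStore": {"Toys": {"GumDrop": {"Price": "0.25"}}}}))
# -> rowkey->column
```

Running `classify` on the denormalized forests structure above would report the five-level supercolumn form.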

Next Steps

Next we’ll perform this type of analysis on the Northwind JSON structure exported in a previous article.

Doing this type of denormalization by hand would be a large PITA, so DivConq created a utility to do it automatically. The article after that shows how to use that DivConq utility and a few more like it to complete the conversion of the Northwind JSON into a Cassandra-ready format.

Soon after that I will also cover how to import the final JSON data directly into Cassandra – stay tuned!

 

Posted in Beginner, Cassandra, nosql

 

Web Server Threading Models

18 Jul

Being very performance minded, I’d like to take a tangent into server performance. Server performance is tightly connected to the choice of threading model. In particular, I’m referring to the threading model used for dynamic web pages: PHP, ASP, JSP, Rails, etc. About the worst thing you can do for performance is to have a thread just waiting. Simply throwing more threads at the problem does not resolve it, though it can help. Preventing any waits at all is the superior solution by far.

At a high level there are three thread modeling choices in server design:

1) A single thread reads the request, then calls the script that generates the page and waits for the script to complete, after which any final writes are completed. The thread stays with the request from the first byte read to the last byte written. There are ways to make this model even less efficient, but let’s leave it at that.

This approach can leave the thread idle for long stretches while it waits for request buffers to arrive and waits for ACKs when writing response buffers.

There is also an inherent issue in that all the data about the request must be collected up front before the script is called. This means waiting for all the request buffers to be delivered and, in the case of large requests (file uploads), caching the request on disk.

There are two problems with that:

a) memory (RAM or disk) is being filled up by request data and often by response data too

b) data that gets on disk may be vulnerable since it will not typically be encrypted

2) Another approach – called async I/O – is to keep a pool of threads that is used to read requests one buffer at a time. A thread is active only long enough to process that one buffer, and then it goes back into the pool. Once the request is completely assembled, a script is called and the server waits for the script to complete.

This approach gets past the problem of waiting on reads and writes (when response buffers are used), but it still has the memory issues, the security issues and the wait on the script. Another way to state that: the script *must* return a complete response, and the server dedicates at least one thread to the script for the entire duration of response generation.

An exception is when response buffering is not used; however, in that case the async I/O benefits are lost for the output.

Note the server’s thread line has holes in it now – a good thing because it frees resources.

3) Another approach – called full async – uses a pool of async I/O threads for reading and writing to the client. However, it also allows for the script generating the response to be fully async.

Note that another hole appears during the run of the script, because the script has made an async operation itself. This capability is very important now that we use web services so often; it makes no sense to block a thread while calling a web service.
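The idea can be illustrated with a minimal sketch in Python’s asyncio (my illustration only; none of the servers discussed here are written in Python, and `call_web_service` is a stand-in I invented): while one handler awaits a slow “web service”, the thread is free to run other handlers.

```python
import asyncio

async def call_web_service(name, delay):
    # Stand-in for a remote web service call; the await hands the
    # thread back to the event loop instead of blocking it.
    await asyncio.sleep(delay)
    return f"{name}: done"

async def handle_request(name):
    # A "script" that performs an async operation of its own.
    return await call_web_service(name, 0.05)

async def main():
    # Three "requests" are served concurrently on a single thread.
    return await asyncio.gather(
        handle_request("req1"), handle_request("req2"), handle_request("req3")
    )

print(asyncio.run(main()))
```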

The most evolved examples of full async are perhaps better named streaming async. Streaming async gets around the problem of caching files for upload or download – while still retaining the async I/O model.

I myself have developed a couple of streaming (full) async servers at my work; unfortunately, they are not open source. At some point I’ll discuss streaming async in more detail.

The full async server concept is gaining ground. One example is the newcomer Node.js, which does not send a response until you call ‘end’ on the response object, meaning the script’s thread can exit without the response being sent. This is not at all like PHP, ASP or other standard web servers.

Another example is the upcoming Ruby on Rails version 3; check out the async_sinatra and async_rails projects. They go so far as to make even database calls async (a great idea!). Keep in mind my articles about Nginx when checking out Thin.

Yet another example is the Kayak HTTP Server’s responder interface. Kayak is a Dotnet based web server.

In the world of Java, the soon-to-be-released Jetty 7 has extended the Continuations API (previously used only with Comet) to all web scripts, so now Servlet 3.0 applications can be full async as well.

From these examples we can see that the full async model has become trendy only in the last year, and for good reason: it is very important for the scalability of a web server.

 

Posted in Beginner, Elastic Architecture

 

Export a Microsoft Access Database to JSON (Northwind Example)

15 Jul

This article shows how to use a command-line utility to export an existing Microsoft Access database to a text file containing a single JSON structure.

If you do not know what JSON is, please see my earlier article on that subject.

The Microsoft Access database we are going to export today is the (in)famous “Northwind” example database, which millions of Microsoft programmers have been exposed to.

Download and Install the “Northwind” Access Database

First, download a copy of the Northwind Access database from here.

Save the ZIP file and then unpack the “Northwind.mdb” file into a local directory on your Windows file system.  (Optional: open the database with Microsoft Access – note that the utility described below doesn’t need Microsoft Access to work.)

Now open your Windows Control Panel, get into “Administrative Tools” and then open the “Data Sources (ODBC)” panel.   Go to the “System DSN” tab and add a new “Microsoft Access Driver (*.mdb)” entry.  Select your local copy of “Northwind.mdb” when prompted and name your DSN (“Data Source Name”) something like  “NorthwindTest”.

Download and Install the “Access to JSON” Export Utility

We wrote a command-line utility to help you export your Access database as a single piece of JSON.  Download a copy of this here.

Save the ZIP file and then unpack the enclosed files into a local directory on your Windows file system.    To make sure the utility unpacked OK, please open a command prompt, CD into the directory where the executables are and then run “exportaccessdbtojson” from the command line. The expected output is shown below.

C:\utilities>exportaccessdbtojson

ExportAccessDBToJSON v1.0.0.2 –  August 6, 2010
Please visit http://www.divconq.com for more information.

Usage:
ExportAccessDBToJSON NameOfDSN [-JustJSON] [ID4Tables] [> OutputFilename]
…where ID4Tables is a comma-delimited list of tables to which you
want the utility to add an incremented AutoID field.

Example:
To export an existing Access database defined to your system
with an ODBC DSN of “Northwinds” into a file that will contain
a mix of log messages and JSON output, use the following command:
ExportAccessDBToJSON Northwinds > JSONandLog.txt

To export only the JSON output in the output file, use this instead:
ExportAccessDBToJSON Northwinds -Quiet > JustJSON.txt

Export The Access Database (With Log Output)

Start by trying to execute the following command (with the name of your actual DSN in place of “NorthwindTest”:

C:\utilities>exportaccessdbtojson NorthwindTest

If all is well you will see many lines of text stream by ending with something like this:

"Phone" : "(514) 555-2955",
"Fax" : "(514) 555-2921",
"HomePage" : ""
}
}
}
=-=-=-= END JSON =-=-=-=
22:00:26 Completed OK.

If not, you may see an error like this:

C:\utilities>exportaccessdbtojson BadNorth
22:01:02 ERROR: Could not connect to DSN=BadNorth! System.Data.Odbc.OdbcExce
ption: ERROR [IM002] [Microsoft][ODBC Driver Manager] Data source name not found
and no default driver specified

If things are going poorly, please see the “Troubleshooting” section below.  However, if things went well, please proceed to the next section.

Export The Access Database As A Single JSON File

Now, let’s turn off the extraneous log messages and the superfluous “START JSON” and “END JSON” tags.  Let’s also capture our output into a file so we can look at it using our favorite text editor.

Use these commands to perform the export and view the output, again substituting in the correct name of your DSN and any output filename you wish:

C:\utilities>exportaccessdbtojson NorthwindTest -justjson > NorthwindJSON.txt

C:\utilities>notepad NorthwindJSON.txt

If you would like to just see the expected JSON output from the Northwind database, please download the Zip file that contains that data as “NorthwindJSON.txt” here.

Next Steps

The next article talks about how to fit a simple relational database structure into the form needed by Cassandra today. The article after that expands on the concept to rearrange the Northwind JSON structures you just exported into a Cassandra-friendly format. Then, in a future article, I will show how to import the converted JSON into Cassandra.  Stay tuned!

How the Utility Works

This utility connects to the named Access database, lists all of its tables, assumes the first column in each table is the key, and then exports the complete contents of each table as a series of string-based name/value pairs, collected row by row under the value of the key column.  The “ID4Tables” argument is used to add unique IDs to tables that lack them: e.g., the “Order Details” table in Northwind.  The JSON structures created will include escaped special characters (e.g., “\\” to represent a backslash) but are not passed through a full JSON validation step before they are written, so they are not 100% guaranteed to work in your JSON parser.  (These structures do, of course, work with the Northwind Access database.)
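The row-to-JSON grouping at the heart of the utility can be sketched in Python. This is a simplified reimplementation of the idea only, not the actual utility’s source, and the column names and sample row below are hypothetical:

```python
def rows_to_json_structure(column_names, rows):
    """Group a table's rows into {key: {column: value}} form,
    treating the first column as the key, as the utility does."""
    key_name = column_names[0]
    result = {}
    for row in rows:
        # String-based name/value pairs, one record per row.
        record = dict(zip(column_names, (str(v) for v in row)))
        key = record.pop(key_name)
        result[key] = record
    return result

columns = ["CustomerID", "Phone", "Fax"]          # hypothetical columns
rows = [("ALFKI", "030-0074321", "030-0076545")]  # hypothetical data
print(rows_to_json_structure(columns, rows))
```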

This utility can be used on databases other than the Northwind sample database but has not been extensively tested against many others – your mileage may vary.

Troubleshooting

Most problems with this process are caused by typos in your command-line invocation or in your DSN name in the ODBC settings.  Also make sure you set up a System DSN entry, not a User DSN entry.

Also, please avoid DSNs with spaces or other characters.   The utility wasn’t tested against these.

 

Posted in Beginner, JSON, Northwind

 

What’s the difference between a supercolumnfamily and a columnfamily with subcolumns in Cassandra?

13 Jul

This article covers the difference between a supercolumnfamily and a columnfamily with subcolumns in Cassandra.

First, remember that in Cassandra terminology, “subcolumn” = “sub column” = “super column” = “supercolumn”.

With that in mind, a “super column family” is really just a “column family…that contains super columns under its rows”.  (As opposed to a regular “column family” that merely contains rows without supercolumns.)

The confusion comes about because “super column family” entries look like this:

<ColumnFamily Name="Super1"
              ColumnType="Super"
              CompareWith="BytesType"
              CompareSubcolumnsWith="BytesType" />

…and plain old “column family” entries look like this:

<ColumnFamily Name="Regular1"
              CompareWith="BytesType" />

…both use a tag named “ColumnFamily” in Cassandra’s “storage-conf.xml” definition file.

Personally, I prefer using the term “Column Family” to cover both column families with rows that contain supercolumns as well as column families with rows that don’t contain supercolumns.  But if someone uses the term “super column family” they always mean “a column family that contains rows that contain supercolumns.”

 

Posted in Beginner, Cassandra

 

What’s the Difference Between a SuperColumn and a SubColumn in Cassandra?

12 Jul

This article covers the difference between a supercolumn and a subcolumn in Cassandra.

Let me cut to the chase: there is no difference.  They are two terms for exactly the same thing.

If you are familiar with a typical keyspace->column family->row->super column->column structure, such as the one pictured below, then you could safely replace all instances of the phrase “super column” with “subcolumn” without changing the meaning.

The confusion around “super column” vs. “sub column” is fueled largely by the Cassandra configuration file.  In your “storage-conf.xml” file you will see XML “ColumnFamily” configuration elements like this:

      <ColumnFamily Name="Super1"
                    ColumnType="Super"
                    CompareWith="BytesType"
                    CompareSubcolumnsWith="BytesType" />

If this were a plain old “ColumnFamily” entry, you would only see this:

      <ColumnFamily Name="Regular1"
                    CompareWith="BytesType" />

…but this is a “Super Column Family”, so there are two extra attributes:

  • ColumnType=”Super” to tell Cassandra that this column family will contain super columns.
  • CompareSubcolumnsWith=”BytesType” to tell Cassandra that our sub columns will be sorted through bit-by-bit comparison.

Confused?  If so, go back and read the last two bullets again while telling yourself:
“super column = sub column = supercolumn = subcolumn…”

 

Posted in Beginner, Cassandra

 

What is Cassandra’s “system” keyspace used for?

07 Jul

The “system” keyspace in Cassandra is a built-in keyspace. Unlike the built-in databases in other database software, Cassandra’s system keyspace is NOT used for long-term storage of significant amounts of configuration information. Instead, this keyspace is mostly used to coordinate updates of user data between multiple Cassandra systems.

There are essentially two groups of information stored in the Cassandra system keyspace. One is information about the current node, such as the “token”, the “cluster name” (e.g., “Test Cluster” in the default “storage-conf.xml”) and an “automatic bootstrap” flag that allows nodes added for performance reasons in larger clusters to automatically grab data and come up when ready. All of these settings are originally configured in your “storage-conf.xml” file.

The second type of information stored in the Cassandra system keyspace is used to cache and forward updates destined for other nodes, especially nodes that are temporarily down. (This second type of information is known in the Cassandra world as “hinted handoff” data.)

On disk you will find the physical files that make up Cassandra’s system keyspace in locations such as:
D:\var\lib\cassandra\data\system\LocationInfo-1-Data.db
D:\var\lib\cassandra\data\system\LocationInfo-1-Filter.db
D:\var\lib\cassandra\data\system\LocationInfo-1-Index.db
D:\var\lib\cassandra\data\system\LocationInfo-2-Data.db
D:\var\lib\cassandra\data\system\LocationInfo-2-Filter.db
D:\var\lib\cassandra\data\system\LocationInfo-2-Index.db

I’m using Cassandra 0.6.2 on Windows 7 here – you could imagine where this would be on a *nix system ;)

If you’ve made it this far you may notice that Cassandra uses a different subfolder in “data” for each keyspace and THREE files for each column family within a keyspace. The Cassandra system keyspace uses the same structure: one data file of “sorted strings” for the key/value pairs that make up each column, one index file made up of column keys plus the location of the data (a.k.a. the “offset”) in the data file, and one “filter” file simply containing all the keys (in “bloom filter” format, if you’re interested).
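The data-plus-index arrangement can be sketched in a few lines of Python. This is a toy model of the idea only, not Cassandra’s actual SSTable format:

```python
import io

# Toy model: one "data file" of key/value records and an "index"
# mapping each key to its byte offset in the data file.
data_file = io.BytesIO()
index = {}
for key, value in [("GumDrop", "0.25"), ("Transformer", "29.99")]:
    index[key] = data_file.tell()          # record the offset
    data_file.write(f"{key}={value}\n".encode())

# A lookup seeks straight to the recorded offset instead of
# scanning the whole data file.
data_file.seek(index["Transformer"])
print(data_file.readline().decode().strip())  # prints: Transformer=29.99
```

A bloom filter adds a third piece: a compact structure that can say “definitely not present” without touching the data or index files at all.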

Long story short: don’t worry about Cassandra’s system keyspace until you’re ready to work with multiple Cassandra servers across a distributed cluster. Even then, if things are running normally, you still shouldn’t have much*, if any, interaction with the contents of this built-in keyspace.

* = unless you decide to change the name of your cluster.

 

Posted in Beginner, Cassandra

 

How to Add and Retrieve Data from a Cassandra Database

06 Jul

This article describes how to create a new keyspace on a Cassandra database server, how to add data to that keyspace and how to run some simple queries against that data.

If you do not know how to install and run Cassandra, please read my earlier article on that subject first.

You should also know a little about XML: specifically how to find and fix unmatched XML tags in case you typo during the following instructions.  (Hint: file-by-file backups are a good idea until you get comfortable.)

Create a New Keyspace on Cassandra

First, sign on to your Cassandra server using the “cassandra-cli” client.  Use the “show keyspaces” command to ensure you have a live connection to the server and to make sure the keyspace you are about to add doesn’t already exist.

cassandra> show keyspaces
Keyspace1
system

These two keyspaces were automatically created when you installed Cassandra and are completely independent of one another, like separate databases on a relational database system.  A diagram of two separate keyspaces in our Cassandra database would look like this:

We want to add a new keyspace called “ToyStore”.  Once we’re done, we’d expect our diagram to look like this:

Now, we’re working with the current stable version (“0.6.3” at the time I wrote this article), so we’ll need to add our new keyspace by hand.  Shut down your server and client, then go into your Cassandra “conf” directory and open the “storage-conf.xml” file with your favorite text editor.

Around line 60 you’ll see a tag that looks like this:

<Keyspaces>
  <Keyspace Name="Keyspace1">

When you scroll further you’ll come to the end of the “Keyspaces” section.  Note that there is no ‘<Keyspace Name=”system”>’ entry; all entries here are for “user” databases.

After the XML tags for “Keyspace1”, please add the following XML snippet to create a new, empty keyspace called “ToyStore”.  This snippet also creates a “column family” (if you’re a relational database user, think “table” for now) called “Toys”.

<Keyspace Name="ToyStore">
  <ColumnFamily Name="Toys" CompareWith="UTF8Type" />
  <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
  <ReplicationFactor>1</ReplicationFactor>
  <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
</Keyspace>

The “CompareWith” attribute in the “ColumnFamily” tag you just added controls how column names are indexed and sorted within each row.  The “UTF8Type” value indicates that you’re indexing by UTF-8 characters.  Other possible values include “AsciiType”, “LongType” (64-bit long integers) and “BytesType” (straight bit-by-bit comparison, the default value).
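The practical difference between these comparator types can be sketched in Python (my illustration, not Cassandra code). Consider the names “1”, “2” and “10”:

```python
names = ["10", "2", "1"]

# UTF8Type / AsciiType: lexicographic text ordering
print(sorted(names))                      # ['1', '10', '2']

# LongType: ordered as 64-bit integers
print(sorted(names, key=int))             # ['1', '2', '10']

# BytesType: raw bit-by-bit comparison of the encoded bytes
print(sorted(n.encode() for n in names))  # [b'1', b'10', b'2']
```

Note that for plain ASCII names, text ordering and byte ordering agree; the choice matters most for numeric and non-ASCII names.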

The other tags here (“ReplicaPlacementStrategy”, “ReplicationFactor” and “EndPointSnitch”) set a few replication options we’ll get into later.

Save your changes and restart your Cassandra server.  If you can connect with the Cassandra client, issue the following command and see “ToyStore” in the result, then you’ve succeeded in adding your first keyspace to Cassandra!

cassandra> show keyspaces
Keyspace1
system
ToyStore

Add Data To An Existing Keyspace on Cassandra

Now that we have a new “ToyStore” keyspace, it’s time to add some data.  If you were watching closely you’ll have noticed that we did more than add a keyspace in the previous step: we added our first “column family” too.  (Think “table” if you’re coming from a relational database background.)

To get started adding data, restart your Cassandra client and use the following syntax to add six name/value pairs to the “Toys” column family of your new “ToyStore” keyspace.

cassandra> set ToyStore.Toys['Transformer']['Price'] = '29.99'
Value inserted.
cassandra> set ToyStore.Toys['GumDrop']['Price'] = '0.25'
Value inserted.
cassandra> set ToyStore.Toys['MatchboxCar']['Price'] = '1.49'
Value inserted.
cassandra> set ToyStore.Toys['Transformer']['Section'] = 'Action Figures'
Value inserted.
cassandra> set ToyStore.Toys['GumDrop']['Section'] = 'Candy'
Value inserted.
cassandra> set ToyStore.Toys['MatchboxCar']['Section'] = 'Vehicles'
Value inserted.

If you run a “help” command from the Cassandra client you will see the following syntax for the kind of “set” command we just used:

set <ksp>.<cf>['<key>']['<col>'] = '<value>'

Let’s break this command syntax down using one of the commands we just typed.

set ToyStore.Toys['Transformer']['Price'] = '29.99'

According to our command syntax, the command we typed meant this:

  • ksp = KeySpace = “ToyStore”
  • cf = Column Family = “Toys”
  • key = Row Key (an indexed key which links multiple columns) = “Transformer”
  • col = Single Column Name (the name in a single name/value pair) = “Price”
  • val = Single Column Value (the value in a single name/value pair) = “29.99”

These six commands created a total of three rows in the “Toys” column family: “Transformer”, “GumDrop” and “MatchboxCar”.  Within each row you created two columns: “Section” and “Price”.   Sketched out in a diagram the data you inserted would look something like this:

Within the “Toys” column family, you could also represent this data in JSON like this:

{
  "Transformer" : {
    "Price" : "29.99",
    "Section" : "Action Figures"
  },
  "GumDrop" : {
    "Price" : "0.25",
    "Section" : "Candy"
  },
  "MatchboxCar" : {
    "Price" : "1.49",
    "Section" : "Vehicles"
  }
}

Starting to make sense? Now, let’s try to pull this data back.

Retrieve Data From An Existing Keyspace on Cassandra

Let’s start by counting the number of name/value pairs (i.e., “columns”) stored under one of the row keys we just inserted.

cassandra> count ToyStore.Toys['GumDrop']
2 columns

If you followed directions, the answer will be “2 columns”, whether you use “GumDrop”, “Transformer” or “MatchboxCar” as your row key.

Now try spelling out the row key in all lowercase.

cassandra> count ToyStore.Toys['gumdrop']
0 columns

Yes, Cassandra row keys are case-sensitive. Consider yourself warned, especially if you’re coming from a database environment where keys are case-insensitive.

Now try spelling out a row key that doesn’t exist.

cassandra> count ToyStore.Toys['RedMatterBall']
0 columns

Notice that you didn’t get a “no column exists” error on your count statement; instead you were simply told that zero name/value pairs exist for your non-existent row key.

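The behavior above can be mimicked with a plain Python dictionary — a sketch of my own, not Cassandra code — where lookups are case-sensitive and a missing row simply counts as zero columns:

```python
# Illustration of the CLI's "count" behavior using a dictionary as a
# stand-in for the "Toys" column family. A missing row key yields an
# empty result (count 0) rather than an error.
toys = {
    "GumDrop":     {"Price": "0.25",  "Section": "Candy"},
    "Transformer": {"Price": "29.99", "Section": "Action Figures"},
    "MatchboxCar": {"Price": "1.49",  "Section": "Vehicles"},
}

def count_columns(cf, row_key):
    return len(cf.get(row_key, {}))  # missing row -> empty dict -> 0

print(count_columns(toys, "GumDrop"))        # 2
print(count_columns(toys, "gumdrop"))        # 0 (keys are case-sensitive)
print(count_columns(toys, "RedMatterBall"))  # 0 (row does not exist)
```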
Now that you know to be careful with the exact name and case of your row keys, let’s pull back the data in a particular row instead of just counting how many columns it contains. To do this, use the “get” command as shown below.

cassandra> get ToyStore.Toys['GumDrop']
=> (column=Section, value=Candy, timestamp=1278132493790000)
=> (column=Price, value=0.25, timestamp=1278132306875000)
Returned 2 results.

The two “column” and “value” entries look familiar, but there’s a third item in each of our columns: “timestamp”. That value records when you made each column entry. The timestamp may not mean much to us yet (we will safely ignore it for another article or two), but it will matter a great deal when we start merging column inserts/updates from two or more Cassandra database nodes.

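To preview why the timestamp matters, here is a deliberately simplified sketch of last-write-wins reconciliation — my own illustration of the general idea, not Cassandra’s actual merge code:

```python
# Simplified last-write-wins reconciliation: when two nodes hold different
# versions of the same column, the version with the newer timestamp wins.
def merge_columns(a, b):
    """Each argument is a (value, timestamp) pair for the same column."""
    return a if a[1] >= b[1] else b

node1 = ("0.25", 1278132306875000)  # timestamp from the output above
node2 = ("0.30", 1278132400000000)  # hypothetical later update elsewhere
print(merge_columns(node1, node2))  # the newer write ("0.30") wins
```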
By the way, here’s how you could represent the timestamp on each column in your diagram:

But back to our data retrieval task. Before we move on, try at least one row key that doesn’t exist.

cassandra> get ToyStore.Toys['RedMatterBall']
Returned 0 results.

Again, note that Cassandra reports that there are “0 results” for this row key, not that this row key doesn’t exist.

The last thing we’re going to do in this article is drill down into an existing row and only pick out one column (i.e., one name/value pair).

cassandra> get ToyStore.Toys['GumDrop']['Price']
=> (column=Price, value=0.25, timestamp=1278132306875000)

Now try this with a valid row key and an invalid column.

cassandra> get ToyStore.Toys['GumDrop']['Taste']
Exception null

This time we got an error rather than a “count of zero” message!

Relational database folks, are you starting to see the pattern? (Hint: Using a non-existent row key is like executing a “SELECT COUNT(*)” query with a WHERE clause that matches nothing, but using a non-existent column name is like executing a query that references invalid fields.)

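The same asymmetry shows up in this dictionary analogy of mine (again, an illustration rather than Cassandra code): a missing row falls back to an empty result, while asking an existing row for a column it lacks raises an error:

```python
# Analogy for the CLI's "get" behavior: a missing row returns an empty
# result set, but a missing column on an existing row is an error.
toys = {"GumDrop": {"Price": "0.25", "Section": "Candy"}}

def get_row(cf, row_key):
    return cf.get(row_key, {})   # like "Returned 0 results."

def get_column(cf, row_key, col):
    return cf[row_key][col]      # raises KeyError, like the CLI exception

print(get_row(toys, "RedMatterBall"))        # {}
print(get_column(toys, "GumDrop", "Price"))  # 0.25
# get_column(toys, "GumDrop", "Taste") would raise KeyError
```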
For more information about columns, row keys and another Cassandra structure called a “super column” (a column whose value is itself a collection of subcolumns), please see my recent article on that subject.

Troubleshooting

If you encounter a “Fatal error: Invalid replicaplacementstrategy class org.apache.cassandra.locator.RackUnawareStrategy” while trying to start up your Cassandra server, this may be caused by spaces or newlines in your “ReplicaPlacementStrategy” keyspace definition. (Remember this definition is found in your “storage-conf.xml” file.) Make sure this line looks like: “<ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>” – with NO spaces or other spacing characters anywhere within this tag!

If you encounter a “supercolumn parameter is not optional for super CF” error when you try to issue an otherwise valid “set” command from the Cassandra client, you probably incorrectly defined the column family referenced in your set command as a “super” column family in your keyspace definition. To fix this, take out the ColumnType="Super" and CompareSubcolumnsWith="UTF8Type" attributes from that column family’s definition in your “storage-conf.xml” file, then restart the Cassandra server.

 

Posted in Beginner, Cassandra

 

Simple, Secure and Speedy – Part Two

05 Jul

Amazon’s EC2 (Elastic Compute Cloud) enables a user to deploy a fully customized Linux or Windows server in the Amazon cloud. Beyond the security provided by the built-in (Windows or Linux) firewall, EC2 provides another layer of firewall called a security group.

A security group may be configured to filter IP traffic – to allow access to only certain ports from certain Internet address ranges. IP traffic that does not pass the filter rules will never even touch the server. All servers running in a security group assume the same filter rules.

Furthermore, a security group may be configured to allow access only from another security group, enabling very secure multi-tier deployments in Amazon’s EC2.

private tier in Amazon's EC2

Read the rest of this entry »

 

Installing and Running the Cassandra Database

05 Jul

This article provides instructions to download, install and run two major components (one client, one server) of the Cassandra database on your Windows system.

If you do not know what Cassandra is, please read my earlier article on this subject first.

Prerequisites

Cassandra is written in Java.  Please make sure you have installed Java version 6 or a more recent version before proceeding.

Installing Java also usually sets up proper values of the JAVA_HOME and PATH environment variables.  If you see complaints about “JAVA_HOME” or you see errors like “‘java’ is not recognized as an internal or external command”, please see the “JAVA_HOME and PATH Environment Variables” section near the bottom of this article.

Downloading and Expanding Cassandra

Download a package of Cassandra binary-format programs from http://cassandra.apache.org/download/.

Save this file as “apache-cassandra-0.6.3-bin.tar.gz” (with the version number set appropriately).

Unpack this *.tar.gz file using PeaZip or your favorite archive utility into a new folder of your choice.  I used a new folder called “C:\Work\apache-cassandra-0.6.3”. Your unpacked files should look like this in Windows Explorer:

Running the Cassandra Server

Open up a command-line window.  Run this command so you’ll be able to tell which window is running the client and which is running the server later:

TITLE CassServer

CD into your “C:\Work\apache-cassandra-0.6.3” folder, type “bin\cassandra” and hit “Enter”.  Behind the scenes this kicks off a batch file (remember, this is a Java application) that checks to make sure you’ve done your homework on the prerequisites and then attempts to launch the server.

If all goes well, you will see something like this.

C:\Work\apache-cassandra-0.6.3>bin\cassandra
Starting Cassandra Server
Listening for transport dt_socket at address: 8888
INFO 14:20:21,220 Auto DiskAccessMode determined to be standard
INFO 14:20:22,482 Saved Token not found. Using 34630284372509656815837333105728952419
INFO 14:20:22,482 Saved ClusterName not found. Using Test Cluster
INFO 14:20:22,492 Creating new commitlog segment /var/lib/cassandra/commitlog\CommitLog-1278012022492.log
INFO 14:20:22,582 LocationInfo has reached its threshold; switching in a fresh Memtable at CommitLogContext(file='/var/lib/cassandra/commitlog\CommitLog-1278012022492.log', position=419)
INFO 14:20:22,602 Enqueuing flush of Memtable-LocationInfo@14471083(169 bytes, 4 operations)
INFO 14:20:22,612 Writing Memtable-LocationInfo@14471083(169 bytes, 4 operations)
INFO 14:20:23,313 Completed flushing C:\var\lib\cassandra\data\system\LocationInfo-1-Data.db
INFO 14:20:23,363 Starting up server gossip
INFO 14:20:23,544 Binding thrift service to localhost/127.0.0.1:9160
INFO 14:20:23,554 Cassandra starting up . . .

Note that the Cassandra server will NOT return to the command prompt if it started up successfully. If you see a “Cassandra starting up…” message (without a return to the command prompt), leave the server be and try running the Cassandra client.

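If you’d like to double-check from outside the console window before moving on, a tiny Python helper (my own convenience function, not something Cassandra ships) can test whether anything is accepting connections on the Thrift port the client will use:

```python
import socket

def is_listening(host, port, timeout=1.0):
    """Return True if something accepts TCP connections at host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# With the server running, this should report True:
# print(is_listening("localhost", 9160))
```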
Running the Cassandra Client

Open up a command-line window. Run this command so you’ll be able to tell which window is running the client and which is running the server later:

TITLE CassClient

CD into your “C:\Work\apache-cassandra-0.6.3” folder, type “bin\cassandra-cli -host localhost -port 9160” and hit “Enter”. Behind the scenes this kicks off a batch file (remember, this is a Java application) that checks to make sure you’ve done your homework on the prerequisites and then attempts to launch the client.

If all goes well, you will see something like this.

C:\Work\apache-cassandra-0.6.3>bin\cassandra-cli -host localhost -port 9160
Starting Cassandra Client
Connected to: "Test Cluster" on localhost/9160
Welcome to cassandra CLI.
Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
cassandra>

Yes, it’s an interactive command prompt – that’s exactly what we were hoping for.

Trying Some Easy Commands

I’ll cover how to actually store and retrieve data in a future article, but let’s at least make sure our client is really talking to our server. Run the following three commands to ask the server to return some basic information about the Cassandra server environment to your Cassandra client.

cassandra> show api version
2.1.0
cassandra> show cluster name
Test Cluster

Unlike relational databases like SQL Server, Cassandra doesn’t organize all its structures and data into “databases” – instead it uses a similar concept called “keyspaces”.

To list all the keyspaces on your new Cassandra server, run the following command.

cassandra> show keyspaces
Keyspace1
system

This server has two keyspaces: a built-in “system” keyspace which contains information about this Cassandra server node’s current state and a user keyspace called “Keyspace1” that holds data you can insert or read.

Now there’s no “boss key” with Cassandra, so you’ll also need to know how to shut everything down in case you’re trying this on the job at a SQL Server shop. ;)

Shutting Down

To shut down the Cassandra server, go to your “CassServer” window and hit “Ctrl+C”. Answer “y” when asked if you want to terminate the batch file.

To quit the Cassandra command-line client, go to your “CassClient” window and type “exit” at the command prompt. (“Ctrl+C” won’t work here.)

Next Steps: Adding and Retrieving Data

If all is well, please proceed to my article about how to add and retrieve data from your new Cassandra database system.

Troubleshooting

If you get a “NoClassDefFoundError” message while trying to start either the Cassandra client or server, please see my earlier article on that subject.

If you get a “Not connected to a cassandra instance.” message while using the Cassandra client, you probably forgot to specify “-host localhost -port 9160” on the command line.

JAVA_HOME and PATH Environment Variables

Cassandra’s applications (and many other Java applications) need an environment variable called “JAVA_HOME” defined in order to run.  If you only run a single version of Java on your system, you will probably want to set the value of this environment variable at the system level to something like “C:\Program Files\Java\jre6” or “C:\Sun\SDK\jdk” (no “bin”, no “lib” – this is just the home directory).

You should also make sure that any Java application you start from the command line can find its necessary libraries by setting the PATH environment variable appropriately.  If you only run a single version of Java on your system, you may need to APPEND to the value of the PATH environment variable at the system level.  Your appended value will be something like “;C:\Program Files\Java\jre6\bin” or “;C:\Sun\SDK\bin” (note the leading semicolon separator – point to the “bin” directory here, not just the home directory).

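As a quick sanity check, this short Python helper (my own convenience snippet, not something Cassandra ships) verifies that both pieces are in place before you try the batch files:

```python
import os
import shutil

def java_env_ok():
    """Check that JAVA_HOME is set and 'java' is findable on the PATH."""
    home = os.environ.get("JAVA_HOME", "")
    return bool(home) and shutil.which("java") is not None

# print(java_env_ok())  # True once both variables are configured
```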
On my Windows systems I often have multiple versions of Java installed, so I use an extremely short Windows batch file called “prepforjava.bat” which contains two lines to set these environment variables as needed:

SET PATH=%PATH%;C:\Program Files\Java\jre6\bin
SET JAVA_HOME=C:\Program Files\Java\jre6

This batch file is stored in another folder also referenced by my system-level PATH environment variable so I can call it from any command-line window at any time.

 

Posted in Beginner, Cassandra