
Archive for the ‘Cassandra’ Category

Introduction to Cassandra Columns, Super Columns and Rows

21 Jul

This article provides new users the basics they need to understand Cassandra’s “column / super column / row” data model.

Though the focus is not on mechanics, this article assumes you are familiar with adding columns to and requesting data from existing keyspaces on Cassandra. If not, please see my earlier article on that topic.  Knowledge of JSON is also important to understand the data examples below – if you need help here, please see my earlier article on that topic.

Remember that a Cassandra column is basically a “name=value” pair* (e.g., “color=red”).  You can use multiple columns to represent data such as:

    "Price" : "29.99",
    "Section" : "Action Figures"

As you may have seen in my previous article, multiple columns can also be grouped in Cassandra “rows” to handle data  such as:

{
  "Transformer" : {
    "Price" : "29.99",
    "Section" : "Action Figures"
  },
  "GumDrop" : {
    "Price" : "0.25",
    "Section" : "Candy"
  },
  "MatchboxCar" : {
    "Price" : "1.49",
    "Section" : "Vehicles"
  }
}

The keys used to group related columns into rows in this example were “Transformer”, “GumDrop” and “MatchboxCar”.
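As a rough illustration (not anything Cassandra itself runs), the same rows can be modelled in Python as a dictionary of dictionaries, with the row key on the outside and the column name on the inside. The helper name `get_column` is made up for this sketch.

```python
# Rows from the example above: row key -> {column name -> column value}.
toys = {
    "Transformer": {"Price": "29.99", "Section": "Action Figures"},
    "GumDrop": {"Price": "0.25", "Section": "Candy"},
    "MatchboxCar": {"Price": "1.49", "Section": "Vehicles"},
}


def get_column(rows, row_key, column_name):
    """Return one column value, or None if the row or the column is absent."""
    return rows.get(row_key, {}).get(column_name)


row_keys = sorted(toys)
gumdrop_price = get_column(toys, "GumDrop", "Price")
print(row_keys)
print(gumdrop_price)
```

The point of the sketch is only that a row key selects a group of columns and a column name selects one value within that group.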

In my earlier article, we looked at this data structure:

In JSON, this keyspace->column family->row->column data structure would be represented like this:

{
  "ToyStore" : {
    "Toys" : {
      "GumDrop" : {
        "Price" : "0.25",
        "Section" : "Candy"
      },
      "Transformer" : {
        "Price" : "29.99",
        "Section" : "Action Figures"
      },
      "MatchboxCar" : {
        "Price" : "1.49",
        "Section" : "Vehicles"
      }
    }
  },
  "Keyspace1" : null,
  "system" : null

}

If you only wanted to add other, unrelated collections of information (e.g., “BugCollection” or “PaintColors”), you could keep adding a new keyspace for each new collection.  However, if you needed to keep track of similar collections of data (e.g., your Ohio and New York toy stores instead of a single toy store) you’d need to turn to a different kind of Cassandra element: the “super column”.

To see super columns in action, inspect this keyspace->column family->row->super column->column data structure as it appears in JSON:

{
  "ToyCorporation" : {
    "ToyStores" : {
      "Ohio Store" : {
        "Transformer" : {
          "Price" : "29.99",
          "Section" : "Action Figures"
        },
        "GumDrop" : {
          "Price" : "0.25",
          "Section" : "Candy"
        },
        "MatchboxCar" : {
          "Price" : "1.49",
          "Section" : "Vehicles"
        }
      },
      "New York Store" : {
        "JawBreaker" : {
          "Price" : "4.25",
          "Section" : "Candy"
        },
        "MatchboxCar" : {
          "Price" : "8.79",
          "Section" : "Vehicles"
        }
      }
    }
  }
}

This data could also be visualized like this:

Given its late appearance, you might expect that “Ohio Store” and “New York Store” would represent super columns that span multiple rows.  However, the opposite is true: “Ohio Store” and “New York Store” are now the row keys, and entries like “Transformer”, “GumDrop” and “MatchboxCar” have become super column keys.

Like column keys, super column keys are indexed and sorted by a specific type (e.g., “UTF8Type”, “AsciiType”, “LongType”, “BytesType”, etc.).  However, like row keys, super column entries have no values of their own; they are simply used to collect other columns.

Notice that the keys of the two groups of super columns do not match.  ({“Transformer”, “GumDrop”, “MatchboxCar”} does not match {“JawBreaker”, “MatchboxCar”}. )  This is not an error: super column keys in different rows do not have to match and often will not.
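To make the "keys do not have to match" point concrete, here is a small Python sketch that models the two store rows as nested dictionaries and compares their super column keys. It is only a model of the JSON above; the function name `super_column_keys` is invented for the example.

```python
# Row key -> super column key -> {column name -> column value}.
toy_stores = {
    "Ohio Store": {
        "Transformer": {"Price": "29.99", "Section": "Action Figures"},
        "GumDrop": {"Price": "0.25", "Section": "Candy"},
        "MatchboxCar": {"Price": "1.49", "Section": "Vehicles"},
    },
    "New York Store": {
        "JawBreaker": {"Price": "4.25", "Section": "Candy"},
        "MatchboxCar": {"Price": "8.79", "Section": "Vehicles"},
    },
}


def super_column_keys(column_family, row_key):
    """Return the set of super column keys stored under one row key."""
    return set(column_family.get(row_key, {}))


ohio = super_column_keys(toy_stores, "Ohio Store")
new_york = super_column_keys(toy_stores, "New York Store")

shared = ohio & new_york      # super columns present in both rows
only_ohio = ohio - new_york   # super columns found only in the Ohio row
print(sorted(shared), sorted(only_ohio))
```

Each row carries its own set of super columns, and the same super column key ("MatchboxCar") can hold different column values in different rows.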

In a future article I will describe how to pass JSON structures that describe data in either row/column or row/super column/column format into and out of Cassandra – stay tuned!

* = We’ll ignore the timestamp element of Cassandra columns for now.  These timestamps are used to reconcile updates from multiple nodes, but don’t worry about that until you understand the whole column/supercolumn thing first.


Posted in Beginner, Cassandra, JSON

 

Migrate a Relational Database Structure into a NoSQL Cassandra Structure (Part I)

20 Jul

This article begins to explore how to migrate a relational database structure (tables linked by foreign keys) into a NoSQL database structure that can be used in Cassandra.

If you do not know what Cassandra is, why NoSQL and Cassandra are important technologies or what JSON is and why you should know it, please click the links in this sentence to learn more about each topic before proceeding.

The Original Relational Database Structure

We are going to start with a very simple 1:N relational database structure. Our first two tables are “forests” and “famoustrees”.  Here is our data in tabular format:

forests:

famoustrees:

“famoustrees” is linked to “forests” using the “forestID” foreign key.  Notice that there are no famous trees in the “Lonely Grove” forest, one famous tree in the “100 Acre Woods” and two famous trees in the “Black Forest”.

If we were to represent the data in our database – call it our “biologicalfeatures” database – in JSON, it would look like this:

{
  "biologicalfeatures":
    {
    "forests" :
      {
      "forest003" :
        {
          "name" : "Black Forest",
          "trees" : "two million",
          "bushes" : "three million"
        },
      "forest045" :
        {
          "name" : "100 Acre Woods",
          "trees" : "four thousand",
          "bushes" : "five thousand"
        },
      "forest127" :
        {
          "name" : "Lonely Grove",
          "trees" : "none",
          "bushes" : "one hundred"
        }
      },
    "famoustrees" :
      {
      "tree12345" :
        {
          "forestID" : "forest003",
          "name" : "Der Tree",
          "species" : "Red Oak"
        },
      "tree12399" :
        {
          "forestID" : "forest045",
          "name" : "Happy Hunny Tree",
          "species" : "Willow"
        },
      "tree32345" :
        {
          "forestID" : "forest003",
          "name" : "Das Ubertree",
          "species" : "Blue Spruce"
        }
      }
    }
}

Denormalizing the Tables

To collapse the famoustrees table into our forests table, we need to move each famoustree entry underneath its forest entry.  We can also remove the foreign “forestID” key from each famoustree entry – we don’t need that anymore.

However, we should retain the type of each famoustree entry we moved into the forest entry.  We can do this by adding an extra “type” value to each entry.

Finally, we could break out the original non-ID information in each forest entry into a typed section too.  We’ll tag each of these sections with a new ID of “generalinfo”.  (This is a Cassandra-friendly convention – we’ll get into this more below.)

Represented in JSON, our data now looks like this:

{
  "biologicalfeatures":
    {
    "forests" :
      {
      "forest003" :
        {
        "generalinfo" :
          {
          "name" : "Black Forest",
          "trees" : "two million",
          "bushes" : "three million"
          },
        "tree12345" :
          {
            "type" : "famoustree",
            "name" : "Der Tree",
            "species" : "Red Oak"
          },
        "tree32345" :
          {
            "type" : "famoustree",
            "name" : "Das Ubertree",
            "species" : "Blue Spruce"
          }
        },
      "forest045" :
        {
        "generalinfo" :
          {
          "name" : "100 Acre Woods",
          "trees" : "four thousand",
          "bushes" : "five thousand"
          },
        "tree12399" :
          {
            "type" : "famoustree",
            "name" : "Happy Hunny Tree",
            "species" : "Willow"
          }
        },
      "forest127" :
        {
        "generalinfo" :
          {
          "name" : "Lonely Grove",
          "trees" : "none",
          "bushes" : "one hundred"
          }
        }
      }
    }
}
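The three steps above (nest each tree under its forest, drop the foreign key, add a "type" tag and a "generalinfo" section) can be sketched in Python roughly as follows. This is only an illustration on a trimmed copy of the data, not the DivConq utility mentioned later; the function name `denormalize` is made up.

```python
def denormalize(forests, famoustrees):
    """Fold the famoustrees table into the forests table.

    Assumes every tree's "forestID" names an existing forest; a dangling
    foreign key would raise KeyError in this simple sketch.
    """
    result = {}
    for forest_id, info in forests.items():
        # Original non-ID forest fields move into a "generalinfo" section.
        result[forest_id] = {"generalinfo": dict(info)}
    for tree_id, tree in famoustrees.items():
        # Tag the entry with its original table and drop the foreign key.
        entry = {"type": "famoustree"}
        entry.update({k: v for k, v in tree.items() if k != "forestID"})
        result[tree["forestID"]][tree_id] = entry
    return result


forests = {
    "forest003": {"name": "Black Forest"},
    "forest127": {"name": "Lonely Grove"},
}
famoustrees = {
    "tree12345": {"forestID": "forest003", "name": "Der Tree", "species": "Red Oak"},
    "tree32345": {"forestID": "forest003", "name": "Das Ubertree", "species": "Blue Spruce"},
}

denormalized = denormalize(forests, famoustrees)
print(sorted(denormalized["forest003"]))
```

A forest with no famous trees, such as "Lonely Grove", simply ends up with a "generalinfo" section and nothing else.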

Ready for Cassandra?

There are really only two types of JSON data structures that can be imported directly into Cassandra.  One is the
keyspace->columnfamily->rowkey->column
data structure shown below:

{
  "keyspace":
    {
    "columnfamily" :
      {
      "rowkey" :
        {
          "column name" : "column value"
        }
      }
    }
}

Add another layer and you get the other supported data structure
keyspace->columnfamily (a.k.a. “supercolumnfamily”)->rowkey->supercolumn (a.k.a. “subcolumn”)->column
shown below:

{
  "keyspace":
    {
    "columnfamily" :
      {
      "rowkey" :
        {
        "supercolumn" :
          {
          "column name" : "column value"
          }
        }
      }
    }
}

That’s it: if you can get your data to fit into one of those two JSON structures, your data is ready to be input into Cassandra.

You probably suspect that I wouldn’t have taken you this far if our forests data wasn’t ready for Cassandra, but please take a moment to scroll up and see if you can figure out whether our denormalized forests data uses supercolumns or not.

Let’s break it down:
biologicalfeatures -> forests
…matches the keyspace->columnfamily structure used by both supported JSON structures.

As for the rest:
forest003 -> generalinfo -> (name=”Black Forest”)
…matches the rowkey->supercolumn->column structure used by the “supercolumn” supported JSON structure.

So, yes, we had to use supercolumns to denormalize the forests and famoustrees tables properly.
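The same check can be done mechanically. The following Python sketch looks at what sits under a row key and reports which of the two supported shapes it matches. The function name `classify_row` and the rule of rejecting mixed rows are assumptions made for this example.

```python
def classify_row(row):
    """Classify the contents of one row key.

    Returns "standard" when every value is a plain string (rowkey->column),
    "super" when every value is a dict of strings (rowkey->supercolumn->column),
    and raises ValueError for empty or mixed rows.
    """
    values = list(row.values())
    if values and all(isinstance(v, str) for v in values):
        return "standard"
    if values and all(
        isinstance(v, dict) and all(isinstance(x, str) for x in v.values())
        for v in values
    ):
        return "super"
    raise ValueError("row fits neither supported structure")


toy_row = {"Price": "29.99", "Section": "Action Figures"}
forest_row = {
    "generalinfo": {"name": "Black Forest", "trees": "two million"},
    "tree12345": {"type": "famoustree", "name": "Der Tree"},
}

print(classify_row(toy_row))
print(classify_row(forest_row))
```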

Next Steps

Next we’ll perform this type of analysis on the Northwind JSON structure exported in a previous article.

Doing this type of denormalization by hand would be a large PITA, so DivConq created a utility to do this automatically. The article after that shows how to use that DivConq utility and a few more like it to complete the conversion of the Northwind JSON into Cassandra-ready format.

Soon after that I will also cover how to import the final JSON data directly into Cassandra – stay tuned!


Posted in Beginner, Cassandra, nosql

 

What’s the difference between a supercolumnfamily and a columnfamily with subcolumns in Cassandra?

13 Jul

This article covers the difference between a supercolumnfamily and a columnfamily with subcolumns in Cassandra.

First, remember that in Cassandra terminology, “subcolumn” = “supercolumn” = “sub column” = “super column”.

With that in mind, a “super column family” is really just a “column family…that contains super columns under its rows”.  (As opposed to a regular “column family” that merely contains rows without supercolumns.)

The confusion comes about because “super column family” entries look like this:

<ColumnFamily Name="Super1"
              ColumnType="Super"
              CompareWith="BytesType"
              CompareSubcolumnsWith="BytesType" />

…and plain old “column family” entries look like this:

<ColumnFamily Name="Regular1"
              CompareWith="BytesType" />

…both use a tag named “ColumnFamily” in Cassandra’s “storage-conf.xml” definition file.

Personally, I prefer using the term “Column Family” to cover both column families with rows that contain supercolumns as well as column families with rows that don’t contain supercolumns.  But if someone uses the term “super column family” they always mean “a column family that contains rows that contain supercolumns.”
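If you want to see at a glance which of your column families are "super", a short Python script can read the "ColumnFamily" tags and check the "ColumnType" attribute. This is a sketch only: the wrapping `<ColumnFamilies>` element is invented here so the two example tags form one parseable document, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# The two example tags from this article, wrapped in a made-up root element.
SNIPPET = """
<ColumnFamilies>
  <ColumnFamily Name="Super1"
                ColumnType="Super"
                CompareWith="BytesType"
                CompareSubcolumnsWith="BytesType" />
  <ColumnFamily Name="Regular1"
                CompareWith="BytesType" />
</ColumnFamilies>
"""


def find_super_column_families(xml_text):
    """Map each column family name to True if it is declared ColumnType="Super"."""
    root = ET.fromstring(xml_text)
    return {
        cf.get("Name"): cf.get("ColumnType") == "Super"
        for cf in root.iter("ColumnFamily")
    }


families = find_super_column_families(SNIPPET)
print(families)
```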


Posted in Beginner, Cassandra

 

What’s the Difference Between a SuperColumn and a SubColumn in Cassandra?

12 Jul

This article covers the difference between a supercolumn and a subcolumn in Cassandra.

Let me cut to the chase: there is no difference.  They are two terms for exactly the same thing.

If you are familiar with a typical keyspace->column family->row->super column->column structure, such as the one pictured below, then you could safely replace all instances of the phrase “super column” with “subcolumn” without changing the meaning.

The confusion around “super column” vs. “sub column” is fueled largely by the Cassandra configuration file.  In your “storage-conf.xml” file you will see XML “ColumnFamily” configuration elements like this:

      <ColumnFamily Name="Super1"
                    ColumnType="Super"
                    CompareWith="BytesType"
                    CompareSubcolumnsWith="BytesType" />

If this were a plain old “ColumnFamily” entry, you would only see this:

      <ColumnFamily Name="Regular1"
                    CompareWith="BytesType" />

…but this is a “Super Column Family”, so there are two extra attributes:

  • ColumnType="Super" to tell Cassandra that this column family will contain super columns.
  • CompareSubcolumnsWith="BytesType" to tell Cassandra that our sub columns will be sorted through byte-by-byte comparison.

Confused?  If so, go back and read the last two bullets again while telling yourself:
“super column = sub column = supercolumn = subcolumn…”


Posted in Beginner, Cassandra

 

Why Cassandra

09 Jul

In my previous two posts – Simple, Secure and Speedy part one and part two – I pointed out that one of the most limiting factors in scaling web applications is the database. So while we could keep it simple and reasonably secure using commonly used components (PHP, MySQL, lighttpd, Nginx), the scalability story is where the issues begin. In part two I pointed out that both the application servers and the reverse proxy servers could scale fairly well. But what about the database?

Architecturally there are three topics worth considering when looking to scale a database. First is performance, second is local availability, and third is global availability. Clustering a relational database, such as MySQL, does provide performance and local availability – but it does not provide global availability. In other words, it does not span data centers and geography.

There are solutions for spanning data centers with relational databases – SQL Server has some replication features to help, as do a few other projects. From a cloud perspective Amazon Relational Database Service is an interesting solution to the problem.

While I’m glad there are possible solutions out there, my main issue is that relational databases really are not designed to be distributed. I think we can do better. As a veteran of nosql (aka post-relational) database development I think that ‘nosql’ style databases will be easier to develop on and deploy/manage in a distributed (multi-data center / global) world. Especially if you want to maintain provider neutrality or if you want on-premises (or hybrid) deployments.

BTW, if you think nosql hasn’t been around long – think again – MUMPS was an excellent example of nosql concepts and is alive and well in the form of GT.M and at the roots of InterSystems Caché®. There is even multi-data center support. As much as I appreciate these projects, and the many other nosql projects out there, in the end I think Apache Cassandra has some of the best ideas in terms of balance of speed and accuracy and overall good design.

As such, you’ll be seeing many more blog posts about Cassandra. Do I see any drawbacks to Cassandra? Well, yes. The biggest is that secondary index support is all manual – but that is par for the course with nosql. Another is the lack of stored procedures, though Hadoop and Pig kind of address that. Other than that, my biggest gripe is the poor naming of the data model: the words ‘columns’, ‘rows’ and ‘super’. If you feel this way too and are still trying to get Cassandra’s data model, then consider this revised naming scheme – once you get it, moving back to the columns/super columns terminology may be easier.

Here is the short and sweet intro to Cassandra with the revised names:

Cassandra can hold one or more data stores per node. Each data store is named. Each data store has one or more families of data. Each family contains zero or more family keys. A family key is always a string value. When present a family key contains one or more field keys. The data stores and the families they contain must be given a name in the storage-conf.xml file. Furthermore the data type of the field key must be declared.

There are two kinds of families: Standard and Extra. Using the Cassandra CLI you can interact with the two families as such:

Standard Family

set <store_name>.<family name>['<family key>']['<field key>'] = '<value>'

Extra Family

set <store_name>.<family name>['<family key>']['<extra key>']['<field key>'] = '<value>' 

Again – the ‘<store_name>.<family name>’ part needs to be declared in the config. The data type of the family key is always string. The data type of the field key (and the extra key, if present) must also be declared.

The definitions look something like this:

<Store Name="<your store name>">
<Family Name="<your family name>" CompareWith="<field key type>" />
<Family Name="<your family name>" Type="Extra" CompareWith="<extra key type>" CompareSubcolumnsWith="<field key type>" />
</Store>

That’s it. There are no other family types, no other configuration to worry about – very simple once you take the words ‘column’, ‘super’ and ‘row’ out of the mix. Look back at the various descriptions of the Cassandra data model and see if there is more clarity now.
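For readers who think better in code, here is a toy Python model of the two "set" forms using the revised names. It keeps a single in-memory data store and is in no way how Cassandra stores anything; `make_store` and `set_value` are names invented for this sketch.

```python
from collections import defaultdict


def make_store():
    """Create an empty, arbitrarily nested dict-of-dicts."""
    return defaultdict(make_store)


def set_value(store, family, family_key, *keys, value):
    """Model the two CLI 'set' forms.

    Pass one key (the field key) for a Standard family, or two keys
    (the extra key, then the field key) for an Extra family.
    """
    if len(keys) not in (1, 2):
        raise ValueError("expected a field key, or an extra key plus a field key")
    node = store[family][family_key]
    for key in keys[:-1]:
        node = node[key]
    node[keys[-1]] = value


store = make_store()
# Standard family: family key -> field key -> value.
set_value(store, "Toys", "GumDrop", "Price", value="0.25")
# Extra family: family key -> extra key -> field key -> value.
set_value(store, "ToyStores", "Ohio Store", "GumDrop", "Price", value="0.25")
print(store["Toys"]["GumDrop"]["Price"])
```

The only difference between the two families in this model is one additional level of nesting, which is the whole point of the renaming exercise.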

Also, these diagrams might help with the terminology – thanks to Chaker Nakhli for that helpful article.

On this blog we will provide some real world examples of smallish applications modeling data in Cassandra.


What is Cassandra’s “system” keyspace used for?

07 Jul

The “system” keyspace in Cassandra is a built-in keyspace. Unlike built-in databases or keyspaces in other database software, Cassandra’s system keyspace is NOT used for long-term storage of significant amounts of configuration information. Instead, this keyspace is mostly used to coordinate updates of user data between multiple Cassandra systems.

There are essentially two groups of information stored in the Cassandra system keyspace. One is information about the current node, such as the “token”, “cluster name” (e.g., “Test Cluster” in the default “storage-conf.xml”), and an “automatic bootstrap” flag that allows nodes added for performance reasons in larger clusters to automatically grab data and come up when ready. All of these settings are originally configured in your “storage-conf.xml” file.

The second type of information stored in the Cassandra system keyspace is used to cache and forward updates for other nodes, especially nodes that are temporarily down. (The second type of information is known in the Cassandra world as “hinted handoff” data.)

On disk you will find the physical files that make up Cassandra’s system keyspace in locations such as:
D:\var\lib\cassandra\data\system\LocationInfo-1-Data.db
D:\var\lib\cassandra\data\system\LocationInfo-1-Filter.db
D:\var\lib\cassandra\data\system\LocationInfo-1-Index.db
D:\var\lib\cassandra\data\system\LocationInfo-2-Data.db
D:\var\lib\cassandra\data\system\LocationInfo-2-Filter.db
D:\var\lib\cassandra\data\system\LocationInfo-2-Index.db

I’m using Cassandra 0.6.2 on Windows 7 here – you could imagine where this would be on a *nix system ;)

If you’ve made it this far you may notice that Cassandra uses a different subfolder in “data” for each keyspace and THREE files for each Column Family within a keyspace. The Cassandra system keyspace uses the same structure: one data file of “sorted strings” for the key/value pairs that make up each column, one index file made up of row keys plus the location of the data (a.k.a. the “offset”) in the data file and one “filter” file that summarizes all the keys (in “bloom filter” format, if you’re interested).
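If "bloom filter" is a new term, the following Python sketch shows the general idea: a compact bit array that can say "definitely not present" or "possibly present" for a key without storing the key itself. This is a generic textbook-style filter written for illustration. The size, the number of hashes and the use of SHA-256 are arbitrary choices for the sketch and are not taken from Cassandra's implementation.

```python
import hashlib


class BloomFilter:
    """A tiny bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, key):
        # Derive several bit positions from the key by salting one hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


key_filter = BloomFilter()
for key in ("Transformer", "GumDrop", "MatchboxCar"):
    key_filter.add(key)

print(key_filter.might_contain("GumDrop"))
```

A key that was added always reports as possibly present; a key that was never added usually, but not always, reports as absent, which is why a filter like this is useful for skipping files that cannot contain the key.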

Long story short – don’t worry about Cassandra’s system folder until you’re ready to work with multiple Cassandra servers across a distributed cluster. Even then, if things are running normally, you still shouldn’t have much*, if any, interaction with the contents of this built-in keyspace.

* = unless you decide to change the name of your cluster.


Posted in Beginner, Cassandra

 

How to Add and Retrieve Data from a Cassandra Database

06 Jul

This article describes how to create a new keyspace on a Cassandra database server, how to add data to that keyspace and how to run some simple queries against that data.

If you do not know how to install and run Cassandra, please read my earlier article on that subject first.

You should also know a little about XML: specifically how to find and fix unmatched XML tags in case you typo during the following instructions.  (Hint: file-by-file backups are a good idea until you get comfortable.)

Create a New Keyspace on Cassandra

First, sign on to your Cassandra server using the “cassandra-cli” client.  Use the “show keyspaces” command to ensure you have a live connection to the server and to make sure the keyspace you are about to add doesn’t already exist.

cassandra> show keyspaces
Keyspace1
system

These two keyspaces were installed automatically when you installed Cassandra and are completely independent of one another – like separate databases on a relational database system would be.  A diagram of two separate keyspaces in our Cassandra database would look like this:

We want to add a new keyspace called “ToyStore”.  Once we’re done, we’d expect our diagram to look like this:

Now, we’re working with the current stable version (“0.6.3” at the time I wrote this article), so we’ll need to add our new keyspace by hand.  Close down your server and the client and then go into your Cassandra “conf” directory.  Open the “storage-conf.xml” file with your favorite text editor.

Around line 60 you’ll see a tag that looks like this:

<Keyspaces>
  <Keyspace Name="Keyspace1">

When you scroll further you’ll come to the end of the “Keyspaces” section.  Note that there is no ‘<Keyspace Name="system">’ entry; all entries here are for “user” databases.

After the XML tags for “Keyspace1”, please add the following XML snippet to create a new, empty keyspace called “ToyStore”.  This snippet also creates a “column family” (if you’re a relational database user think “table” for now) called “Toys”.

<Keyspace Name="ToyStore">
  <ColumnFamily Name="Toys" CompareWith="UTF8Type" />
  <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
  <ReplicationFactor>1</ReplicationFactor>
  <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
</Keyspace>

The “CompareWith” attribute in the “ColumnFamily” tag you just added controls how column names are compared and sorted within each row.  The “UTF8Type” value indicates that column names are compared as UTF-8 strings.  Other possible values include “AsciiType”, “LongType” (64-bit long integers) and “BytesType” (straight byte-by-byte comparison – the default value).
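Why does the comparison type matter? The Python sketch below sorts the same three numeric names three ways: as integers, as text, and as raw bytes. It only mimics the spirit of the "LongType", "UTF8Type"/"AsciiType" and "BytesType" options and is not Cassandra code; the sample names and the 8-byte big-endian encoding are assumptions made for the example.

```python
import struct

names = [10, 2, 1]  # hypothetical numeric names, deliberately out of order

# LongType-style ordering: compare the values as integers.
long_order = sorted(names)

# UTF8Type/AsciiType-style ordering: compare the decimal text character by character.
text_order = sorted(str(n) for n in names)

# BytesType-style ordering: compare raw bytes, here each value packed as
# a big-endian signed 64-bit integer.
bytes_order = [struct.unpack(">q", b)[0] for b in sorted(struct.pack(">q", n) for n in names)]

print(long_order)
print(text_order)
print(bytes_order)
```

Sorted as text, "10" lands before "2", which is rarely what you want for numeric names and is a good reason to pick the comparison type deliberately.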

The other tags here (“ReplicaPlacementStrategy”, “ReplicationFactor” and “EndPointSnitch”) set a few replication options we’ll get into later.
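Since a typo such as a missing ">" is easy to make when editing XML by hand, it can help to check that your snippet is well-formed before restarting the server. Here is a minimal Python sketch using the standard library's XML parser. It checks only well-formedness, not whether Cassandra will accept the settings, and the shortened snippets and function name are made up for the example.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text parses as XML, otherwise the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return str(exc)
    return None


GOOD = """<Keyspace Name="ToyStore">
  <ColumnFamily Name="Toys" CompareWith="UTF8Type" />
  <ReplicationFactor>1</ReplicationFactor>
</Keyspace>"""

# The same snippet with the closing ">" of one tag left off.
BAD = GOOD.replace("</ReplicationFactor>", "</ReplicationFactor")

print(check_well_formed(GOOD))
print(check_well_formed(BAD))
```

To check a whole file, you could read your "storage-conf.xml" into a string and pass it to the same function.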

Save your changes and restart your Cassandra server.  If you can connect with the Cassandra client, issue the following command; if you see “ToyStore” in the result, you’ve succeeded in adding your first keyspace to Cassandra!

cassandra> show keyspaces
Keyspace1
system
ToyStore

Add Data To An Existing Keyspace on Cassandra

Now that we have a new “ToyStore” keyspace it’s time to add some data.  If you were watching closely you’ll have noticed that we did more than add a keyspace in the previous step: we added our first “column family” too.  (Think “table” if you’re coming from a relational database background.)

To get started adding data, restart your Cassandra client and use the following syntax to add six name/value pairs to the “Toys” column family of your new “ToyStore” keyspace.

cassandra> set ToyStore.Toys['Transformer']['Price'] = '29.99'
Value inserted.
cassandra> set ToyStore.Toys['GumDrop']['Price'] = '0.25'
Value inserted.
cassandra> set ToyStore.Toys['MatchboxCar']['Price'] = '1.49'
Value inserted.
cassandra> set ToyStore.Toys['Transformer']['Section'] = 'Action Figures'
Value inserted.
cassandra> set ToyStore.Toys['GumDrop']['Section'] = 'Candy'
Value inserted.
cassandra> set ToyStore.Toys['MatchboxCar']['Section'] = 'Vehicles'
Value inserted.

If you run a “help” command from the Cassandra client you will see the following syntax for the kind of “set” command we just used:

set <ksp>.<cf>['<key>']['<col>'] = '<value>'

Let’s break this command syntax down using one of the commands we just typed.

set ToyStore.Toys['Transformer']['Price'] = '29.99'

According to our command syntax, the command we typed meant this:

  • ksp = KeySpace = “ToyStore”
  • cf = Column Family = “Toys”
  • key = Row Key (an indexed key which links multiple columns) = “Transformer”
  • col = Single Column Name (the name in a single name/value pair) = “Price”
  • value = Single Column Value (the value in a single name/value pair) = “29.99”
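The breakdown above can also be expressed as a small Python parser that splits a "set" command into its five parts. The regular expression is written for this article's simple examples only and is not the grammar the Cassandra client actually uses; `parse_set` is a made-up name.

```python
import re

# Matches: set <ksp>.<cf>['<key>']['<col>'] = '<value>'
SET_PATTERN = re.compile(
    r"^set\s+(?P<ksp>\w+)\.(?P<cf>\w+)"
    r"\['(?P<key>[^']*)'\]\['(?P<col>[^']*)'\]"
    r"\s*=\s*'(?P<value>[^']*)'$"
)


def parse_set(command):
    """Split a simple 'set' command into keyspace, column family, key, column and value."""
    match = SET_PATTERN.match(command.strip())
    if match is None:
        raise ValueError("not a recognised set command: " + command)
    return match.groupdict()


parts = parse_set("set ToyStore.Toys['Transformer']['Price'] = '29.99'")
print(parts)
```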

These six commands created a total of three rows in the “Toys” column family: “Transformer”, “GumDrop” and “MatchboxCar”.  Within each row you created two columns: “Section” and “Price”.   Sketched out in a diagram the data you inserted would look something like this:

Within the “Toys” column family, you could also represent this data in JSON like this:

{
  "Transformer" : {
    "Price" : "29.99",
    "Section" : "Action Figures"
  },
  "GumDrop" : {
    "Price" : "0.25",
    "Section" : "Candy"
  },
  "MatchboxCar" : {
    "Price" : "1.49",
    "Section" : "Vehicles"
  }
}

Starting to make sense? Now, let’s try to pull this data back.

Retrieve Data From An Existing Keyspace on Cassandra

Let’s start by counting the number of name/value pairs (i.e., “columns”) stored under one of the row keys we just inserted.

cassandra> count ToyStore.Toys['GumDrop']
2 columns

If you followed directions, the answer will be “2 columns”, whether you use “GumDrop”, “Transformer” or “MatchboxCar” as your row key.

Now try spelling out the row key in all lowercase.

cassandra> count ToyStore.Toys['gumdrop']
0 columns

Yes, Cassandra row keys are case-sensitive. Consider yourself warned, especially if you’re coming from a database environment where cases are insensitive.

Now try a row key that doesn’t exist.

cassandra> count ToyStore.Toys['RedMatterBall']
0 columns

Notice that you didn’t get a “no column exists” error on your count statement; instead you were simply told that zero name/value pairs exist for your non-existent row key.

Now that you know to be careful with the exact name and case of your row keys, let’s pull back the data in a particular row instead of just counting how many columns it contains. To do this, use the “get” command as shown below.

cassandra> get ToyStore.Toys['GumDrop']
=> (column=Section, value=Candy, timestamp=1278132493790000)
=> (column=Price, value=0.25, timestamp=1278132306875000)
Returned 2 results.

The two “column” and “value” entries look familiar but there’s a third item in each of our columns: “timestamp”. That value represents the time when you made each column entry. Timestamp may not mean much to us yet (we will safely ignore it for another article or two), but timestamp will mean a great deal to us when we start merging column inserts/updates from two or more Cassandra database nodes.
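To get a feel for these numbers, here is a Python sketch that decodes a timestamp and picks a winner between two versions of the same column. It assumes the value counts microseconds since the Unix epoch (which fits the magnitude of the numbers shown) and uses a simplified "latest timestamp wins" rule; the function names and the second, conflicting value are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_datetime(timestamp_micros):
    """Convert microseconds since the Unix epoch to a UTC datetime."""
    return EPOCH + timedelta(microseconds=timestamp_micros)


def reconcile(version_a, version_b):
    """Given two (value, timestamp) pairs for one column, keep the newer one."""
    return version_a if version_a[1] >= version_b[1] else version_b


written = to_datetime(1278132493790000)
print(written.isoformat())

# Two hypothetical updates to the same column arriving from different nodes.
winner = reconcile(("0.25", 1278132306875000), ("0.30", 1278132306875001))
print(winner[0])
```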

By the way, here’s how you could represent the timestamp on each column in your diagram:

But back to our data retrieval task. Before we move on, try at least one row key that doesn’t exist.

cassandra> get ToyStore.Toys['RedMatterBall']
Returned 0 results.

Again, note that Cassandra reports that there are “0 results” for this row key, not that this row key doesn’t exist.

The last thing we’re going to do in this article is drill down into an existing row and only pick out one column (i.e., one name/value pair).

cassandra> get ToyStore.Toys['GumDrop']['Price']
=> (column=Price, value=0.25, timestamp=1278132306875000)

Now try this with a valid row key and an invalid column.

cassandra> get ToyStore.Toys['GumDrop']['Taste']
Exception null

This time we got an error rather than a “count of zero” message!

Relational database folks, are you starting to see the pattern? (Hint: Using non-existent row keys is like executing a “SELECT COUNT(*) FROM DB” with a WHERE clause that matches nothing, but using non-existent column names is like executing a query with invalid fields.)
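The behaviour seen in this section can be mimicked with a few lines of Python over a plain dictionary. This is a model of what the client printed above, not of how Cassandra works internally; the function names and the `ColumnNotFound` exception are invented for the sketch.

```python
class ColumnNotFound(Exception):
    """Raised when a single requested column does not exist."""


toys = {
    "Transformer": {"Price": "29.99", "Section": "Action Figures"},
    "GumDrop": {"Price": "0.25", "Section": "Candy"},
    "MatchboxCar": {"Price": "1.49", "Section": "Vehicles"},
}


def count(column_family, row_key):
    """Number of columns under a row key; an unknown row key simply counts as 0."""
    return len(column_family.get(row_key, {}))


def get_row(column_family, row_key):
    """All columns under a row key; an unknown row key yields an empty result."""
    return dict(column_family.get(row_key, {}))


def get_column(column_family, row_key, column_name):
    """One column value; asking for a column that is not there is an error."""
    try:
        return column_family[row_key][column_name]
    except KeyError:
        raise ColumnNotFound(f"{row_key}/{column_name}") from None


print(count(toys, "GumDrop"), count(toys, "gumdrop"))
print(get_row(toys, "RedMatterBall"))
print(get_column(toys, "GumDrop", "Price"))
```

Note that keys are matched exactly, so "gumdrop" and "GumDrop" are different row keys in this model, just as they were in the client session.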

For more information about columns, row keys and another Cassandra structure called a “super column” (or “subcolumn”), please see my recent article on that subject.

Troubleshooting

If you encounter a “Fatal error: Invalid replicaplacementstrategy class org.apache.cassandra.locator.RackUnawareStrategy” while trying to start up your Cassandra server, this may be caused by spaces or newlines in your “ReplicaPlacementStrategy” keyspace definition. (Remember this definition is found in your “storage-conf.xml” file.) Make sure this line looks like: “<ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>” – with NO spaces or other spacing characters anywhere within this tag!

If you encounter a “supercolumn parameter is not optional for super CF” error when you try to issue an otherwise valid “set” command from the Cassandra client, you probably incorrectly defined the column family referenced in your set command as a “super” column family in your keyspace definition. To fix, take out the ‘ColumnType="Super"’ and ‘CompareSubcolumnsWith="UTF8Type"’ attributes from that column family’s definition in your “storage-conf.xml” file, then restart the Cassandra server.


Posted in Beginner, Cassandra

 

Simple, Secure and Speedy – Part Two

05 Jul

Amazon’s EC2 (Elastic Compute Cloud) enables a user to deploy a fully customized Linux or Windows server in the Amazon cloud. Beyond the security provided by the built-in (Windows or Linux) firewall, EC2 provides another layer of firewall called a security group.

A security group may be configured to filter IP traffic – to allow access to only certain ports from certain Internet address ranges. IP traffic that does not pass the filter rules will never even touch the server. All servers running in a security group assume the same filter rules.

Furthermore, a security group may be configured to allow access only from another security group, thus enabling very secure multi-tier deployments in Amazon’s EC2.
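The filtering idea can be sketched in a few lines of Python: a list of rules, each allowing one port from one address range, and a check that rejects everything else. This is a conceptual model only and does not use or imitate the EC2 API; the rule format, the sample address ranges and the function name are all assumptions for the example.

```python
import ipaddress

# Hypothetical rules: HTTPS from anywhere, SSH from one office range only.
RULES = [
    {"port": 443, "source": "0.0.0.0/0"},
    {"port": 22, "source": "203.0.113.0/24"},
]


def is_allowed(rules, source_ip, port):
    """Return True if any rule allows this port from this source address."""
    address = ipaddress.ip_address(source_ip)
    return any(
        rule["port"] == port and address in ipaddress.ip_network(rule["source"])
        for rule in rules
    )


print(is_allowed(RULES, "198.51.100.7", 443))
print(is_allowed(RULES, "198.51.100.7", 22))
print(is_allowed(RULES, "203.0.113.9", 22))
```

Anything that matches no rule is dropped, which mirrors the "traffic that does not pass the filter rules never touches the server" behaviour described above.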

[Figure: private tier in Amazon's EC2]



Installing and Running the Cassandra Database

05 Jul

This article provides instructions to download, install and run two major components (one client, one server) of the Cassandra database on your Windows system.

If you do not know what Cassandra is, please read my earlier article on this subject first.

Prerequisites

Cassandra is written in Java.  Please make sure you have installed Java version 6 or a more recent version before proceeding.

Installing Java also usually sets up proper values of the JAVA_HOME and PATH environment variables.  If you see complaints about “JAVA_HOME” or you see errors like “‘java’ is not recognized as an internal or external command”, please see the “JAVA_HOME and PATH Environment Variables” section near the bottom of this article.
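If you happen to have Python installed, a quick way to see what your environment looks like before launching Cassandra is a sketch like the one below. It only reports what it finds and does not verify the Java version; the function name is made up.

```python
import os
import shutil


def check_java_environment(environ=None):
    """Report the JAVA_HOME setting and where (if anywhere) 'java' is found on PATH."""
    env = os.environ if environ is None else environ
    return {
        "JAVA_HOME": env.get("JAVA_HOME"),
        "java_on_path": shutil.which("java", path=env.get("PATH")),
    }


report = check_java_environment()
print(report)
```

A value of None for either entry suggests that the matching environment variable still needs to be set up.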

Downloading and Expanding Cassandra

Download a package of Cassandra binary-format programs from http://cassandra.apache.org/download/.

Save this file as “apache-cassandra-0.6.3-bin.tar.gz” (with the version number set appropriately).

Unpack this *.tar.gz file using PeaZip or your favorite archive utility into a new folder of your choice.  I used a new folder called “C:\Work\apache-cassandra-0.6.3”.  Your unpacked files should look like this in Windows Explorer:

Running the Cassandra Server

Open up a command-line window.  Run this command so you’ll be able to tell which window is running the client and which is running the server later:

TITLE CassServer

CD into your “C:\Work\apache-cassandra-0.6.3” folder, type “bin\cassandra” and hit “Enter”.  Behind the scenes this kicks off a batch file (remember, this is a Java application) that checks to make sure you’ve done your homework on the prerequisites and then attempts to launch the server.

If all goes well, you will see something like this.

C:\Work\apache-cassandra-0.6.3>bin\cassandra
Starting Cassandra Server
Listening for transport dt_socket at address: 8888
INFO 14:20:21,220 Auto DiskAccessMode determined to be standard
INFO 14:20:22,482 Saved Token not found. Using 34630284372509656815837333105728952419
INFO 14:20:22,482 Saved ClusterName not found. Using Test Cluster
INFO 14:20:22,492 Creating new commitlog segment /var/lib/cassandra/commitlog\CommitLog-1278012022492.log
INFO 14:20:22,582 LocationInfo has reached its threshold; switching in a fresh Memtable at CommitLogContext(file='/var/lib/cassandra/commitlog\CommitLog-1278012022492.log', position=419)
INFO 14:20:22,602 Enqueuing flush of Memtable-LocationInfo@14471083(169 bytes, 4 operations)
INFO 14:20:22,612 Writing Memtable-LocationInfo@14471083(169 bytes, 4 operations)
INFO 14:20:23,313 Completed flushing C:\var\lib\cassandra\data\system\LocationInfo-1-Data.db
INFO 14:20:23,363 Starting up server gossip
INFO 14:20:23,544 Binding thrift service to localhost/127.0.0.1:9160
INFO 14:20:23,554 Cassandra starting up . . .

Note that the Cassandra server will NOT return to the command prompt if it started up successfully. If you see a “Cassandra starting up…” message (without a return to the command prompt), leave the server be and try running the Cassandra client.
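If you would like a second opinion before starting the client, a quick check that something is listening on the Thrift port from the log above (localhost, port 9160) can be sketched in a few lines of Python. This assumes you have Python 3 installed; the function name `port_is_open` is my own invention, and the check only proves that a TCP port accepts connections, not that the listener is Cassandra.

```python
import socket


def port_is_open(host="localhost", port=9160, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False


if __name__ == "__main__":
    if port_is_open():
        print("Something is listening on localhost:9160")
    else:
        print("Nothing is listening on localhost:9160 yet")
```

Run it from a third command-line window so that the server window stays untouched.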

Running the Cassandra Client

Open up a command-line window. Run this command so you’ll be able to tell which window is running the client and which is running the server later:

TITLE CassClient

CD into your “C:\Work\apache-cassandra-0.6.3” folder, type “bin\cassandra-cli -host localhost -port 9160” and hit “Enter”. Behind the scenes this kicks off a batch file (remember, this is a Java application) that checks to make sure you’ve done your homework on the prerequisites and then attempts to launch the client.

If all goes well, you will see something like this.

C:\Work\apache-cassandra-0.6.3>bin\cassandra-cli -host localhost -port 9160
Starting Cassandra Client
Connected to: "Test Cluster" on localhost/9160
Welcome to cassandra CLI.
Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
cassandra>

Yes, it’s an interactive command prompt – that’s exactly what we were hoping for.

Trying Some Easy Commands

I’ll cover how to actually store and retrieve data in a future article, but let’s at least make sure our client is really talking to our server. Run the following two commands to ask the server to return some basic information about the Cassandra server environment to your Cassandra client.

cassandra> show api version
2.1.0
cassandra> show cluster name
Test Cluster

Unlike relational databases like SQL Server, Cassandra doesn’t organize all its structures and data into “databases” – instead it uses a similar concept called “keyspaces”.

To list all the keyspaces on your new Cassandra server, run the following command.

cassandra> show keyspaces
Keyspace1
system

This server has two keyspaces: a built-in “system” keyspace, which contains information about this Cassandra server node’s current state, and a user keyspace called “Keyspace1” that holds data you can insert or read.

Now there’s no “boss key” with Cassandra, so you’ll also need to know how to shut everything down in case you’re trying this on the job at a SQL Server shop. ;)

Shutting Down

To shut down the Cassandra server, go to your “CassServer” window and hit “Ctrl+C”. Answer “y” when asked if you want to terminate the batch file.

To quit the Cassandra command-line client, go to your “CassClient” window and type “exit” at the command prompt. (“Ctrl+C” won’t work here.)

Next Steps: Adding and Retrieving Data

If all is well, please proceed to my article about how to add and retrieve data from your new Cassandra database system.

Troubleshooting

If you get a “NoClassDefFoundError” message while trying to start either the Cassandra client or server, please see my earlier article on that subject.

If you get a “Not connected to a cassandra instance.” message while using the Cassandra client, you probably forgot to specify “-host localhost -port 9160” on the command line.

JAVA_HOME and PATH Environment Variables

Cassandra’s applications (and many other Java applications) need an environment variable called “JAVA_HOME” defined in order to run.  If you only run a single version of Java on your system, you will probably want to set the value of this environment variable at the system level to something like “C:\Program Files\Java\jre6” or “C:\Sun\SDK\jdk” (no “bin”, no “lib” – this is just the home directory).

You should also make sure that any Java application you start from the command-line can find its necessary libraries by setting the PATH environment variable appropriately.  If you only run a single version of Java on your system, you may need to APPEND to the value of the PATH environment variable at the system level.  Your appended values will be something like “;C:\Program Files\Java\jre6\bin” or “;C:\Sun\SDK\jdk\bin” (don’t point to just the home directory here).

On my Windows systems I often have multiple versions of Java installed, so I use an extremely short Windows batch file called “prepforjava.bat” which contains two lines to set these environment variables as needed:

SET PATH=%PATH%;C:\Program Files\Java\jre6\bin
SET JAVA_HOME=C:\Program Files\Java\jre6

This batch file is stored in another folder also referenced by my system-level PATH environment variable so I can call it from any command-line window at any time.
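If you are not sure whether these two variables are set correctly, a small script can tell you before Cassandra complains. The following is only a sketch, assuming Python 3 is installed; the function name `check_java_environment` and the wording of its messages are hypothetical.

```python
import os
import shutil


def check_java_environment(env=None):
    """Return a list of human-readable problems; an empty list means all is well."""
    env = os.environ if env is None else env
    problems = []

    # JAVA_HOME should name the Java home folder itself (no "bin", no "lib").
    java_home = env.get("JAVA_HOME")
    if not java_home:
        problems.append("JAVA_HOME is not set")
    elif not os.path.isdir(java_home):
        problems.append("JAVA_HOME does not point to a folder: " + java_home)

    # shutil.which looks for an executable named "java" in the folders listed in PATH.
    if shutil.which("java", path=env.get("PATH", "")) is None:
        problems.append("'java' was not found on PATH")

    return problems


if __name__ == "__main__":
    for problem in check_java_environment():
        print(problem)
```

Run it in the same command-line window you plan to use for Cassandra, after calling your own equivalent of “prepforjava.bat”. No output means both checks passed.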


Posted in Beginner, Cassandra

 

How do I fix a “NoClassDefFoundError” while attempting to start Cassandra?

03 Jul

When I first tried to run “cassandra-cli” from the “bin” directory after unpacking the Cassandra 0.6.3 package on my Windows system, I got a “NoClassDefFoundError”.  The exact command, context and error are shown below.

C:\Work\apache-cassandra-0.6.3\bin>cassandra-cli -host localhost -port 9160
Starting Cassandra Client
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/cassandra/cli/CliMain
Caused by: java.lang.ClassNotFoundException: org.apache.cassandra.cli.CliMain
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
Could not find the main class: org.apache.cassandra.cli.CliMain. Program will exit.

Java veterans would immediately suspect a problem with the class library search path with this type of error. I did too, so I added some environment variable echo statements to the “cassandra-cli.bat” file and found that the batch file was not pointing to the “lib” directory where Cassandra’s “*.jar” files were kept.

Normally, I’d be tempted to fix this by changing:

for %%i in (%CASSANDRA_HOME%\lib\*.jar) do call :append %%~fi

…in line #28 of my default “cassandra-cli.bat” file to:

for %%i in (%CASSANDRA_HOME%\..\lib\*.jar) do call :append %%~fi

…so my batch file could go back far enough to find the appropriate Cassandra class paths.

However, please don’t do that. Cassandra assumes that you’ll be running all your command-line utilities from the root Cassandra directory (e.g., “C:\Work\apache-cassandra-0.6.3” instead of “C:\Work\apache-cassandra-0.6.3\bin”).

Instead, please just “cd ..” back to your root Cassandra folder and prepend “bin\” to your commands. Better results are shown below.

C:\Work\apache-cassandra-0.6.3>bin\cassandra-cli -host localhost -port 9160
Starting Cassandra Client
Connected to: "Test Cluster" on localhost/9160
Welcome to cassandra CLI.
Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
cassandra>
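To see why the starting folder matters, here is a rough Python sketch (assuming Python 3) of the “lib\*.jar” lookup that the batch file loop performs. It rests on one assumption I want to flag loudly: that CASSANDRA_HOME falls back to the current folder when it is not set. The function name `find_cassandra_jars` and the jar name are invented, and the folder layout is a throwaway imitation of the unpacked package.

```python
import glob
import os
import tempfile


def find_cassandra_jars(cassandra_home):
    """List the *.jar files in the lib folder directly under cassandra_home."""
    return sorted(glob.glob(os.path.join(cassandra_home, "lib", "*.jar")))


# Build a throwaway folder layout that mimics the unpacked package.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "bin"))
    os.makedirs(os.path.join(root, "lib"))
    open(os.path.join(root, "lib", "example.jar"), "w").close()  # hypothetical jar name

    # Assumption: CASSANDRA_HOME falls back to the current folder when unset.
    from_root = find_cassandra_jars(root)
    from_bin = find_cassandra_jars(os.path.join(root, "bin"))

    print(len(from_root))  # 1: the jar is found when starting from the root folder
    print(len(from_bin))   # 0: nothing is found when starting from "bin"
```

An empty jar list means an empty class path, which is exactly the situation that produces the “NoClassDefFoundError” shown earlier.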

Posted in Beginner, Cassandra, Troubleshooting