Cassandra + Eucalyptus

We all should know that Cassandra really likes dedicated hardware. Take a bunch of RAM and fast SSDs… divide them up among your cluster and SMILE! However, I’ve been very interested in the on-demand scalability that AWS/EC2 offers. It seems like the answer to the wide demand variance that popular web services can see.

I’ve installed and configured Eucalyptus, and I’m going to document building an image for Cassandra and configuring Cassandra for use in Eucalyptus. I will be trying to scale Cassandra instances into the Amazon cloud with a scaling group, if possible. Eucalyptus is capable of “hybrid cloud” with AWS/EC2, but I’m not sure just how complete the feature set is. This experiment is meant to fill that knowledge gap!

EDIT: One question answered! Cloudbursting is the term for running an application on your private cloud and expanding into the public cloud (like AWS) to handle load spikes! 🙂

Eucalyptus supports just this! I’m so very excited!

Preface: install CentOS 6.4 on your machine(s), install Eucalyptus, and install a CentOS 6.4 image from eustore.

Creating custom images can be a little tricky, especially when you do a yum update. I have found that udev rules keep reinstalling themselves against my will. Here is a description of everything you should check before running euca-bundle-instance: http://www.eucalyptus.com/docs/eucalyptus/3.4/image-guide/ig_task_prepare_image.html

Then you install all the software you want. I installed the Oracle JRE 1.7, JNA, and the DataStax Community RPM repo. I am using a 3 GB root, which is currently 50% full. Once you have confirmed that the image is ready, you can use euca-bundle-instance as described here: http://www.eucalyptus.com/docs/eucalyptus/3.4/index.html#image-guide/img_task_modify_existing_instance_store_image.html
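The package setup looked roughly like this. The repo file and package names are what I recall from the DataStax Community 2.0 docs, and the JRE filename is just a placeholder, so treat it as a sketch rather than a copy-paste recipe:

cat > /etc/yum.repos.d/datastax.repo <<'EOF'
[datastax]
name = DataStax Repo for Apache Cassandra
baseurl = http://rpm.datastax.com/community
enabled = 1
gpgcheck = 0
EOF

rpm -ivh jre-7uXX-linux-x64.rpm   # Oracle JRE 1.7 (downloaded from Oracle; filename is a placeholder)
yum install -y jna                # JNA from the CentOS base repos
yum install -y dsc20              # DataStax Community Cassandra 2.0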

I have found that when loading an image (initial root image remapped to 3-5 GB as you see fit), it is best to specify the kernel and ramdisk even though it shouldn’t be necessary. It might be all in my head, but it seems to be a problem for me.
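In other words, when launching I pass the eki/eri explicitly; something along these lines (all IDs, the keypair, and the instance type are placeholders):

euca-run-instances emi-XXXXXXXX --kernel eki-XXXXXXXX --ramdisk eri-XXXXXXXX -k mykey -t m1.medium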

I’ve run into hiccups with euca (no metadata) so I’ve got some work yet to do! 🙂

EDIT: Fixed the networking issues… iptables rules for 169.254.169.254 had made it onto my NCs (node controllers). wtf? PROBLEM SOLVED!

The easiest way to make your instance, if you are comfortable with CentOS 6.4, is to use eustore to download a 6.4 image. Then you can update it and modify the running instance as desired! Once you have it set up exactly how you want, you can prepare the instance: http://www.eucalyptus.com/docs/eucalyptus/3.4/image-guide/ig_task_prepare_image.html

Pay close attention to the udev rules, which change on update!
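One common way to deal with this (not necessarily exactly what the prepare-image doc prescribes, so double-check against it) is to clear the generated rules and mask the generator right before bundling:

rm -f /etc/udev/rules.d/70-persistent-net.rules
ln -sf /dev/null /etc/udev/rules.d/75-persistent-net-generator.rules   # keeps the rules from being regenerated on the next boot or update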

Then you can bundle the running instance as a new image bundle: http://www.eucalyptus.com/docs/eucalyptus/3.4/index.html#image-guide/img_task_modify_existing_instance_store_image.html
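The bundling step itself looks roughly like this. The instance ID, bucket, and prefix are placeholders, and depending on your setup you may also need to pass your access and secret keys with -o/-w:

euca-bundle-instance i-XXXXXXXX -b cassandra-images -p cassandra-node
euca-describe-bundle-tasks   # wait for the task to show "complete"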

Then register the image and make sure the user has permission to use it (if you used the admin account to bundle the image, modify the image attributes so other users can launch it!)
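Something along these lines, with placeholder IDs; the second command is the “modify the image attributes” step that grants launch permission:

euca-register cassandra-images/cassandra-node.manifest.xml --name cassandra-node --kernel eki-XXXXXXXX --ramdisk eri-XXXXXXXX
euca-modify-image-attribute -l -a all emi-XXXXXXXX   # or add a specific account ID instead of "all"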

I used a simple bash script that detects the instance’s Name tag and mounts (or doesn’t mount) volumes depending on that; make sure permissions are set correctly after mounting. Then it uses a regex replace to fill in the seeds and the node’s IP address in cassandra.yaml (a rough sketch of the script is below). It was a pretty simple process, and it is really easy to migrate your seeds to larger stores when needed. I can start up all my seeds with a single command line, and start an auto-scaling group of nodes as needed as well! I’ve not load tested my SSD-backed 4-vCPU, 4 GB RAM instances yet, but I got nice performance from the dual-core version!
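Here is a minimal sketch of the idea, not my exact script. It assumes euca2ools and credentials are available on the instance for the Name-tag lookup, and the seed IPs, volume device, and tag naming are placeholders you would adjust:

#!/bin/bash
# Sketch: configure Cassandra at boot based on the instance's Name tag.

MY_IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

# Look up this instance's Name tag (crude parse; requires euca2ools credentials on the instance)
NAME=$(euca-describe-tags --filter "resource-id=$INSTANCE_ID" --filter "key=Name" | awk '{print $5}')

SEEDS="10.0.1.11,10.0.1.12,10.0.1.13"   # placeholder seed IPs

# Seed nodes get a dedicated data volume mounted; make sure ownership is correct afterwards
if [[ "$NAME" == seed* ]]; then
    mount /dev/vdb /var/lib/cassandra       # placeholder device name
    chown -R cassandra:cassandra /var/lib/cassandra
fi

# Fill in the seed list and this node's addresses in cassandra.yaml
sed -i "s/^\(\s*- seeds:\).*/\1 \"$SEEDS\"/" /etc/cassandra/conf/cassandra.yaml
sed -i "s/^listen_address:.*/listen_address: $MY_IP/" /etc/cassandra/conf/cassandra.yaml
sed -i "s/^rpc_address:.*/rpc_address: $MY_IP/" /etc/cassandra/conf/cassandra.yaml

service cassandra start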

The data is completely persistent, although I still need to configure an instance for DataStax’s fantastic OpsCenter (which now supports 2.0!) so I can do easy backups. Hopefully saving backups to separate volumes won’t be hard!

Cassandra queries explained (partition keys, clustering columns, requirements) + FREE TRAINING!

One thing I’ve struggled with is query requirements. I never fully understood them until taking the free course at https://datastaxacademy.elogiclearning.com/

 

First, to explain the partition key vs. the clustering keys (together they make up the primary key):

(partitioning_key, optional_clustering_key1, optional_clustering_key2)

The partition key can also be composite (i.e., made up of multiple columns):

( (partitioning_key1, partitioning_key2), optional_clustering_key1, etc. )
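To make that concrete, here is what the two shapes look like in actual CQL (table names and column types are just hypothetical examples):

-- Simple partition key (colA) with two clustering columns (colB, colC)
CREATE TABLE example1 (
    colA text,
    colB text,
    colC text,
    value text,
    PRIMARY KEY (colA, colB, colC)
);

-- Composite partition key (colA, colX) with two clustering columns (colB, colC)
CREATE TABLE example2 (
    colA text,
    colX text,
    colB text,
    colC text,
    value text,
    PRIMARY KEY ((colA, colX), colB, colC)
);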

 

So you probably understand that if you wish to query a table, your WHERE clause must specify the partition key.

 

So for a table with this key definition (colA, colB, colC), your query must look something like this:

select * from table where colA = 'something';

 

This applies to composite partition keys as well! Consider this key definition: ( (colA, colX), colB, colC )

select * from table where colA = 'something' and colX = 'something';

You have to specify both of them; you can’t filter on just one!

 

However, I never understood clustering columns correctly. Again, using the key definition (colA, colB, colC):

If you specify all of the key columns, of course it works fine:

select * from table where colA = 'something' and colB = 'something' and colC = 'something';

If you specify them in order, it also works fine, because rows are grouped on disk by colB and THEN by colC, which allows the following to work:

select * from table where colA = 'something' and colB = 'something';

If you try to skip one of the clustering columns, it will not work, because Cassandra would have to dig into every value of the skipped column to search for the next one. That would be very expensive. I’m unsure if you could enable filtering (ALLOW FILTERING) to make it work anyway, but it shouldn’t be done even if that works. If you have to enable filtering, you are doing it wrong! You need to create a new table instead, with the data grouped the way you want to pull it (see the sketch after the query below). For instance, the following will not work!

select * from table where colA = 'something' and colC = 'something';
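If you genuinely need to query by colA and colC, the “new table” answer looks something like this (hypothetical names again): a second table holding the same data, keyed for that access pattern.

-- Companion table so colC can be queried without colB
CREATE TABLE example1_by_colC (
    colA text,
    colC text,
    colB text,
    value text,
    PRIMARY KEY (colA, colC, colB)
);

-- Now this works, because rows are grouped by colC first:
select * from example1_by_colC where colA = 'something' and colC = 'something';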

I’m really glad I got that behind me. That was some voodoo when I was playing around with cqlsh trying to learn Cassandra.