C* Summit 2013: The World’s Next Top Data Model

Fantastic video on data modeling in Cassandra! I think I like this even more than the last!! It is nice to see such a concentration of skill in a community!


This is part 3 of a series! The talks are independent, but if this one seems to go too fast, or you have to stop and look things up, go back and watch #2, or even all of them in order!

1: The Data Model is Dead! Long Live the Data Model http://www.youtube.com/watch?v=px6U2n74q3g

2: Become a Supermodeler https://www.youtube.com/watch?v=qphhxujn5Es


Data Modeling for Cassandra

Aside from getting over my trepidation about eventual consistency, learning to model data for Cassandra and similar NoSQL databases has been the steepest hill for me to climb.

I have 10 years of very amateur relational database (RDB) experience and am only now taking my first real database course in college (which, up until about halfway through, I could have taught). I’m not some data pro, but the aggregate data model is an entirely different beast from the relational model. I’m going to try to give a brief overview of the methodology for newcomers.

  1. Stop trying to port your tables to Cassandra. Quit it. It is likely a waste of time. Spend the time thinking about the requests your application actually makes. The flexibility of the RDB is that you can join tables in very complex ways to make the data fit whatever query you are attempting. C* takes another route: you model your data for how you will use it, duplication be damned. Some denormalization should not be a major worry in the database design. If the pulls are intelligently designed and compartmentalized, the denormalization will be slight.
  2. AGAIN, stop fighting the denormalization. Hard disks are CHEAP, and SSDs are like ultra-cheap RAM for your database. Linux supports SSD-cache-backed RAID arrays now, so unless the sheer mass of your data is your motivator for using NoSQL, rather than the performance scalability and efficiency, get over it. Stop trying to design in data objects and start designing in aggregates.
  3. Once you have started building tables for the various pulls you make, you will likely find that the chief data duplication is in the form of UUIDs and TimeUUIDs. Most likely they simply indicate relations between rows. This is fine so long as the tables are task oriented, so that for any given task you make as few pulls as possible.
  4. Writes: Yes, writes can become a chore. Chances are high that you are smart enough to compartmentalize your data-access code. By keeping all of that code together, when you have to update several tables for consistency, you won’t have to worry about something getting left behind. Avoid the urge to write a “quick little hack”. Everything tends to be used far longer than anticipated, and it could cause a terrible headache down the road. Treat your database like you want it to treat your data! Fortunately for us, Cassandra is amazing at write speed, so when you have to push some changes, you don’t push the load up too much.
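
To make the points above a little more concrete, here is a rough sketch in plain Python. It does not talk to Cassandra at all: two dictionaries stand in for two tables, each shaped for exactly one pull, and a single function owns every write so the duplicated data cannot drift apart. The video example and all of the names (`videos_by_id`, `videos_by_user`, `add_video`, and so on) are made up for illustration, so treat it as a picture of the idea rather than real driver code.

```python
import uuid
from collections import defaultdict

# Hypothetical in-memory stand-ins for two Cassandra tables.
# Each one is shaped for a single query, not for a normalized schema.
videos_by_id = {}                    # pull: "show me this video"
videos_by_user = defaultdict(list)   # pull: "list this user's videos"


def add_video(user_id, title):
    """The one write path: every table that holds this video is updated here."""
    # uuid1() is a time-based (version 1) UUID, the kind Cassandra calls a timeuuid.
    video_id = uuid.uuid1()
    videos_by_id[video_id] = {
        "video_id": video_id,
        "user_id": user_id,
        "title": title,
    }
    # Duplicate just the columns the per-user listing needs,
    # so that listing is answered with one pull and no join.
    videos_by_user[user_id].append({"video_id": video_id, "title": title})
    return video_id


def get_video(video_id):
    """One pull, one table."""
    return videos_by_id[video_id]


def list_videos_for_user(user_id):
    """One pull, one table; newest first (insertion order reversed)."""
    return list(reversed(videos_by_user[user_id]))
```

Notice that the only things repeated across the two "tables" are the ID and the title, and that nothing outside `add_video` ever writes to either of them. In a real application that function would be where the inserts to both tables live.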

A couple of great resources I’ve come across are:

Thanks for stopping by!!