Dave Johnson on open web technologies, social software and software development
« WIP #6: Feed-based... | Main | How Wayin does Cloud... »
Newer versions of Apache Cassandra include CQL, an SQL-like query language that supports both query, update and delete statements as well as the Data Definition Language (DDL) statements like create and alter for tables and indexes.
You can create tables (known as column families in Cassandra lingo) just like you can in a relational database, but there are some caveats. One caveat is this: if you want to sort and use ORDER BY in queries on a table, then you will have to use composite-key in that table and that composite key must include the field that you wish to sort on. You have to decide which fields you wish to sort on when you design your data model, not when you formulate queries. Another caveat is that, with Cassandra 1.1, there is no support for secondary indexes on composite-keyed tables. That means you can only query on the fields in the composite-key and in certain specific ways. More on that later.
In version 1.1, Cassandra supports (at least) two different models for storing data. You can use a single primary key in your table, or you can use a composite key. Cassandra stores your data differently for these two cases and the queries that you can perform on these two types of tables vary as well.
Ill explain how simple-keyed and composite-keyed differ with some examples. The example is the Bite table, which holds a chunk of data identified by an ID and sorted by an integer score value. I dont want to get into details, but in the product I'm working on, Wayin Hub, a Site contains Feeds and Feeds contain Bites, which might represent Tweets, RSS items or other social network activities. Hopefully, that will be enough background for you to understand the examples below.
With a single keyed table you have a row for each entity that you store and a column in that row for each field of the entity. For example, to represent the Bite table as a single-keyed table it would be defined like so:
create table bite ( id varchar PRIMARY KEY, feedid varchar, score bigint, data varchar ); create index bite_feedid on bite (feedid); create index bite_score on bite (score);
We need those id, feedid and score fields so we can look up bites by those values. The data field is used to store a JSON representation of other data we associate with each Bite. For example, if you only have three Bites in the table, heres what the results of a select * from bite might look like:
id | feedid | score | data ------+----------+---------+---------------------- bite2 | feed0 | 101 | { status : APPROVED, ... bite3 | feed0 | 102 | { status : DENIED, ... bite1 | feed0 | 100 | { status : APPROVED, ...
The way the data is stored in Cassandra would look about the same, as illustrated in the diagram below. Each table row corresponds to a Row in Cassandra, the id of the table row is the Cassandra Row Key for the row. Each value in the row is a Cassandra Column with a key and a value. If you add more table rows, you get more Cassandra Rows.
Now lets get back to the topic of this post and that caveat that I mentioned earlier.
If we want to sort data in a table, then we need to use a composite-keyed table. With a composite-keyed table you define a composite-key made up of multiple fields from the table. The first key is known as the partition-key. To sort by score we also include the score in the composite key. And since it is possible for two Bites with the same partition key to occur at the very same time, we also include varchar ID to ensure uniqueness.
Heres how the Bite table is defined. The composite-key is the list of three fields in PRIMARY KEY parentheses.
create table Bite ( partkey varchar, score bigint, id varchar, data varchar, PRIMARY KEY (partkey, score, id) ) with clustering order by (score desc);
And if you had three Bites in the table the query select * from bite would return this:
partkey | score | id | data ----------+---------+-------+---------------------- feed0 | 102 | bite3 | { id : bite2, ... feed0 | 101 | bite2 | { id : bite3, ... feed0 | 100 | bite1 | { id : bite1, ...
The surprise is how this table is stored in Cassandra. Instead of storing a Cassandra Row for each table row, the data is stored as one row. The more Bites you add to the table, the more Cassandra Columns are added to that Row. The diagram below illustrates how this works for the three Bites of data above. There is one Row with key of feed0. And there is one Column for each table row of data. Each Column uses a key that combines the score and id plus a string that indicates what field is stored in the Column Value, which in our case is only the data field.
With Cassandra 1.1, if you want to select a single Bite, you must know all of the composite-keys. Heres an example query that selects a single Bite:
select data from bite where partkey=feed0 and score=101 and id=bite2
To get latest Bites in a Sites Feed, you specify only the partition-key and ask for ordering by score, like so:
select data from bite where partkey=feed0 order by score desc limit 20
If you try to query without specifying the partition key and the score, you will get an error message. For example, this query:
select data from bite where id=bite2 limit 20
Would give you this error message:
Bad Request: PRIMARY KEY part id cannot be restricted (preceding part score is either not restricted or by a non-EQ relation)
That means we cant look up Bites by a single ID. Thats not very convenient but thats the way it is with Cassandra 1.1 which does not allow additional indexes on composite-key tables.. If you really want to lookup Bites by id, you have to create an entirely new simple-keyed table with Bite id as the primary key and use that table to look up the a Bites partKey and score. This problem is fixed in Cassandra 1.2 because it allows secondary indexes on fields in composite-key table. Lets talk about that.
Cassandra 1.2 comes with support for secondary indexes on composite-keyed tables, but you cannot create a secondary index on keys that are already part of the composite-key. So, if we want to be able to look-up Bites by ID, then we must add a second and redundant biteid field like so:
alter table Bite add biteId varchar; create index Bite_biteId on Bite (biteId);
Inside Cassandra, the the data would look like this, a new field in the table means a new Column in the row, as show below:
And with that secondary index we can support queries like this:
select data from bite where biteid=bite2
In the DataStax Cassandra 1.2 docs, it says CQL3 transposes data partitions (sometimes called "wide rows") into familiar row-based result sets, dramatically simplifying data modeling.
I believe that means that all tables are stored the way that composite-keyed tables are stored. According to the docs, legacy tables from Cassandra 1.1 are supported in 1.2 and, if you want, you can still create Cassandra 1.1 style tables by using the compact storage attribute. For example, to create the Bite table with a the Cassandra 1.1 table model and a single primary key youd do this:
create table bite ( id varchar PRIMARY KEY, feedid varchar, score bigint, data varchar ) with compact storage;
And thats all Ive got on this topic.
I wrote this up to help myself understand how composite-keyed tables work in Cassandra, so Id love any feedback you might have and especially if you think Ive got concepts or terminology wrong. Thanks for reading.
You can read more about Cassandra 1.1 tables on the Datastax site:
http://www.datastax.com/docs/1.1/ddl/column_family
More about Cassandra 1.2 tables:
http://www.datastax.com/docs/1.2/ddl/table
More about Cassandra 1.1 legacy tables in Cassandra 1.2
http://www.datastax.com/docs/1.2/ddl/legacy_table
Also, I found these posts by Brian ONeill very helpful:
http://brianoneill.blogspot.com/2012/09/composite-keys-connecting-dots-between.html
http://brianoneill.blogspot.com/2012/10/cql-astyanax-and-compoundcomposite-keys.html
And, if you want to build an engaging site for your customers, fans or constituents based on live tweets, photos and videos check out Wayin Hub and follow @wayinhub on Twitter.
Dave Johnson in Open Source
01:56PM May 23, 2013
Comments [2]
Tags:
asf
bigdata
cassandra
wayin
wayinhub
This is just one entry in the weblog Blogging Roller. You may want to visit the main page of the weblog
Below are the most recent entries in the category Open Source, some may be related to this entry.
This posting is really helpful for who are new to Cassandra.
Thank you for your posting :-D
Posted by Geunho Khim on October 14, 2013 at 11:29 AM EDT #
Interesting post - I was fascinated by the information , Does anyone know where my assistant can get access to a fillable a form copy to work with ?
Posted by Marjorie Chon on April 04, 2016 at 04:23 AM EDT #