NoSQL with the Cassandra Storage Engine - MariaDB Cookbook (2014)

MariaDB Cookbook (2014)

Chapter 12. NoSQL with the Cassandra Storage Engine

In this chapter, we will cover the following recipes:

· Installing the Cassandra storage engine

· Mapping data between MariaDB and Cassandra

· Using INSERT, UPDATE, and DELETE with the Cassandra storage engine

· Using SELECT with the Cassandra storage engine

Introduction

One unique feature in MariaDB is the Cassandra storage engine. This is a specialized storage engine, similar to the Connect storage engine featured in Chapter 5, The CONNECT Storage Engine. Like Connect, it allows us to access data stored outside of MariaDB. Unlike Connect, the Cassandra storage engine is specific to a certain type of data, namely, it lets us connect MariaDB to a Cassandra cluster.

In this chapter, there are recipes on installing and configuring the Cassandra storage engine, defining tables that use the storage engine to insert, update, delete, and query data.

The Cassandra storage engine in MariaDB is built and packaged only for Linux-based operating systems. As such, the recipes in this chapter assume that we are using a variant of Linux as we complete them.

Installing the Cassandra storage engine

Before we can use the Cassandra storage engine, we need to enable it.

How to do it...

1. On Red Hat, CentOS, and Fedora distributions, we may have to install a separate Cassandra storage engine package with the following command:

2. sudo yum install MariaDB-cassandra-engine

3. Open the mysql command-line client, connect to our MariaDB server as a user with the SUPER privilege and run the following command:

4. INSTALL SONAME 'ha_cassandra';

5. Still connected to our MariaDB server, run the following command:

6. SHOW VARIABLES LIKE "Cassandra%";

7. Add the following code to the [mysqld] section of our my.cnf file:

8. optimizer_switch = 'join_cache_hashed=on'

join_cache_level = 7

How it works...

The Cassandra storage engine is included with MariaDB, but it is not enabled by default. To enable it, we will run the INSTALL SONAME command. This is a one-time operation.

The output of the SHOW VARIABLES command will look like the following screenshot:

How it works...

These variables can be set the same as any other MariaDB variable. They only exist, however, after the Cassandra storage engine has been installed; so we must not add these variables to our my.cnf file until after we enable the storage engine. If we add them before we run the INSTALL SONAME command, MariaDB will refuse to start.

In our recipe, we make two additions to our my.cnf file. These are used because Cassandra supports batched key access in no-association mode, which means that the SQL layer needs to do the key hashing. The settings we added do that.

There's more...

There are also several status variables that we can query after enabling the Cassandra storage engine. Similar to the Cassandra storage engine variables, they are all prefaced with cassandra_ so that we can search for all of them with the following command:

SHOW STATUS LIKE "Cassandra%";

The output of this command will vary based on how much we have used the Cassandra storage engine. For example, after first installing it, the values will all be zeroes as shown in the following screenshot:

There's more...

See also

· The full documentation of the Cassandra storage engine is available at https://mariadb.com/kb/en/cassandra-storage-engine/

· The full documentation of Cassandra is available at http://cassandra.apache.org/

Mapping data between MariaDB and Cassandra

To access data stored in a Cassandra cluster from MariaDB, we create a special table that defines where the Cassandra cluster we want to connect with is located, and how the data stored there should be treated in MariaDB.

Getting ready

We should complete the Installing the Cassandra storage engine recipe before starting this recipe.

Also, before we can complete this recipe, we need to have a running installation of Cassandra that we can connect to from our MariaDB server. The following are the instructions for installing a single-node Cassandra instance on the same host as we are running MariaDB on.

It is also worth noting that the Cassandra storage engine works better with Cassandra version 1.x than with Cassandra 2.0 and later. This is because of changes to the Cassandra data model and the introduction of password-based authentication. These changes will be addressed in future updates to the Cassandra storage engine.

On Red Hat, CentOS, and Fedora, first ensure that either Java 6 or Java 7 is installed. Refer to http://www.datastax.com/documentation/cassandra/1.2/webhelp/cassandra/install/installJreRHEL.html for instructions on downloading and installing the Java JRE. When installed correctly, the java -version command should output something similar to the following screenshot (the version and build numbers will most likely be different):

Getting ready

After Java is installed, create a datastax.repo file at /etc/yum.repos.d/ and add the following code to it:

[datastax]

name = DataStax Repo for Apache Cassandra

baseurl = http://rpm.datastax.com/community

enabled = 1

gpgcheck = 0

We can now install Cassandra with the following command:

sudo yum install cassandra12

After installing, start Cassandra with the following command:

sudo service cassandra start

On Ubuntu and Debian, we will run the following two commands to add the signing key and the Cassandra repository:

sudo apt-key adv --recv-keys \

--keyserver pgp.mit.edu 4BD736A82B5C1B00

sudo add-apt-repository \

'deb http://www.apache.org/dist/cassandra/debian 11x main'

Then, run the following two commands to update APT and install Cassandra:

sudo apt-get update

sudo apt-get install cassandra

We are now ready for this recipe.

How to do it...

1. If we are running Cassandra 1.1, run the cqlsh command and create a Cassandra keyspace called casstest with the following commands:

2. CREATE KEYSPACE casstest WITH

3. strategy_class = 'org.apache.cassandra.locator.SimpleStrategy'

4. AND strategy_options:replication_factor='1';

5. If we are running Cassandra 1.2, run the cqlsh command and create a Cassandra keyspace called casstest with the following commands:

6. CREATE KEYSPACE casstest

7. WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor': 1};

8. Still in cqlsh, enter the following commands to create a Cassandra column family called test in the casstest keyspace:

9. USE casstest;

10.CREATE columnfamily test01 (

11. pk varchar primary key,

12. data1 varchar,

13. data2 bigint);

14. Still in cqlsh, enter the following commands to create an additional Cassandra column family called notes:

15.CREATE columnfamily notes (

16. note_id int primary key,

17. note_note text

18.); Java support.exit

19. Open the mysql command-line client and connect to our MariaDB database server and connect to the test database (first creating it if necessary):

20.CREATE DATABASE IF NOT EXISTS test;

21.USE test;

22. Run the following commands to create a table in our test database that maps to the test01 column family we created in step 3:

23.CREATE TABLE test01_cass (

24. pk VARCHAR(36) PRIMARY KEY,

25. data1 VARCHAR(60), Java support.

26. data2 BIGINT

27.) ENGINE=cassandra

28. THRIFT_HOST='localhost'

29. KEYSPACE='casstest'

30. COLUMN_FAMILY='test01';

31. Run the following commands to create a table that maps to the notes column family we created in step 4:

32.CREATE TABLE notes_cass (

33. note_id INT PRIMARY KEY,

34. note_note mediumtext

35.) ENGINE=cassandra

36. DEFAULT CHARSET=utf8

37. THRIFT_HOST='localhost'

38. KEYSPACE='casstest'

39. COLUMN_FAMILY='notes';

How it works...

There are two parts for mapping a table in MariaDB to a corresponding column family in Cassandra. First is the actual definition of the columns. These columns must be named the same as they are named in Cassandra and the data types must be compatible.

The second part is what comes after the table definition. This part is where we specify that we want to use the Cassandra storage engine and the connection parameters. We're connecting to a Cassandra instance hosted on our local server, so we specify localhostin the THRIFT_HOST parameter. The other two parameters that we need to specify are the KEYSPACE and COLUMN_FAMILY parameters we are using. These are casstest and test01, respectively.

There's more...

If our Cassandra instance is configured to use a non-standard port, we can use the THRIFT_PORT parameter to set what it is.

Also, if we set the cassandra_default_thrift_host variable in the [mysqld] section of our my.cnf file, we do not have to specify a THRIFT_HOST parameter (unless, of course, we are connecting to a different host).

Some data types in Cassandra and MariaDB are not directly mapped as they are not equivalent. The following table lists the mapping of Cassandra data types to their MariaDB equivalents:

Cassandra

MariaDB

ascii

BLOB, VARCHAR(n), and use CHARSET=latin1

bigint

BIGINT, TINY, SHORT (use which one fits the actual data in Cassandra)

blob

BLOB, VARBINARY(n)

boolean

BOOL

counter

BIGINT (this value is read-only in MariaDB)

decimal

VARBINARY(n)

double

DOUBLE

float

FLOAT

int

INT

text

BLOB, VARCHAR(n), and use CHARSET=utf8

timestamp

TIMESTAMP (for second precision), TIMESTAMP(6) (for microsecond precision), BIGINT (for the actual 64-bit Cassandra timestamp)

uuid

CHAR(36) (Cassandra UUID values are represented as text in MariaDB)

varint

VARBINARY(n)

Also, size limitations in Cassandra are more relaxed than in MariaDB. For those MariaDB data types specified in the previous table with values such as VARBINARY(n), the value of n should be set large enough to handle whatever values are actually found in our Cassandra database.

We may run into instances where this is not possible. For example, Cassandra has a 2 gigabyte limit on its rowkey length. In MariaDB, the limit for unique key lengths is about 1.5 kilobytes. If the actual data in Cassandra goes beyond MariaDB's limits, it may not be possible to access it from MariaDB.

See also

· The full documentation of the Cassandra storage engine is available at https://mariadb.com/kb/en/cassandra-storage-engine/

· The full documentation of Cassandra is available at http://cassandra.apache.org/

Using INSERT, UPDATE, and DELETE with the Cassandra storage engine

Using a Cassandra storage engine table feels much like using any other table, but there are some important differences. This recipe demonstrates it.

Getting ready

We should complete the Mapping data between MariaDB and Cassandra recipe, before starting this recipe. Also, import the isfdb database as described in the Importing the data exported by mysqldump recipe in Chapter 2, Diving Deep into MariaDB, so that we have some data to use.

How to do it...

1. Open the mysql command-line client and connect to our MariaDB database server and then to the test database.

2. Insert some sample data into the test01_cass table, as follows:

3. INSERT INTO test01_cass VALUES

4. ('rowkey10', 'data1-value', 123456),

5. ('rowkey11', 'data1-value2', 34543),

6. ('rowkey12', 'data1-value3', 444),

7. ('rowkey13', 'data1-value4', 777666555);

8. Fill the empty notes_cass table with data from the isfdb.notes table using the following command:

9. INSERT INTO notes_cass SELECT * FROM isfdb.notes;

10. Update the test01_cass data2 value for rowkey12 to 444, using the following command:

11.UPDATE test01_cass SET data2=454 WHERE pk='rowkey12';

12. Delete a row from the test01_cass table using the following command:

13.DELETE FROM test01_cass WHERE pk = 'rowkey13';

14. Exit the mysql command-line client, launch the cqlsh client, and run the following commands:

15.USE casstest;

16.SELECT * FROM test01;

How it works...

As long as proper care has been taken when creating the Cassandra storage engine tables in MariaDB so that the data types are compatible (if not equal), the INSERT and UPDATE operations appear to work as we would expect. We can even perform the INSERT INTO ... SELECT FROM ... operations to move data from an InnoDB, Aria, or other standard MariaDB table into a Cassandra storage engine table and vice versa.

However, inserting values into Cassandra tables actually function as INSERT or UPDATE style statements. This is because of the way in which Cassandra's data model works. Rows can and will be silently overwritten if, for example, the primary keys match. This is how Cassandra is supposed to work, so it is not an error. We just need to be aware of it when using the Cassandra storage engine.

In the last step of the recipe, we switch over to the cqlsh client to show that the data we added is in our Cassandra database. The output of this step will look similar to the following screenshot:

How it works...

There's more...

Cassandra has a feature that allows individual rows to have their own sets of columns. These columns can be accessed using MariaDB's dynamic columns feature. To do so, when we define our Cassandra storage engine table, we just need to define a BLOB column with the DYNAMIC_COLUMN_STORAGE=yes attribute. Refer to the recipes related to dynamic columns in Chapter 10, Exploring Dynamic and Virtual Columns in MariaDB for more information.

See also

· The full documentation of the Cassandra storage engine is available at https://mariadb.com/kb/en/cassandra-storage-engine/

· The full documentation of Cassandra is available at http://cassandra.apache.org/

Using SELECT with the Cassandra storage engine

As with the previous recipe, the SELECT statements are much the same when using the Cassandra storage engine tables.

Getting ready

First, we need to complete the Using INSERT, UPDATE, and DELETE with the Cassandra storage engine recipe.

How to do it...

1. Open the mysql command-line client and connect to our MariaDB database server and the test database.

2. Select everything from the test01_cass table using the following command:

3. SELECT * FROM test01_cass;

4. Select ten rows from the notes_cass table using the following command:

5. SELECT * FROM notes_cass LIMIT 10;

6. Select data with multiple WHERE clauses, an ORDER BY clause, and a LIMIT clause using the following commands:

7. SELECT * FROM notes_cass

8. WHERE note_note IS NOT NULL

9. AND note_id < 500

10. AND LENGTH(note_note) < 30

11. ORDER BY note_id DESC

12. LIMIT 10;

13. Join the notes_cass table in the test database to the publishers table in the isfdb database with some WHERE clauses and a LIMIT clause, using the following commands:

14.SELECT publisher_name,publisher_wikipedia,note_note

15. FROM isfdb.publishers INNER JOIN notes_cass

16. USING (note_id)

17. WHERE note_note IS NOT NULL

18. AND publisher_wikipedia IS NOT NULL

19. AND LENGTH(note_note) < 30

20. AND LENGTH(publisher_wikipedia) < 40

21. LIMIT 10;

How it works...

The SELECT statements for tables that use the Cassandra storage engine are much the same as the other SELECT statements. The main difference is that when the query is actually run, the Cassandra storage engine connects to a Cassandra cluster to fetch the data we are asking for instead of to a regular table on the local filesystem. We can use the LIMIT, WHERE, and other clauses to refine our results, and even join our data to other tables, just as if it was a regular MariaDB table.

The SELECT statement from step 5 of our recipe will look something like the following screenshot (the actual results may be different depending on the version of the isfdb database you are using):

How it works...

There's more...

A big problem with NoSQL databases such as Cassandra is that they simply do not have easy ways to do relational-database-style things such as JOINs. This is a big reason why the Cassandra storage engine was created. Using the Cassandra storage engine lets us not only perform JOINs between the data stored in MariaDB and the data stored in a Cassandra cluster, but it also enables us to do so between two or more Cassandra clusters, keyspaces, or column families.

That said, the Cassandra storage engine is not really suitable for running analytics-type queries that sift through large amounts of data stored in a Cassandra cluster. There are plenty of excellent tools on the Cassandra side (such as Apache Hive or Apache Pig), which are designed for just those sorts of things. The Cassandra storage engine is merely an easy-to-use, convenient window from a SQL environment (MariaDB) into a NoSQL environment (Cassandra).

We should also be careful with the complex SELECT statements. A query that does a full table scan, for example, may work fine when all of the tables use the InnoDB or MyISAM storage engines, but they can take forever when a Cassandra storage engine table is included (for example, the complex SELECT statement from the Using SHOW EXPLAIN with running queries recipe in Chapter 2, Diving Deep into MariaDB).

See also

· The full documentation of the Cassandra storage engine is available at https://mariadb.com/kb/en/cassandra-storage-engine/

· The full documentation of Cassandra is available at http://cassandra.apache.org/

· More details on how the Cassandra storage engine handles JOINs can be found at https://mariadb.com/kb/en/how-are-joins-handled-with-cassandra/